14.01.26

When a Node.js service stopped scaling

At that time, we had a Node.js BFF service. It sat between the frontend and several backend services. Its role was simple: aggregate responses, apply some logic, return data.

Nothing about it looked fragile. Nothing suggested an early limit.

At some point, we ran load tests.


When the system stopped scaling

The service handled around 250 requests per second. Beyond that, response times increased quickly and errors started to appear.

What mattered was not the number itself.

CPU was not fully utilized.
Memory was stable.
Autoscaling behaved as expected.

From the outside, the system looked healthy. From the client side, it was clearly not.


What I did not rush to change

There were many obvious next steps.

Increase the number of pods.
Adjust resource limits.
Tune garbage collection.
Add caching.

All of these actions are familiar. They also change the system before its boundary is understood.

I paused, not because I knew the answer, but because I did not yet know the constraint.


Narrowing the question

Instead of asking how to improve the system, I focused on a narrower question.

What limits this service if traffic keeps growing?

Node.js executes JavaScript on a single main thread. This was not new information. What mattered was how much work passed through that thread.

The BFF handled incoming requests, aggregated multiple backend responses, parsed JSON, and executed logic that included synchronous steps. All of this work went through the same event loop.

Under enough concurrency, that loop became the bottleneck. Not because of an error. Not because of inefficient code. Because that was the execution model.
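As a rough illustration only, not the actual service code: the endpoints and response shaping below are invented, and global fetch assumes Node 18+. The point is that every parse and merge step in a handler like this runs on that one thread.

```js
// Hypothetical sketch of the aggregation pattern described above.
const http = require('http');

async function handleRequest(req, res) {
  // The network I/O itself is non-blocking, but every callback below
  // still executes on the single main thread.
  const [userBody, ordersBody] = await Promise.all([
    fetch('http://users-svc/api/user/42').then(r => r.text()),
    fetch('http://orders-svc/api/orders?user=42').then(r => r.text()),
  ]);

  // JSON.parse, merging, and any synchronous shaping occupy the event
  // loop; each slice is cheap, but under high concurrency they queue up.
  const payload = {
    user: JSON.parse(userBody),
    orders: JSON.parse(ordersBody).map(o => ({ id: o.id, total: o.total })),
  };

  res.setHeader('content-type', 'application/json');
  res.end(JSON.stringify(payload));
}

http.createServer((req, res) => {
  handleRequest(req, res).catch(() => {
    res.statusCode = 502;
    res.end();
  });
}).listen(3000);
```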


Where the limit actually was

There was no broken component. The system reached a limit that had existed from the beginning.

We assumed that scaling meant adding more pods. Inside each pod, however, the work was still serialized. One CPU core was busy, while the others were mostly idle. At around 250 requests per second, the event loop could no longer keep up. Requests started to queue. Latency grew. Errors followed.
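One way to see this during a load test (a verification sketch, not something taken from the original investigation) is to watch event loop delay with the built-in perf_hooks API. Rising delay percentiles next to mostly idle CPU point at the main thread as the queue.

```js
// Sketch: observing event loop delay with perf_hooks.
// If p99 delay climbs as load grows while overall CPU stays low,
// requests are waiting on the single main thread.
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const ms = (nanos) => (nanos / 1e6).toFixed(1);
  console.log(
    `event loop delay p50=${ms(histogram.percentile(50))}ms ` +
    `p99=${ms(histogram.percentile(99))}ms max=${ms(histogram.max)}ms`
  );
  histogram.reset();
}, 5000);
```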

Nothing unexpected happened. The system behaved according to its design.


The change that removed the bottleneck

Once the limit was clear, the change was straightforward. We introduced clustering and ran multiple Node.js workers per pod, aligned with the number of CPU cores. Each worker had its own event loop. The code did not meaningfully change. The logic stayed the same.
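A minimal sketch of that change, using the built-in cluster module; the actual service may have used a different mechanism, and './server' stands in for the real entry point.

```js
// One worker per CPU core, each with its own event loop.
const cluster = require('cluster');
const os = require('os');

if (cluster.isPrimary) {
  const workers = os.cpus().length;
  for (let i = 0; i < workers; i++) {
    cluster.fork();
  }
  // Replace a worker if it dies so capacity stays aligned with cores.
  cluster.on('exit', (worker, code, signal) => {
    console.log(`worker ${worker.process.pid} exited (${signal || code}), restarting`);
    cluster.fork();
  });
} else {
  // Each worker listens on the same port; the primary distributes
  // incoming connections across them.
  require('./server');
}
```

Tools like PM2's cluster mode achieve the same effect; the essential point is simply one event loop per core.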

What changed was how work was executed. After that, throughput went beyond 2000 requests per second, CPU usage became balanced, and response times stabilized.

No optimization was involved in the usual sense. The system was simply allowed to use the resources it already had.


Why I keep this case

I return to this case from time to time because it is quiet.

Nothing failed loudly.
Nothing crashed.
No metric directly pointed to the cause.

The limit was present from the start. Load testing did not create it. It only made it visible.

This is how many performance problems appear in real systems. They surface when earlier decisions meet real traffic, not when something suddenly breaks.

The fix took little time. Understanding where the system stopped scaling took longer. That difference is why I keep this case as a reference point for myself.