10.02.26

Error rate comes too late

Error rate is one of those metrics that feels solid. When it is low, everyone relaxes. When it grows, everyone moves. I used to read it the same way, until I saw how late it can arrive.

In several systems I worked with, error rate stayed quiet while the system was already under serious stress. The most telling case for me was a cascading bottleneck, where errors appeared only at the very end, but the real damage started much earlier.


Errors appear when the system can no longer handle the load

An error is not the first sign of trouble. It is the moment when the system stops handling pressure.

Before that moment, things usually look acceptable. Responses are slower, but still successful. Queues grow, but not enough to alert anyone. Retries start to matter, but they are still invisible on high-level dashboards.
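One way to see this gap on a dashboard is to put a high latency percentile next to error rate. A toy sketch with synthetic numbers (nothing here comes from a real system):

```python
# Two windows of response times in ms, both with a 0% error rate.
# On an error-rate panel the windows look identical; a high latency
# percentile already shows the pressure in the second one.
import statistics

healthy = [40] * 95 + [60] * 5     # mostly fast, all successful
stressed = [40] * 70 + [400] * 30  # heavy tail, still all successful

def p95(samples: list) -> float:
    # statistics.quantiles with n=100 returns 99 cut points;
    # index 94 is the 95th percentile
    return statistics.quantiles(samples, n=100)[94]

print(p95(healthy), p95(stressed))
```

The point is not the exact numbers, but that the second window is already a different system while the error-rate panel shows nothing.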

In my case, one service in the middle of the flow became slow under load. It did not fail. It just responded later and later. That delay was enough to change how the rest of the system behaved.


Cascading failures grow quietly, then all at once

What made this situation hard to see was the cascade.

The slow service increased latency for the next service in the chain. That service started to build up concurrency. It still returned responses, but with more effort. The last service in the cascade was the first one that actually started returning errors.
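The concurrency build-up follows directly from Little's law: in-flight requests equal arrival rate times time in system. A rough sketch with made-up numbers:

```python
# Little's law: concurrency = arrival rate x time in system.
# When a downstream dependency slows down, the caller holds each request
# open longer, so its in-flight count grows even with zero errors.

def in_flight(rate_per_s: float, latency_s: float) -> float:
    return rate_per_s * latency_s

print(in_flight(200, 0.05))  # 10.0 concurrent requests at 50 ms latency
print(in_flight(200, 0.5))   # 100.0 when the downstream is ten times slower
```

The same traffic, answered ten times slower, ties up ten times the threads and connections. That is the "more effort" that never shows up as an error.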

At that moment, error rate finally moved. But that was not the worst part.

As soon as the last service started failing, the services before it reacted. They retried. From their point of view, this was the right thing to do. A request failed, so they tried again.

Those retries multiplied traffic. The same requests were sent again and again through already stressed services. Load grew not linearly, but multiplicatively. What started as one slow service turned into pressure on the entire system.
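That multiplication can be sketched with a back-of-the-envelope worst case (the numbers are illustrative, not from the incident): if every hop in a call chain retries each failed call, attempts compound at every hop.

```python
# Worst-case request amplification in a retry cascade.
# Each hop makes 1 original attempt plus `retries` retries; if every
# attempt fails, attempts reaching the deepest service compound per hop.

def worst_case_calls(chain_depth: int, retries: int) -> int:
    attempts_per_hop = 1 + retries
    return attempts_per_hop ** chain_depth

# Three services deep, two retries each: one user request can turn into
# 27 attempts hitting the already stressed service at the bottom.
print(worst_case_calls(chain_depth=3, retries=2))  # 27
```

Real systems rarely hit the full worst case, but the shape is the same: depth and retry count feed each other exactly when the system can least afford it.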

Very quickly, it was no longer about one service failing. Everything was overloaded.


Error rate confirms collapse, not its cause

From monitoring, this looked like a sudden incident. Error rate spiked across multiple services. It was tempting to think that the last service caused the outage.

But that was not true. The last service only exposed the problem. The real issue was the earlier slowdown and the retry behavior that amplified it.

By the time error rate reacted, the system was already distorted. Queues were full. Threads were busy. Retries were eating capacity. The original bottleneck was buried under symptoms.

Error rate told us that users were affected. It did not tell us why the system collapsed.


Calm dashboards can hide unstable systems

What stays with me from that case is how calm things looked just before the failure. Error rate was fine. Nothing was red. Yet the system was already unstable.

The cascade showed me that low error rate does not mean low risk. It can mean that the system is spending all its effort on hiding problems.

Once hiding stops working, failure is fast and wide.

But not every system behaves this way. Not every retry strategy creates exponential load. Some systems have strict backpressure and fail early.
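One shape such backpressure can take is a retry budget: retries draw on tokens earned by successful traffic, and once the budget is spent, callers fail fast instead of piling onto a stressed service. A minimal sketch, with all names and numbers invented for illustration rather than taken from any particular library:

```python
# A retry budget: retries are allowed only in proportion to recent
# successful traffic. When failures dominate, the budget runs out and
# callers fail early instead of amplifying load.

class RetryBudget:
    def __init__(self, ratio: float = 0.5):
        self.ratio = ratio   # retry tokens earned per successful request
        self.tokens = 0.0

    def record_success(self) -> None:
        self.tokens += self.ratio

    def can_retry(self) -> bool:
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False         # budget spent: fail fast, do not pile on

budget = RetryBudget()
for _ in range(100):         # healthy traffic earns retry allowance
    budget.record_success()

# A burst of failures can now retry at most 50 times, no matter how
# aggressively callers want to try again.
allowed = sum(budget.can_retry() for _ in range(100))
print(allowed)  # 50
```

A system built this way trades a few extra user-visible errors early for a bounded blast radius later, which is exactly the opposite of the cascade described above.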


Fixed position

I am not generalizing this experience to all architectures. I am fixing a boundary around how I read error rate. For me, it marks the moment when control is already lost.

If error rate is the first thing you notice, the system has already made decisions that are hard to undo. The important part happened earlier, quietly, while everything still looked fine.

That is why I read error rate as the end of the story, not the beginning.