Java Thread Leak: Finding the Root Cause and Fixing It
A story about how a single unclosed ExecutorService in Java caused native memory leaks and JVM crashes. And how it was fixed.
Introduction
The project was for a large e-commerce and healthcare client from the US (sorry, NDA, so no names).
The stability of the module under test was crucial for key business processes, so the goal of the test was simple: make sure the system could handle long, heavy load without degradation.
In practice, things turned out to be a bit more interesting. During a performance run, one of the services in Kubernetes started crashing from time to time with a fatal Java Runtime Environment error.
Kubernetes kept restarting it, but after a while the crash happened again.
At first, it looked like a classic memory leak, but the heap was clean. The real problem hid deeper — in native memory and threads.
Test Context
| Parameter | Value |
|---|---|
| Platform | Kubernetes |
| Language | Java 17 |
| Service | pricing component (🔒 name hidden) |
| Test type | Capacity |
| Goal | gradually increase the load to find the capacity point, i.e. the moment the system reaches its performance peak before degrading |
First Analysis and Hypotheses
CPU usage grew slowly and linearly, but memory kept growing until it hit the limit every time.

To rule out external factors, I ran the same service locally in Docker Desktop and launched the same test again.
The result was identical: memory grew steadily, CPU remained stable. So the service wasn’t overloaded with computations — memory was leaking.

Alright, so it wasn’t the infrastructure — it was native memory. And the JVM logs confirmed that:
```
A fatal error has been detected by the Java Runtime Environment:
Native memory allocation (mprotect) failed to map memory for guard stack pages
Attempt to protect stack guard pages failed
os::commit_memory(...): error='Not enough space' (errno=12)
```
In plain English:
There were too many threads, and the OS simply ran out of space to allocate stack memory for new ones.
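A quick way to confirm this kind of leak is to watch the live thread count over time. The snippet below is a minimal diagnostic sketch using the standard ThreadMXBean API; the class name ThreadCountProbe and the 5-second interval are illustration only, not part of the project's code. Under constant load, a count that keeps climbing points at a thread leak rather than the heap.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Minimal diagnostic sketch: print live and peak thread counts every few seconds.
// Under steady load, a number that only goes up is the signature of a thread leak.
public class ThreadCountProbe {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        while (true) {
            System.out.printf("live threads: %d, peak: %d%n",
                    threads.getThreadCount(), threads.getPeakThreadCount());
            Thread.sleep(5_000);
        }
    }
}
```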
Finding the Cause
Looking through the code quickly led to a suspicious part: inside the enrich() method, a new ExecutorService was created on every call.
It looked neat — CompletableFuture, async calls, familiar pattern. But the thread pool was never closed.
❌ Before Fix
```java
// A new pool is created on every call to enrich(), and nothing ever shuts it down
ExecutorService executor = Executors.newFixedThreadPool(WORKERS);
CompletableFuture<Void> prices = CompletableFuture.runAsync(
        () -> enrichPrices(context, productIds), executor);
CompletableFuture<Void> products = CompletableFuture.runAsync(
        () -> enrichProducts(context, productIds), executor);
joinAndRethrow(prices, products);
```
Each request created a new thread pool. Its worker threads stayed alive after the request finished, and their stacks gradually ate up native memory until the JVM hit the limit.
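The pattern is easy to reproduce in isolation. The toy program below is my own illustration (the class name ExecutorLeakDemo, the pool size of 4, and the 1,000 iterations are made up): each "request" creates a pool, runs one task, and never calls shutdown(), so the live thread count climbs with every iteration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy reproducer: one new fixed pool per "request", never shut down.
// The worker thread created for each task never terminates (core threads of a
// fixed pool do not time out), so thread count and native stack memory keep growing.
public class ExecutorLeakDemo {
    public static void main(String[] args) {
        for (int request = 1; request <= 1_000; request++) {
            ExecutorService executor = Executors.newFixedThreadPool(4);
            CompletableFuture<Void> task = CompletableFuture.runAsync(
                    () -> { /* pretend enrichment work */ }, executor);
            task.join();
            // executor.shutdown() is deliberately missing
            System.out.printf("request %d -> live threads: %d%n", request, Thread.activeCount());
        }
    }
}
```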
The Fix
The fix was simple — add a try/finally block and close the pool after work was done.
Why shutdown() and not shutdownNow()? Because shutdown() stops accepting new tasks but lets the ones already running finish gracefully, while shutdownNow() interrupts them mid-flight, which could leave work half-done. Reusing a single shared pool wasn't necessary here: the tasks were short and independent.
✅ After Fix
```java
ExecutorService executor = Executors.newFixedThreadPool(WORKERS);
try {
    CompletableFuture<Void> prices = CompletableFuture.runAsync(
            () -> enrichPrices(context, productIds), executor);
    CompletableFuture<Void> products = CompletableFuture.runAsync(
            () -> enrichProducts(context, productIds), executor);
    joinAndRethrow(prices, products);
} finally {
    executor.shutdown(); // always release the pool, even if enrichment throws
}
```
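For an extra safety net there is a common, more defensive variant: bound how long you wait for in-flight tasks and only then force the pool down. This is just a sketch, not what the service shipped; the helper name shutdownQuietly and the 30-second timeout are assumptions of mine.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: graceful shutdown first, then a bounded wait,
// then shutdownNow() so a hung task cannot keep the pool's threads alive forever.
final class ExecutorUtils {
    static void shutdownQuietly(ExecutorService executor) {
        executor.shutdown();                    // stop accepting new tasks, let running ones finish
        try {
            if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                executor.shutdownNow();         // interrupt whatever is still running
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
    }
}
```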

After the fix, the number of threads stabilized, memory stopped growing, and the JVM finally calmed down.
Verification
A rerun confirmed the result:
✅ memory leveled off;
✅ thread count stopped increasing;
✅ no more JVM errors.
The service became predictable and stable, even under long load.
Takeaways
This story is a reminder that sometimes the issue hides in the most obvious place. One unclosed ExecutorService and your system quietly leaks memory until it falls over.
💡 What to remember:
- JVM errors related to native memory often point to thread leaks, not the heap.
- Every ExecutorService should have a clear lifecycle: create → use → close (see the sketch below).
- Capacity and long-run tests help catch these issues long before production.
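On the lifecycle point: since Java 19, ExecutorService implements AutoCloseable, so on newer JDKs (this service was still on Java 17) try-with-resources can enforce the create → use → close cycle for you. A minimal sketch, with the class name and task body invented for illustration:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Requires Java 19+: executor.close() runs automatically at the end of the block,
// waits for submitted tasks, and shuts the pool down, so forgetting cleanup is impossible.
public class TryWithResourcesPool {
    public static void main(String[] args) {
        try (ExecutorService executor = Executors.newFixedThreadPool(2)) {
            executor.submit(() -> System.out.println("short, independent task"));
        }
    }
}
```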