Java Thread Leak: Finding the Root Cause and Fixing It

10.10.2025

A story about how a single unclosed ExecutorService in Java caused native memory leaks and JVM crashes. And how it was fixed.

Tags: bugs, java

🧵

Introduction

The client was a large US e-commerce and healthcare company (sorry, NDA, so I won’t share the name).
The stability of the module under test was crucial for key business processes, so the goal of the test was simple: make sure the system can handle long, heavy load without degradation.

In practice, things turned out to be a bit more interesting. During a performance run one of the services in Kubernetes started crashing from time to time with a

fatal Java Runtime Environment error

Kubernetes kept restarting it, but after a while, the crash happened again.

At first, it looked like a classic memory leak, but the heap was clean. The real problem hid deeper — in native memory and threads.

Contents

  • Test Context
  • First Analysis and Hypotheses
  • Finding the Cause
  • The Fix
  • Verification
  • Takeaways

Test Context

Platform: Kubernetes
Language: Java 17
Service: pricing component (🔒 name hidden)
Test type: Capacity
Goal: gradually increase the load to find the capacity point, the moment when the system reaches its performance peak before degrading

First Analysis and Hypotheses

CPU usage grew slowly and linearly, but memory kept growing until it hit the limit every time.

(Dynatrace screenshot)

To rule out external factors, I ran the same service locally in Docker Desktop and launched the same test again.

The result was identical: memory grew steadily, CPU remained stable. So the service wasn’t overloaded with computations — memory was leaking.

(Docker Desktop screenshot)

Alright, so it wasn’t the infrastructure — it was native memory. And the JVM logs confirmed that:

A fatal error has been detected by the Java Runtime Environment:
Native memory allocation (mprotect) failed to map memory for guard stack pages
Attempt to protect stack guard pages failed
os::commit_memory(...): error='Not enough space' (errno=12)

In plain English:

There were too many threads, and the OS simply ran out of space to allocate stack memory for new ones. Every Java thread gets its own native stack outside the heap, which is why a thread leak shows up as a native memory leak rather than a heap one.
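
To see the mechanism in isolation, here is a minimal toy reproduction (not the project’s code; the pool size and iteration count are made up): each loop iteration creates a fixed thread pool and never shuts it down, so the live thread count keeps climbing.

import java.lang.management.ManagementFactory;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLeakDemo {
    public static void main(String[] args) {
        for (int i = 1; i <= 10_000; i++) {
            // a new pool per "request" that is never shut down; its core thread stays alive
            ExecutorService executor = Executors.newFixedThreadPool(4);
            executor.submit(() -> { });
            if (i % 100 == 0) {
                System.out.println("pools created: " + i + ", live threads: "
                        + ManagementFactory.getThreadMXBean().getThreadCount());
            }
        }
    }
}

Run it long enough (or with a low OS thread limit) and thread creation eventually fails, which is exactly the class of error shown in the log above.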


Finding the Cause

Looking through the code quickly led to a suspicious part: inside the enrich() method, a new ExecutorService was created on every call.

It looked neat — CompletableFuture, async calls, familiar pattern. But the thread pool was never closed.

❌ Before Fix

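// a new pool is created on every call to enrich() and is never shut down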
ExecutorService executor = Executors.newFixedThreadPool(WORKERS);
CompletableFuture<Void> prices = CompletableFuture.runAsync(
    () -> enrichPrices(context, productIds), executor);
CompletableFuture<Void> products = CompletableFuture.runAsync(
    () -> enrichProducts(context, productIds), executor);
joinAndRethrow(prices, products);

Each request created a new thread pool. The threads stayed alive and gradually filled up the memory until the JVM hit its limit.
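
A quick way to confirm this kind of leak is to look at thread names: the default factory used by Executors.newFixedThreadPool names its threads pool-N-thread-M, so a thread dump (for example, jcmd <pid> Thread.print) full of pool-* entries is a strong signal. The same counting logic can run inside the service itself, say on a debug thread or endpoint; here is a minimal standalone sketch of it, with the logging interval picked arbitrarily:

import java.util.concurrent.TimeUnit;

public class PoolThreadCounter {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // count live threads created by default executor thread factories
            long poolThreads = Thread.getAllStackTraces().keySet().stream()
                    .filter(t -> t.getName().startsWith("pool-"))
                    .count();
            System.out.println("live pool-* threads: " + poolThreads);
            TimeUnit.SECONDS.sleep(10);
        }
    }
}

In the leaking version this number grows with every request; after the fix it should stay flat.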


The Fix

The fix was simple — add a try/finally block and close the pool after work was done.

Why shutdown() and not shutdownNow()? Because shutdown() lets already submitted tasks finish gracefully, while shutdownNow() interrupts running tasks and discards queued ones, which might lead to data loss. Reusing a single shared pool wasn’t necessary here: the tasks were short and independent.

✅ After Fix

ExecutorService executor = Executors.newFixedThreadPool(WORKERS);
try {
    CompletableFuture<Void> prices = CompletableFuture.runAsync(
        () -> enrichPrices(context, productIds), executor);
    CompletableFuture<Void> products = CompletableFuture.runAsync(
        () -> enrichProducts(context, productIds), executor);
    joinAndRethrow(prices, products);
} finally {
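    // release the pool's threads even if one of the enrichment tasks failed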
    executor.shutdown();
}


After the fix, the number of threads stabilized, memory stopped growing, and the JVM finally calmed down.


Verification

A rerun confirmed the result:

✅ memory leveled off;
✅ thread count stopped increasing;
✅ no more JVM errors.

The service became predictable and stable, even under long load.


Takeaways

This story is a reminder that sometimes the issue hides in the most obvious place. One unclosed ExecutorService and your system quietly leaks memory until it falls over.

💡 What to remember:

  • JVM errors related to native memory often point to thread leaks, not the heap.
  • Every ExecutorService should have a clear lifecycle: create → use → close (a small helper pattern is sketched after this list).
  • Capacity and long-run tests help catch these issues long before production.
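
One way to make that lifecycle hard to forget is to hide it behind a small helper, so callers can never obtain a pool without also releasing it. A minimal sketch, assuming the same fixed-pool setup as above (the Pools class and withFixedPool name are mine, not from the project):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

public final class Pools {
    private Pools() { }

    // create → use → close, guaranteed by the helper itself
    public static void withFixedPool(int workers, Consumer<ExecutorService> work) {
        ExecutorService executor = Executors.newFixedThreadPool(workers);
        try {
            work.accept(executor);
        } finally {
            executor.shutdown();
        }
    }
}

The enrich() method would then pass its CompletableFuture logic as the lambda. On JDK 19 and newer, ExecutorService also implements AutoCloseable, so try-with-resources achieves the same thing, but this service runs on Java 17.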