The dangers of setImmediate

How a minor change in the Event Loop phases has drastically changed the performance profile of Node.js


In Node.js development, seemingly minor changes in underlying libraries can profoundly impact application behavior. One such change occurred in libuv 1.45.0, where a performance optimization altered how Node.js applications handle concurrent requests under load. This change has left a few developers scratching their heads as their previously responsive servers began failing health checks and appearing unresponsive during high CPU usage.

The issue, documented in Node.js issue #57364, reveals a fascinating intersection of event loop mechanics, performance optimization, and real-world application patterns. What started as a bug report about unresponsive health checks evolved into a deep dive into the fundamental workings of Node.js's event loop and the unintended consequences of relying on implicit timing behaviors.

Special thanks to @SuperOleg39 for reporting this issue and providing detailed analysis that helped the Node.js community understand the implications of the libuv changes.

The Problem Emerges

The issue first manifested in a deceptively simple scenario: a Node.js HTTP server with CPU-intensive request handlers that used setImmediate() to yield control back to the event loop. This pattern, commonly employed in server-side rendering (SSR) applications and other CPU-heavy web services, was designed to prevent the event loop from being completely blocked during intense computations.

Here's the problematic code pattern:

import http from 'node:http';
import pLimit from 'p-limit';

function heavy() {
  const start = Date.now();
  while (Date.now() - start < 1000) {
    // Simulate 1 second of CPU-intensive work
  }
}

const limit = pLimit(5);

const server = http.createServer((req, res) => {
  if (req.url === '/readyz') {
    // Health check endpoint
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('OK');
    return;
  }

  limit(() => {
    return new Promise((resolve) => {
      setImmediate(() => {
        heavy(); // CPU-intensive work
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end('Response');
        resolve();
      });
    });
  });
});

server.listen(3000);

This pattern worked reasonably well in Node.js 18.x and earlier versions. The /readyz health check endpoint would consistently respond within 5 seconds, even under heavy load from tools like autocannon generating hundreds of concurrent requests. However, with the introduction of libuv 1.45.0 (which was later reverted in Node.js 18.18.2 but remained in Node.js 20+), the same health check began timing out after 20+ seconds.

The Root Cause: A Deep Dive into libuv Changes

The culprit behind this behavioral change lies in libuv PR #3927, which removed a timer phase from the event loop. This change was part of a broader effort to optimize libuv's performance by consolidating timer handling to occur only after the poll phase.

Understanding the Event Loop Transformation

To understand the impact, we need to examine how Node.js's event loop operates and how this change affected the timing of various operations.

Before libuv 1.45.0: The event loop had multiple phases, including a dedicated timer phase that could run before the poll phase (at the cost of a tiny overhead). When an HTTP request arrived:

  1. The incoming socket connection triggered an I/O event

  2. The incoming request data triggered another I/O event in the same poll phase

  3. Immediates were run, e.g. setImmediate()

  4. Timers were run

  5. Timers were run again in the next loop iteration

After libuv 1.45.0: The timer phase removal changed this behavior:

  1. The same two I/O events (socket and data) now occur in separate poll phases

  2. The helpful delays previously created by timers are eliminated

  3. Applications using setImmediate() can more effectively monopolize the event loop

This seemingly minor optimization unintentionally made the event loop less "fair" in distributing processing time among different types of operations.

The Starvation Mechanism Explained

The key to understanding this issue lies in libuv's protection against immediate callback starvation. The event loop includes safeguards to prevent setImmediate() callbacks from completely blocking other operations. However, these safeguards have a specific threshold.

Looking at the libuv source code, we can see the protection mechanism in action:

    uv__io_poll(loop, timeout);

    /* Process immediate callbacks (e.g. write_cb) a small fixed number of
     * times to avoid loop starvation.*/
    for (r = 0; r < 8 && !uv__queue_empty(&loop->pending_queue); r++)
      uv__run_pending(loop);

The event loop will process up to 8 immediate callbacks before yielding control to other operations. Applications using just 2-4 setImmediate() calls (which was common in the problematic pattern) were well below this threshold, allowing them to effectively starve the event loop.
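The batching effect behind this starvation can be observed directly. The following is a minimal sketch (the 50ms/100ms durations are illustrative, not from the original issue): because every setImmediate() callback queued before the check phase runs in a single pass, a batch of CPU-heavy immediates delays a short timer by the batch's combined duration.

```javascript
// Sketch: all setImmediate() callbacks queued before the check phase run
// in a single pass, so a batch of CPU-heavy immediates delays a short
// timer by the batch's combined duration.
function measureTimerDelay() {
  return new Promise((resolve) => {
    const start = Date.now();

    // This timer is due after 50ms, but can only fire in a timers phase,
    // which comes after the whole immediate batch has run.
    setTimeout(() => resolve(Date.now() - start), 50);

    for (let i = 0; i < 5; i++) {
      setImmediate(() => {
        const t = Date.now();
        while (Date.now() - t < 100) { /* burn ~100ms of CPU */ }
      });
    }
  });
}

measureTimerDelay().then((delay) => {
  console.log(`50ms timer actually fired after ~${delay}ms`);
});
```

On an idle process the timer reports roughly the batch duration (around 500ms here) rather than 50ms, which is exactly the mechanism that delayed the health checks.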

Real-World Impact Stories

Server-Side Rendering Applications

The change particularly affected SSR applications, which often use patterns like this:

// Common pattern in SSR frameworks
app.use((req, res, next) => {
  setImmediate(() => {
    // Render React/Vue components
    const html = renderToString(App);
    res.send(html);
  });
});

Teams using this technique found their applications becoming unresponsive during traffic spikes. Their custom request limiter relied on the old timing behavior and could no longer maintain responsiveness.

Failing health checks

Applications that previously handled load testing scenarios gracefully began failing basic health checks. A typical load test might generate 100+ concurrent requests, each triggering a setImmediate() callback. Under the old behavior, health checks could slip through between these callbacks. With the new behavior, they became starved out.

A better way to frame the problem is that the application was already operating in the "danger zone" and merely "faked" responsiveness while the event loop was, in fact, extremely busy. In such cases, it is much better for the health checks to fail, allowing the infrastructure to scale up appropriately.

Technical Deep Dive: Event Loop Mechanics

To fully appreciate the implications of this change, let's examine the event loop phases and how they interact:

The Traditional Event Loop Phases

  1. Timer Phase: Execute callbacks scheduled by setTimeout() and setInterval()

  2. Pending Callbacks Phase: Execute I/O callbacks deferred to the next loop iteration

  3. Poll Phase: Fetch new I/O events and execute I/O-related callbacks

  4. Check Phase: Execute setImmediate() callbacks

  5. Close Callbacks Phase: Execute close callbacks
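The ordering of these phases can be observed from inside an I/O callback. Because the check phase follows the poll phase directly, while the timers phase only comes around on the next loop iteration, a setImmediate() scheduled inside an I/O callback always runs before a 0ms setTimeout() scheduled at the same moment:

```javascript
import { readFile } from 'node:fs';

// Inside an I/O (poll phase) callback, the check phase runs next, so
// setImmediate() deterministically beats a 0ms timer; the timer must
// wait for the next iteration's timers phase.
const order = [];

function demoPhaseOrder() {
  return new Promise((resolve) => {
    readFile(new URL(import.meta.url), () => {
      setTimeout(() => {
        order.push('timeout');
        resolve(order);
      }, 0);
      setImmediate(() => order.push('immediate'));
    });
  });
}

demoPhaseOrder().then((o) => console.log(o.join(' -> ')));
// → immediate -> timeout
```

Outside an I/O callback (in the main script), the relative order of the two would instead be nondeterministic.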

The Impact of Timer Phase Removal

The removal of the separate timer phase meant that timers now execute only after the poll phase. This change had cascading effects:

  1. Reduced Interleaving: Previously, timers could create natural breakpoints in processing

  2. Increased Batching: I/O events that previously occurred in separate phases now batch together

  3. Amplified Starvation: Applications using moderate amounts of setImmediate() could more easily dominate the event loop

Debugging and Diagnosis Techniques

CPU Profiling Reveals the Truth

The original issue reporter provided CPU profiles that clearly showed the problem:

  • Node.js 18: Health check handlers appeared every 5 seconds as expected

  • Node.js 20: Health check handlers appeared only after 20+ seconds, with large gaps in the timeline

Connection Behavior Analysis

Interestingly, the issue was specific to non-keep-alive connections. When connections were kept alive, the problem disappeared. This provided a crucial clue about the underlying mechanism:

// This workaround "fixes" the issue
const response = await fetch('http://localhost:3000/readyz');
await response.text(); // Consume the response body so the keep-alive connection can be reused

Event Loop Lag Monitoring

Applications could diagnose the issue by monitoring event loop lag using Node.js's built-in monitoring API:

import { monitorEventLoopDelay } from 'node:perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const lagMs = histogram.mean / 1000000; // Convert nanoseconds to milliseconds
  console.log(`Event loop lag: ${lagMs.toFixed(2)}ms`);

  if (lagMs > 100) {
    console.warn('High event loop lag detected!');
  }

  histogram.reset();
}, 5000);

Applications experiencing the issue would show dramatically increased lag during load testing.

Event Loop Utilization Analysis

Node.js 14 introduced Event Loop Utilization (ELU) as a more sophisticated metric for understanding event loop health. This API provides deeper insights into how the event loop is being utilized:

import { performance } from 'node:perf_hooks';

function monitorEventLoopUtilization() {
  const elu1 = performance.eventLoopUtilization();

  setTimeout(() => {
    const elu2 = performance.eventLoopUtilization(elu1);
    console.log(`Event Loop Utilization: ${(elu2.utilization * 100).toFixed(2)}%`);
    console.log(`Active time: ${elu2.active}ms`);
    console.log(`Idle time: ${elu2.idle}ms`);

    if (elu2.utilization > 0.8) {
      console.warn('Event loop heavily utilized - potential performance issues');
    }
  }, 1000);
}

setInterval(monitorEventLoopUtilization, 5000);

Applications affected by the libuv change would show sustained high utilization (>80%) during load testing, indicating that the event loop was spending most of its time processing immediate callbacks rather than handling new I/O operations.

Workarounds

1. Proper Response Completion Handling

Instead of resolving immediately after sending a response, wait for the connection to close:

import http from 'node:http';
import pLimit from 'p-limit';

const limit = pLimit(5);

const server = http.createServer((req, res) => {
  limit(() => {
    return new Promise((resolve) => {
      setImmediate(() => {
        heavy();
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end('Response');
        res.on('close', resolve); // Wait for actual completion
      });
    });
  });
});

2. Respecting the Starvation Protection Limit

Use 8 or more setImmediate() calls to trigger libuv's protection mechanism:

function immediate8(callback, count = 0) {
  setImmediate(() => {
    count++;
    if (count < 8) {
      immediate8(callback, count);
    } else {
      callback();
    }
  });
}

// Usage
immediate8(() => {
  heavy();
  res.end('Response');
});
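If the surrounding code is promise-based, the same idea can be wrapped as an awaitable helper (the name yieldToEventLoop is ours, not from the issue):

```javascript
// Promise-based variant: chain at least 8 setImmediate() hops so that
// libuv's starvation protection kicks in and pending I/O gets serviced
// before the heavy work starts.
function yieldToEventLoop(hops = 8) {
  return new Promise((resolve) => {
    const hop = (remaining) => {
      if (remaining <= 0) {
        resolve();
      } else {
        setImmediate(() => hop(remaining - 1));
      }
    };
    hop(hops);
  });
}

// Usage in an async request handler:
// await yieldToEventLoop();
// heavy();
```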

3. Chunked Processing

Break CPU-intensive work into smaller chunks:

function processInChunks(data, chunkSize = 100) {
  return new Promise((resolve) => {
    let index = 0;

    function processChunk() {
      const endIndex = Math.min(index + chunkSize, data.length);

      // Process chunk
      for (let i = index; i < endIndex; i++) {
        processItem(data[i]);
      }

      index = endIndex;

      if (index < data.length) {
        setImmediate(processChunk);
      } else {
        resolve();
      }
    }

    processChunk();
  });
}

The Dangerous Pattern Exposed

This change didn't represent a philosophical shift in Node.js, but rather exposed a dangerous pattern that was problematic from the beginning. The setImmediate() technique used in request handlers was inherently risky and could lead to serious production issues.

Why This Pattern Was Always Dangerous

The practice of using setImmediate() to defer CPU-intensive work had several critical flaws:

  1. Memory Usage Spikes: Each deferred request consumed memory while waiting in the immediate queue, leading to potential memory exhaustion under load

  2. Garbage Collection Pressure: The accumulation of deferred callbacks created significant pressure on the garbage collector, increasing CPU load over time and further worsening the situation

  3. False Sense of Responsiveness: The pattern gave the illusion of responsiveness while actually making the situation worse by accepting more work than the server could handle

The Fastify Lesson

This is precisely why Fastify, one of the most performance-focused Node.js frameworks, removed automatic setImmediate() calls from their codebase. As referenced in Fastify PR #545, the team discovered that this pattern was counterproductive under extreme load conditions, causing more harm than good.

The Fastify team's analysis showed that while setImmediate() seemed to help with responsiveness in light load scenarios, it became a liability under pressure, leading to:

  • Increased memory consumption

  • Longer garbage collection pauses

  • Cascading failures under sustained load

Solutions

Load Shedding based on Event Loop Utilization / Lag

Modern Node.js applications should adopt these patterns:

// Use proper load shedding
import underPressure from 'under-pressure';
import Fastify from 'fastify';

const fastify = Fastify();

fastify.register(underPressure, {
  maxEventLoopDelay: 1000,
  maxHeapUsedBytes: 100000000,
  maxRssBytes: 100000000,
  retryAfter: 50,
  message: 'Service temporarily unavailable'
});
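For applications not using Fastify, the same idea can be hand-rolled with the built-in delay monitor. This is a sketch, not a production recipe: the 200ms cutoff and the port are illustrative choices.

```javascript
import { createServer } from 'node:http';
import { monitorEventLoopDelay } from 'node:perf_hooks';

// Hand-rolled load shedding: reject new work with 503 when mean event
// loop delay exceeds a threshold, instead of queueing it.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

const MAX_DELAY_MS = 200; // illustrative cutoff

const server = createServer((req, res) => {
  const delayMs = histogram.mean / 1e6; // nanoseconds -> milliseconds
  if (delayMs > MAX_DELAY_MS) {
    res.writeHead(503, { 'Retry-After': '5' });
    res.end('Service temporarily unavailable');
    return;
  }
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('OK');
});

server.listen(3001);
```

Shedding load early keeps the event loop responsive and gives the infrastructure an unambiguous signal to back off or scale out.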

Processing in Worker Threads

// Implement worker threads for CPU-intensive work
import { Worker, isMainThread, parentPort } from 'node:worker_threads';
import { fileURLToPath } from 'node:url';

if (isMainThread) {
  // ESM has no __filename; derive it from import.meta.url
  const worker = new Worker(fileURLToPath(import.meta.url));
  worker.postMessage({ data: heavyWorkData });
  worker.on('message', (result) => {
    res.json(result); // respond with the result computed off the main thread
  });
} else {
  parentPort.on('message', ({ data }) => {
    const result = performHeavyWork(data); // CPU-intensive work runs here
    parentPort.postMessage(result);
  });
}

Performance Implications and Monitoring

Before and After Metrics

Applications affected by this change typically saw:

  • Throughput: Decreased by 20-40% under extreme load

  • Latency: P95 response times increased by 2-5x

  • Error Rates: Health check failures increased from <1% to >10%

  • Resource Utilization: CPU utilization became more "bursty"

Monitoring Strategies

Effective monitoring for this issue includes:

// Event loop lag monitoring using Node.js built-in API
import { monitorEventLoopDelay } from 'node:perf_hooks';
import { createServer } from 'node:http';

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const currentLag = histogram.mean / 1000000; // Convert nanoseconds to milliseconds
  console.log(`Event loop lag: ${currentLag.toFixed(2)}ms`);

  // Log percentiles for detailed analysis
  console.log(`P50: ${(histogram.percentile(50) / 1000000).toFixed(2)}ms`);
  console.log(`P95: ${(histogram.percentile(95) / 1000000).toFixed(2)}ms`);
  console.log(`P99: ${(histogram.percentile(99) / 1000000).toFixed(2)}ms`);

  if (currentLag > 100) {
    console.warn('High event loop lag detected!');
  }

  histogram.reset();
}, 5000);

// Request queue monitoring
const maxConcurrentRequests = 100; // tune to your workload
let activeRequests = 0;
let queuedRequests = 0;

const server = createServer();
server.on('request', (req, res) => {
  queuedRequests++;

  const processRequest = () => {
    activeRequests++;
    queuedRequests--;

    // Process request
    res.on('finish', () => {
      activeRequests--;
    });
  };

  if (activeRequests < maxConcurrentRequests) {
    processRequest();
  } else {
    // Queue or reject the request (e.g. respond 503 with Retry-After)
  }
});

server.listen(3000);

Lessons Learned and Best Practices

Key Takeaways

  1. Don't Rely on Implicit Timing: Applications should not depend on specific event loop timing behaviors

  2. Explicit Load Management: Implement explicit load shedding and rate limiting

  3. Proper Health Check Isolation: Ensure health checks can't be starved by application logic

  4. Monitor Event Loop Health: Continuously monitor event loop lag and responsiveness
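For takeaway 3, one pragmatic approach is to serve health checks from a dedicated worker thread, which has its own event loop. The sketch below uses an inline eval'd worker and port 8081 purely for illustration:

```javascript
import { Worker } from 'node:worker_threads';

// Run the health-check server on a worker thread with its own event loop,
// so CPU work on the main thread cannot starve it. The inline worker body
// and port 8081 are illustrative choices.
const healthServerCode = `
  const http = require('node:http');
  http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('OK');
  }).listen(8081);
`;

const healthWorker = new Worker(healthServerCode, { eval: true });

// The main thread can now run heavy handlers without failing probes;
// call healthWorker.terminate() on shutdown.
```

The load balancer probes port 8081 while application traffic stays on the main server, so a busy application event loop no longer takes the health endpoint down with it.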

How Watt Can Help

Watt, Platformatic's Node.js application server, provides a robust solution to the responsiveness challenges highlighted in this post. Its multi-threaded architecture ensures that critical endpoints like health checks (/metrics, /readiness, /liveness) remain consistently available, even when the main application thread is under heavy load. For more details on configuring readiness and liveness probes in Watt, see the Kubernetes deployment documentation.

Unlike traditional Node.js applications that rely on a single event loop, Watt's multi-threaded approach isolates these essential monitoring endpoints from the business logic of your application. This means that regardless of how busy your application's event loop becomes—whether due to CPU-intensive operations, high request volumes, or problematic setImmediate() usage patterns—your infrastructure monitoring and health check systems can always reach these endpoints.

This architectural advantage makes Watt particularly valuable for production environments where reliable health checks are critical for load balancers, orchestration systems, and monitoring tools. By ensuring these endpoints are always responsive, Watt helps prevent cascading failures and maintains system observability even under extreme load conditions.

Conclusion

The libuv 1.45.0 change exposed a dangerous anti-pattern that was always problematic but previously masked by event loop timing quirks. This serves as a reminder that performance optimizations can reveal hidden issues in application code.

The key lessons for developers:

  1. Avoid Anti-Patterns: Never use setImmediate() for load management

  2. Explicit Load Management: Implement proper load shedding and rate limiting

  3. Testing Across Versions: Thoroughly test applications across Node.js versions

For developers facing this issue, there are two paths forward. The traditional approach requires significant engineering effort: code auditing, architectural refactoring, and extensive testing. Alternatively, adopting Watt provides an effortless solution with no code changes required—its multi-threaded architecture automatically ensures reliable health checks regardless of application load.

If you're experiencing Node.js performance issues or need guidance on optimizing your applications, contact us for expert assistance.