Ahead-of-Time Scaling: How Platformatic ICC Predicts and Provisions

Most Kubernetes autoscalers work the same way: they check a metric, compare it to a threshold, and then add pods. The issue is that by the time the metric crosses the threshold, the application is already overloaded. New pods take time to start and begin handling traffic (we're talking in the realm of 1-4 minutes here, and that's if you have the compute resources readily available), so during that period, the existing pods handle all the load. This means for 1-4 minutes (or upwards of 10 if your cluster is out of available compute), users end up seeing slow responses or errors, not because scaling failed, but because it happened too late. That's more than enough time to make a negative impact on your business (abandoned carts, logged-off streamers, the works).
The Horizontal Pod Autoscaler (HPA) is the most common scaler in Kubernetes. It runs a control loop that checks the current metric value, usually CPU usage, and calculates how many pods are needed based on the ratio of the current value to the target. For Node.js apps, CPU isn't a great metric: the event loop can be overloaded and queuing requests even when CPU usage looks normal. HPA doesn't support metrics like Event Loop Utilization (ELU) by default; you need a custom metrics setup with Prometheus and an adapter to use those.
KEDA addresses the metric issue by extending HPA with many event-driven triggers, like Prometheus queries, message queues, and HTTP request rates. This makes it easy to scale on ELU or other custom metrics. However, the underlying scaling logic is unchanged: each check is just a snapshot of the current value, with no sense of past trends or whether the metric is going up or down. KEDA gives you better metrics, but still uses the same reactive approach.
Reactive scaling is especially tough on Node.js apps for some of the reasons we've already mentioned. Namely, CPU utilization is not just a lagging indicator of event loop health (which is what really matters for Node.js); it sometimes doesn't correlate with ELU at all. The event loop runs JavaScript one callback at a time, and when it gets overloaded, performance doesn't just degrade gradually; it drops off sharply. Latency increases quickly until the app can barely make progress. The kicker is that all of this happens before the HPA even knows it needs to scale more pods.
Unfortunately, this is much more than a simple configuration issue. Lowering the threshold doesn't make the scaler react any faster; it just makes it respond to a lower value (say, lower CPU utilization). In practice, particularly for high-traffic Node.js apps, this means you'll always have more pods running than you need. (This adds up more than you might realize. We've seen some drastically over-provisioned clusters and scaling policies - up to 7 figures in excess cloud spend per major scaling event.)
So - what if instead of scaling reactively, you could scale… proactively? What might such a system look like?
Well, you can probably guess where we're going.
Platformatic ICC (“Intelligent Command Center”) takes a different approach. Rather than waiting for a metric to hit a threshold, ICC watches the load trend over time and predicts where it will be when a new pod is ready. If it looks like more capacity will be needed, ICC adds pods right away, so they're ready when the extra load arrives.
Benchmarks show a clear difference: with steady traffic increases, ICC kept median response latency at 26 ms, while KEDA reached 154 ms and HPA hit 522 ms:
| | ICC | KEDA | HPA |
| --- | --- | --- | --- |
| Success Rate | 99.47% | 95.11% | 90.97% |
| Avg. Latency | 167 ms | 1,174 ms | 1,499 ms |
| Median Latency | 26 ms | 154 ms | 522 ms |
| p(90) Latency | 317 ms | 3,530 ms | 4,168 ms |
| p(99) Latency | 1,970 ms | 10,001 ms | 10,001 ms |
| Errors | 718 | 6,591 | 12,039 |
Below we will cover:
Some basics on the event loop, latency, and the structural incompatibilities Node.js has with traditional scaling methods
How reactive scalers like the HPA and KEDA work
An overview of the ICC and its predictive scaling algorithm
Load testing and benchmark comparisons
Let’s dig in.
The Node.js event loop and the latency cliff
We’ll start with a quick review of some Node.js and JavaScript basics. The heart of Node.js is the event loop, which runs JavaScript callbacks one at a time on a single thread. It cycles through different phases, picking up ready callbacks and running them in order.
A typical HTTP request shows how this works. When a request comes in, the event loop runs the handler callback, which parses data, checks the input, and runs business logic. This part is synchronous, so nothing else can happen in the loop at the same time. If the handler needs to access a database or call an external API, Node.js hands off that work to the operating system or a background thread pool, and the event loop moves on to other callbacks. When the I/O finishes, a new callback is added to the queue, and the event loop picks it up later to finish processing, like reading the database result and sending the response.
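To make the synchronous/asynchronous split concrete, here is a minimal sketch of such a handler using Node's built-in http module (the route, query parameter, and fakeDbLookup helper are invented for illustration):

```typescript
import { createServer } from "node:http";

// Stand-in for a database or external API call: the await hands control
// back to the event loop until the "I/O" completes.
function fakeDbLookup(id: string): Promise<{ id: string; found: boolean }> {
  return new Promise((resolve) =>
    setTimeout(() => resolve({ id, found: true }), 20)
  );
}

const server = createServer(async (req, res) => {
  // Synchronous work: runs on the event loop; nothing else runs meanwhile.
  const url = new URL(req.url ?? "/", "http://localhost");
  const id = url.searchParams.get("id") ?? "unknown";

  // Asynchronous work: handed off, so the loop is free for other callbacks.
  const record = await fakeDbLookup(id);

  // Back on the event loop: serializing the response is synchronous again.
  res.setHeader("content-type", "application/json");
  res.end(JSON.stringify(record));
});

server.listen(3000);
```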
This design is what makes Node.js efficient. At any time, an app might have hundreds of requests in progress, but most are just waiting for I/O and not using the event loop. One thread can handle thousands of connections with little overhead, since there's no context switching or lock contention. This efficiency relies on the event loop having some idle time between callbacks.
As traffic grows, the synchronous parts of handling requests—like parsing bodies, serializing JSON, running business logic, or rendering server-side React—start to use up more of the idle time. While this code runs, nothing else can happen. Eventually, those idle gaps disappear.
Event Loop Utilization (ELU) measures this effect. It's a value from 0 to 1 that shows how much time the event loop spends running code versus being idle. An ELU of 0.5 means the loop is active half the time, while 0.9 means there's almost no idle time left.
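Node.js exposes ELU directly through perf_hooks. A quick sketch of sampling it on a one-second interval (the interval length is just an example):

```typescript
import { performance } from "node:perf_hooks";

let previous = performance.eventLoopUtilization();

setInterval(() => {
  // Passing the previous sample returns the utilization over just this window.
  const delta = performance.eventLoopUtilization(previous);
  console.log(`ELU over the last second: ${delta.utilization.toFixed(2)}`);
  previous = performance.eventLoopUtilization();
}, 1000);
```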
Trouble begins when ELU gets close to 1.0. Now, the loop has no idle time left, so every new request arrives while the previous one is still being processed. Callbacks start to pile up. With almost no idle gaps, even a small traffic increase can make wait times jump from milliseconds to seconds.
This is what we call the cliff. When ELU reaches 1.0, the app enters a feedback loop: the queue grows, each request takes longer because it waits behind more requests, and the loop stays saturated, making the queue grow even more. Response times don’t just increase linearly now, but hyperbolically. The app hasn't crashed, but it's no longer making real progress. Responses that used to take 30 ms now take 5 seconds or even hit the client timeout. You can see this in our interactive capacity model: adjust the processing time and traffic rate, and you'll notice response times stay low until about 70–80% utilization, then suddenly spike as the event loop gets saturated.
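You can get a rough feel for the cliff with a back-of-the-envelope queueing model. This is not ICC's capacity model, just the textbook single-server approximation where average response time is service time divided by (1 - utilization):

```typescript
// R = S / (1 - rho): average response time in a single-server queue,
// where S is the per-request service time and rho is utilization.
const serviceTimeMs = 30;

for (const rho of [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]) {
  const responseMs = serviceTimeMs / (1 - rho);
  console.log(`utilization ${rho} -> ~${responseMs.toFixed(0)} ms`);
}
// 0.5 -> 60 ms, 0.8 -> 150 ms, 0.95 -> 600 ms, 0.99 -> 3000 ms:
// nearly flat for a long time, then a sharp spike as utilization nears 1.
```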
That's why waiting to scale up a Node.js app until after ELU crosses the threshold is so harmful. By the time HPA or KEDA notices and adds pods, the event loop has already gone over the cliff. The queue grows faster than the loop can handle, and every new request just adds to the problem. The pod can't recover on its own while traffic stays high, and it will stay stuck in this feedback loop for the 1 to 4 minutes it takes new pods to start and take on traffic.
To put it another way, the HPA can get away with using pod CPU utilization for most runtimes (Java, .NET, etc.) because CPU utilization is a fairly accurate representation of how loaded the app is, so scaling on that metric makes sense. (Again, this is still reactive, but at least it's reacting to a reasonable metric.)
That correlation between CPU and application load doesn’t exist with Node.js. This compounds the problem with reactive scaling for Node.js apps in particular because you are scaling on a metric that doesn’t strongly correlate to your application's actual load.
How reactive scalers work (and why they're always late)
To see why predictive scaling is important, let's look at how the most popular Kubernetes scalers actually make scaling decisions.
HPA: one number, one formula
The Kubernetes Horizontal Pod Autoscaler (HPA) runs a control loop every 15 seconds. On each cycle, it fetches the current metric value from the Metrics Server (typically CPU utilization) and computes the desired replica count with a single formula:
desiredReplicas = ceil(currentReplicas × (currentValue / targetValue))
For example, if 4 pods are running at 90% CPU with a 70% target:
desiredReplicas = ceil(4 × (90 / 70)) = ceil(5.14) = 6
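In code, the whole decision fits in one function. A sketch of the published formula (not the actual HPA source):

```typescript
// desiredReplicas = ceil(currentReplicas * (currentValue / targetValue))
function desiredReplicas(
  currentReplicas: number,
  currentValue: number,
  targetValue: number
): number {
  return Math.ceil(currentReplicas * (currentValue / targetValue));
}

// 4 pods at 90% CPU against a 70% target:
console.log(desiredReplicas(4, 90, 70)); // 6
```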
Why HPA is the wrong choice for Node.js
HPA uses CPU utilization by default, and most teams stick with that. But for Node.js apps, this isn't a good match. As mentioned earlier, Event Loop Utilization (ELU) is the metric that really shows if a Node.js app is overloaded, and CPU doesn't reflect that. The event loop can be maxed out while CPU usage looks normal, or the other way around.
HPA doesn't support ELU by default. It works with the Metrics Server, which only provides CPU and memory. To scale on ELU, you need to set up a custom metrics pipeline using Prometheus, a Prometheus adapter, and a custom metric query. (Yes - we can help with that!)
KEDA: right metric, same logic
KEDA builds on HPA by adding many event-driven triggers, like message queues, HTTP request rates, Prometheus queries, and more. This makes it easy to scale on custom metrics like ELU without having to build a full custom metrics pipeline.
But the scaling logic doesn't change. When scaling from 1 to N replicas, KEDA creates an HPA object behind the scenes and gives it the external metric value. It uses the same formula and snapshot-based checks. KEDA gives you better metrics, but the way it decides when to scale is still the same. (Different data, same algorithm.)
By default, KEDA checks metrics every 30 seconds, twice HPA's default interval. In our benchmarks, we set it to 15 seconds to match HPA and make the comparison fair.
The core limitations
Even if you use the right metric, the reactive approach has basic limitations that can't be fixed by changing settings.
The startup gap. After a reactive scaler decides to add pods, there is a delay before those pods are useful:
The scaler detects the threshold has been crossed (up to 15–30s depending on polling interval)
Kubernetes schedules the new pod and pulls the container image
The application starts and initializes
The readiness probe passes
The load balancer begins routing traffic to the new pod
In real enterprise settings, this can take anywhere from 1 minute to upwards of 4 minutes. During that time, the existing pods handle all the traffic. By the time new pods are ready, the app might have already spent over 2 minutes overloaded. For Node.js apps, this is when performance drops off sharply.
The saturation cap. When the metric has a natural cap (like ELU, which maxes out at 1.0), the scaler loses visibility into the actual load. A pod at ELU 1.0 could need one more pod or ten more, but the formula sees the same number either way. The true load is hidden behind the cap. This forces the scaler into a staircase pattern: it adds pods based on what it can see, waits for the new pods to also become saturated, and only then realizes more are needed, which compounds the problem. Each step requires a full cycle of pod startup and saturation before the next decision can be made. The scaler cannot reach the right pod count in one step because it never sees the right number, leading to spiraling performance problems.
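You can see the staircase directly in the numbers. Once the metric is pinned at its cap, the ratio never changes, so every cycle adds the same fixed fraction of pods no matter how much demand is hiding behind the cap. A sketch, reusing the HPA formula from earlier:

```typescript
// With ELU clipped at 1.0 and a 0.7 target, the ratio is always 1.0 / 0.7,
// whether the true load needs 2x or 10x the current capacity.
let replicas = 4;
for (let cycle = 1; cycle <= 4; cycle++) {
  replicas = Math.ceil(replicas * (1.0 / 0.7));
  console.log(`cycle ${cycle}: ${replicas} replicas`);
}
// cycle 1: 6, cycle 2: 9, cycle 3: 13, cycle 4: 19 - and each step only
// happens after the previous pods have started and saturated in turn.
```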
Redistribution problem. Every time HPA or KEDA adds a pod, it creates a temporary distortion in the metrics. The new pod starts receiving traffic immediately, but the existing pods don't shed their load at the same pace: queues take time to drain, in-flight requests must complete, and garbage collection needs to settle. During this transition, the new pod's metric is rising while the old pods' metrics haven't dropped yet. The scaler sees the sum go up and interprets it as growing demand, when it's actually just the overlap of old and new pods, both holding load at the same time. This can lead the scaler to add pods that aren't needed. ICC handles this with a dedicated redistribution stage that gradually includes new pods' metrics over time, filters out the artificial drop from load shedding, and still lets real load increases pass through immediately, no cooldown required (see the algorithm whitepaper for details).
All these issues come from the same cause: traditional scalers like the HPA or KEDA look at scaling metrics in isolation, without the context of whether the metric has been trending up or down over time. Each check is treated separately: the scaler looks at one value, compares it to a target, and acts, without knowing whether the metric is rising, falling, or holding steady. It also can't account for the delay between making a decision and new capacity actually becoming available, nor for how that extra capacity will affect the very metric it's measuring to make its scaling decisions.
Intelligent Command Center
Platformatic Intelligent Command Center (ICC) is a cloud control plane that provides intelligent management, monitoring, and optimization of Node.js applications deployed in Kubernetes. Applications run on Platformatic Watt, the Platformatic runtime for high-performance Node.js apps. A single Watt instance can host multiple Node.js applications, each in its own worker thread within the same process. In a Kubernetes deployment, each pod runs one Watt instance.
A companion module, @platformatic/watt-extra, runs alongside Watt in each pod. It collects runtime metrics (including ELU and heap usage) and sends them to ICC, which uses them to make scaling decisions.
Data flow in ICC. Each pod runs a Watt instance hosting one or more applications. Watt measures per-application metrics (like ELU); Watt-Extra collects them into batches and sends them to ICC. ICC runs the algorithm pipeline and updates the Kubernetes Deployment replica count.
How ICC's predictive scaling works
The idea
A reactive scaler asks: "Is the application overloaded right now?" and acts on the answer. By the time new pods are ready, the answer has changed, usually for the worse.
ICC takes a different approach and asks, "Will the app be overloaded by the time a new pod is ready?" If so, it scales up right away, so the extra capacity is available when needed. This shifts scaling from reacting to what's happening now to acting based on a forecast.
To build that forecast, ICC tracks the load trend over time: not just the current value, but whether it is rising, falling, or stable, and how fast. It extrapolates the trend forward by the time it takes a new pod to start and begin serving traffic. If the projected load exceeds the capacity of the current pod count, ICC adds pods immediately. The full details of the algorithm are described in the algorithm whitepaper.
The chart shows ELU per pod over the last 20 seconds. The solid line (Mt) has been rising steadily. Right now it's at 0.73 (Mnow), just below the 0.75 threshold (dashed red line). HPA or KEDA would look at this value, see that it hasn't crossed the threshold, and do nothing. ICC sees the trend and projects that by the time a new pod would be ready (the prediction horizon H), the metric will reach 0.78 (Mh), above the threshold. So it scales up now, before the overload begins.
The rest of this section explains how the algorithm builds this prediction.
The core idea. The algorithm takes per-pod metric values (like ELU on each pod), combines them into a single cluster-wide number, predicts where that number is heading, and converts the prediction back into a per-pod value to compare against the threshold. This aggregate-predict-project flow is the backbone of the algorithm.
Why predict on an aggregate? Per-pod metrics change for two reasons: external traffic changes and the scaler's own actions. When the scaler adds a pod, the load balancer starts routing traffic to it, and ELU on the existing pods drops, even though external traffic hasn't changed at all. If the algorithm predicted the trend from per-pod ELU, it would see this drop as "load is decreasing" and might delay further scaling when it's actually needed. The algorithm avoids this by summing ELU across all pods into a cluster-wide aggregate. When a pod is added, and load redistributes, individual ELU values shift, but the total stays approximately the same (the same total work is spread across more pods). The aggregate reflects external traffic changes without being distorted by scaling actions, giving the algorithm a stable signal to predict from.
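A quick numeric sketch of why the aggregate is the more stable signal (the numbers are invented for illustration):

```typescript
// The cluster-wide aggregate is just the sum of per-pod ELU.
const aggregate = (elus: number[]) => elus.reduce((sum, elu) => sum + elu, 0);

// Before a scale-up: 4 pods near saturation.
console.log(aggregate([0.9, 0.9, 0.9, 0.9])); // 3.6

// After 2 pods are added and the load balancer spreads the same traffic,
// every per-pod value drops, but the total work is roughly unchanged.
console.log(aggregate([0.6, 0.6, 0.6, 0.6, 0.6, 0.6])); // 3.6
```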
Cleaning the data. Raw metric data is not ready for prediction. Pods send measurements in batches at different times, so at any given moment, some pods have reported recent data and others haven't. After a scale-up, new pods create temporary distortions in the metrics. Three preprocessing stages handle this before the aggregate reaches the prediction stage.
Alignment places irregularly-timed samples onto a uniform time grid (e.g., one tick per second) by interpolation, so values from different pods can be compared at the same points in time.
Imputation estimates values for pods that haven't been reported yet. At each tick, the algorithm takes the previous total, subtracts the previous values of pods that have now reported new data, and uses the remainder as the estimated contribution of the pods still missing. When a late batch arrives, the estimates are replaced by real data and the totals are recomputed.
Redistribution smooths out the metric distortion after a scale-up. New pods' values are included gradually (their contribution ramps from zero to full over a stabilization period) rather than appearing all at once. At the same time, the artificial drop on existing pods as they shed load is absorbed: the algorithm allows the aggregate to rise (to catch real traffic increases) but prevents it from dropping while new pods are still stabilizing. This way, redistribution artifacts are filtered out, but real load changes pass through immediately.
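Of the three stages, imputation is the easiest to sketch in code. A simplified illustration of the idea described above (not ICC's implementation):

```typescript
// previousTotal: the aggregate at the last tick
// previousPerPod: each pod's value at the last tick
// reported: fresh samples that arrived for this tick (pod id -> ELU)
function imputeAggregate(
  reported: Map<string, number>,
  previousTotal: number,
  previousPerPod: Map<string, number>
): number {
  let freshSum = 0;    // sum of the samples that did arrive
  let replacedSum = 0; // what those same pods contributed last tick
  for (const [pod, value] of reported) {
    freshSum += value;
    replacedSum += previousPerPod.get(pod) ?? 0;
  }
  // Remainder = estimated contribution of the pods that haven't reported yet.
  const missingEstimate = previousTotal - replacedSum;
  return freshSum + missingEstimate;
}
```

When the late batches eventually arrive, the estimated ticks are recomputed with the real values, as described above.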
Predicting the trend. The cleaned aggregate enters the prediction stage, which uses Holt's double exponential smoothing. This method maintains two values at each tick: the level (a smoothed estimate of where the aggregate is now) and the trend (how fast the aggregate is changing). Each new data point updates both. The level tracks the signal while filtering single-tick noise. The trend builds gradually over multiple ticks, converging to the actual rate of change. A single noisy tick pushes the level only slightly while the trend absorbs the rest. This lets the smoothing be aggressive enough to filter noise while still reacting quickly to sustained changes.
Asymmetric reaction. The algorithm uses different smoothing parameters for upward and downward movements. When the metric rises faster than the forecast, the algorithm picks it up quickly (both the level and the trend react aggressively) because missing a spike means the application enters the latency cliff before the scaler can respond. When the metric drops, the algorithm follows slowly, letting the downward trend build over many ticks before acting on it. A brief dip might be noise, and scaling down too eagerly risks having to scale right back up. This reflects the reality that under-provisioning is immediately damaging, while brief over-provisioning only costs resources.
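A minimal sketch of Holt's double exponential smoothing with asymmetric parameters. The parameter values here are invented for illustration; ICC's actual tuning is described in the whitepaper:

```typescript
interface HoltState {
  level: number; // smoothed estimate of the aggregate right now
  trend: number; // smoothed estimate of its rate of change per tick
}

// React aggressively when the observation comes in above the forecast,
// slowly when it comes in below. These constants are illustrative only.
const UP = { alpha: 0.5, beta: 0.4 };
const DOWN = { alpha: 0.1, beta: 0.05 };

function holtUpdate(state: HoltState, observation: number): HoltState {
  const forecast = state.level + state.trend;
  const { alpha, beta } = observation >= forecast ? UP : DOWN;

  const level = alpha * observation + (1 - alpha) * forecast;
  const trend = beta * (level - state.level) + (1 - beta) * state.trend;
  return { level, trend };
}

// Extrapolate the trend H ticks into the future (the prediction horizon).
const predict = (state: HoltState, horizonTicks: number): number =>
  state.level + horizonTicks * state.trend;
```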
The prediction horizon. The horizon H determines how far into the future the algorithm looks when extrapolating the trend. It is derived from observed pod startup times: how long it actually takes a new pod to be scheduled, initialized, and ready to serve traffic. ICC measures this from real-scale-up events in the cluster and adapts over time, so the horizon tracks actual infrastructure conditions rather than relying on a hardcoded constant. A multiplier extends the horizon slightly beyond the measured startup time to provide a safety buffer, and configurable floor and ceiling bounds prevent the horizon from becoming too short (which would reduce the algorithm's effectiveness) or too long (which would make the extrapolation unreliable).
Handling metric saturation. Some metrics have a natural cap. ELU maxes out at 1.0: once the event loop is fully saturated, ELU cannot rise further, no matter how much more traffic arrives. Without special handling, the trend would decay to zero during saturation (the level stops rising because the input is clipped), and the algorithm would stop scaling even though the load is still growing behind the cap. ICC handles this by preserving the trend during saturation: the trend is allowed to increase but never decrease while the metric is clipped, so the algorithm continues to scale up even when the signal is flat at its maximum.
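In sketch form, the saturation guard is a small adjustment on top of the holtUpdate function above (again illustrative, not ICC's exact rule):

```typescript
// While the aggregate is clipped at its cap (podCount * 1.0 for ELU),
// let the trend grow but never shrink, so scaling continues behind the cap.
function holtUpdateSaturating(
  state: HoltState,
  observation: number,
  cap: number
): HoltState {
  const next = holtUpdate(state, observation);
  if (observation >= cap) {
    return { level: next.level, trend: Math.max(next.trend, state.trend) };
  }
  return next;
}
```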
The scaling decision. The prediction stage produces a predicted aggregate (AH): the forecasted total load at the horizon. The decision stage converts this back into a per-instance value by dividing by the current pod count, producing the projected per-pod metric at the horizon (MH). This is what the chart shows. If MH exceeds the threshold τ, the algorithm computes how many pods are needed to keep the per-instance metric below the threshold and scales up immediately. If the trend is flat or falling and the metric is within the threshold, it considers scaling down, with a safety margin to avoid immediately scaling right back up.
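Putting the pieces together, the decision step is a small amount of arithmetic. This sketch reuses the hypothetical HoltState and predict from above, uses invented horizon bounds, and assumes one tick per second, matching the alignment grid:

```typescript
// Horizon derived from measured pod startup time, padded by a multiplier and
// clamped between a floor and a ceiling (all numbers here are placeholders).
function horizonTicks(measuredStartupSeconds: number): number {
  const padded = measuredStartupSeconds * 1.2;
  return Math.min(Math.max(padded, 30), 300);
}

function decideReplicas(
  state: HoltState,
  podCount: number,
  threshold: number, // e.g. 0.7 ELU
  measuredStartupSeconds: number
): number {
  const aggregateAtHorizon = predict(state, horizonTicks(measuredStartupSeconds)); // A_H
  const perPodAtHorizon = aggregateAtHorizon / podCount;                            // M_H

  if (perPodAtHorizon > threshold) {
    // Enough pods so that each one sits below the threshold at the horizon.
    return Math.ceil(aggregateAtHorizon / threshold);
  }
  // Scale-down (with its safety margin) is omitted from this sketch.
  return podCount;
}
```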
The full algorithm, including the mathematical formulation and worked examples, is described in the algorithm whitepaper.
Signals
Accurate forecasting depends on good data. If you average metrics over 15 seconds, you lose the details that matter for short-term prediction—a sharp spike in the last 5 seconds looks the same as a slow rise over 15. This makes trend estimates and forecasts less precise. ICC uses raw metric samples instead. Each pod sends every measurement to ICC in batches, with no averaging or data loss. The batch timing adjusts based on load: under heavy traffic, batches go out every 5 seconds for fresh data; when idle, they go out every 40 seconds to save resources. This way, a spike that started 5 seconds ago is seen right away, not hidden in a delayed average.
Benchmarks
To measure the difference predictive scaling makes in practice, we tested ICC against HPA and KEDA under identical conditions on the same cluster with the same application.
Test setup
A Next.js 16 e-commerce application (App Router, Server Components, SSR) runs on Platformatic Watt with one worker per pod (1 CPU / 2 GB RAM). An Envoy proxy with 30-second linear slow start sits between the load balancer and the pods, ramping traffic to new pods gradually so that V8 JIT compilation on cold code paths does not distort the comparison.
All three scalers operate on the same deployment (min 4, max 20 pods):
ICC, predictive scaling on ELU with a 0.7 threshold
KEDA, scaling on ELU (via Prometheus query) with a 0.7 threshold
HPA, scaling on CPU utilization with a 70% target
KEDA uses the same metric and threshold as ICC, so the comparison isolates the scaling algorithm. HPA is included because it is the most widely deployed Kubernetes scaler; its results reflect the choice of metric (CPU instead of ELU) in addition to the reactive algorithm.
The benchmark ran on AWS EKS (us-east-1), Kubernetes v1.35, with 4 worker nodes (m5.2xlarge: 8 vCPU, 32 GB RAM each). Load was generated from a dedicated EC2 instance (c7gn.2xlarge, ARM64) in the same VPC using Grafana k6. The full benchmark automation, scaler configurations, and raw data are available in the benchmark repository.
Each chart below shows three traces: the average ELU across all pods (purple, left axis), the pod count (green, right axis), and the target request rate (grey shaded area). The dashed red line marks the ELU threshold of 0.7.
Steady ramp
Traffic grows from 10 to 800 req/s over ~2.5 minutes, then holds at 800 req/s for 90 seconds. This is the most common real-world pattern: traffic grows gradually as users arrive over the course of minutes.
ICC
The predictive algorithm keeps ELU below the 0.7 threshold. It watches the trend, predicts where ELU is heading, and scales up ahead of time to match capacity. It also avoids over-provisioning by using only as many pods as needed to keep ELU near the threshold.
KEDA
KEDA uses the same metric and threshold, but since it's reactive, it waits for ELU to cross the threshold before scaling. This means it can't keep ELU below the threshold during traffic increases, and average ELU hits 0.92 at the peak. The problem gets worse as overloaded pods slow down non-linearly due to queuing, eventually forcing the scaler to add even more pods.
Lowering the threshold doesn't fix the problem. It doesn't make the scaler react faster; it just makes it respond to a lower value. The app then runs at a lower utilization all the time, using more pods for the same load. This means you always pay for extra capacity, not just during spikes.
HPA
HPA behaves like KEDA, but it scales based on CPU usage instead of ELU. Since CPU isn't a measure of a Node.js application's health, we see that ELU stays elevated for the majority of our load test.
Comparing the results
How the scaler works directly affects what users experience. When ELU stays below the threshold, the event loop handles requests quickly. But if ELU goes over the threshold, queues and delays grow, and response times can reach the client timeout.
| | ICC | KEDA | HPA |
| --- | --- | --- | --- |
| Success Rate | 99.47% | 95.11% | 90.97% |
| Avg. Latency | 167 ms | 1,174 ms | 1,499 ms |
| Median Latency | 26 ms | 154 ms | 522 ms |
| p(90) Latency | 317 ms | 3,530 ms | 4,168 ms |
| p(99) Latency | 1,970 ms | 10,001 ms | 10,001 ms |
| Errors | 718 | 6,591 | 12,039 |
ICC kept ELU close to the threshold, with a 99.47% success rate and 317 ms at the 90th percentile. KEDA and HPA spent a lot of time above the threshold: KEDA lost 5% of requests, and HPA lost 9%. Their 99th percentile latencies hit the 10-second client timeout because queues grew faster than the event loop could handle.
Sudden spike
In this test, traffic jumps from 0 to 800 requests per second in 10 seconds, then stays at 800 for 2 minutes. No scaler can stop the initial overload—there's no trend history to predict from and no time to start new pods. The real question is how fast each scaler recovers.
ICC
Without any trend history, ICC can't predict the spike. But as soon as the first data comes in, it quickly builds a trend estimate and starts scaling aggressively. The algorithm keeps scaling even when ELU is maxed out at 1.0, maintaining the trend through saturation instead of being fooled by the cap.
KEDA
The reactive formula scales in proportion to the current overload ratio, but each decision is based on a single snapshot. It cannot account for the fact that the load arrived all at once, and more capacity is needed than the current ratio suggests. The result is a staircase of incremental scale-ups, each insufficient, while ELU remains elevated.
HPA
HPA has the same reactive limitation as KEDA, but it's even worse because it uses CPU utilization, which lags behind event loop saturation in Node.js apps. The scaler doesn't see the real urgency, so it scales up even more slowly.
| | ICC | KEDA | HPA |
| --- | --- | --- | --- |
| Success Rate | 91.51% | 87.47% | 77.31% |
| Avg. Latency | 1,126 ms | 1,989 ms | 2,205 ms |
| Median Latency | 55 ms | 855 ms | 1,102 ms |
| p(90) Latency | 3,385 ms | 6,108 ms | 7,338 ms |
| p(99) Latency | 10,001 ms | 10,001 ms | 10,001 ms |
| Errors | 8,028 | 11,212 | 21,067 |
All three scalers struggle during the initial burst—the 99th percentile latency hits the 10-second client timeout for all of them. The difference is in how they recover. ICC's median latency drops to 55 ms after the burst, so most requests are served normally. KEDA (855 ms) and HPA (1,102 ms) stay slow throughout the test, and HPA loses almost a quarter of all requests.
Conclusion
Reactive scaling has a built-in limit. No matter how you adjust HPA or KEDA, they'll always spot overload after it starts, scale up after the damage is done, and have trouble handling the effects of their own actions.
Predictive scaling with ICC gets rid of this problem. By watching the load trend and forecasting where it will be when new pods are ready, ICC scales up before demand hits. The benchmarks show the impact: median latency is six times lower than KEDA, twenty times lower than HPA, and ICC achieves a 99.47% success rate during heavy traffic.
This also changes how you manage your baseline. If you can't trust your scaler to handle spikes smoothly, you keep extra pods running as a safety net, which means paying for idle capacity all the time. But if your scaler can add pods ahead of time without hurting performance, you can run closer to real demand. Predictive scaling not only boosts performance under load—it also cuts costs when traffic is low.
ICC is available now. If you're running high-traffic systems and want to get started, we'd love to chat! Drop us a note at hello@platformatic or reach out to us on LinkedIn.
Thanks and happy building!