How We Made @platformatic/kafka 223% Faster (And What We Learned Along the Way)

A few months ago, we wrote about why we built yet another Kafka client for Node.js. The benchmarks looked promising: we were outperforming KafkaJS and holding our own against the native clients. But something was off. The numbers didn't align with what we were observing in production environments.
We continued running the tests and analyzing the results, but they consistently failed to match our production experience. The variance was high, the sample sizes were small, and we lacked confidence that we were measuring what we intended to measure.
We decided to revisit our approach fundamentally. Not just to make @platformatic/kafka faster, but to ensure we were testing it correctly in the first place.
It turned out our methodology was flawed. Correcting that led us down a path that resulted in substantial performance improvements.
Performance Summary
Here's where we ended up with v1.21.0:
Producer (Single Message): 92,441 op/sec—48% faster than KafkaJS
Producer (Batch): 4,465 op/sec—53% faster than KafkaJS
Consumer: 159,828 op/sec—9% faster than our previous version
That single-message producer number represents a 223% improvement over v1.16.0.
Benchmark Methodology Issues
When we initially ran benchmarks for our first blog post, we used what appeared to be a standard approach: send messages, measure elapsed time, and calculate operations per second.
The problem was that we were only capturing timing measurements every 100 messages. For the rdkafka-based libraries, we weren't properly waiting for delivery reports. We were essentially sending messages without tracking when they were actually acknowledged. The timing measurements were inconsistent and unreliable.
Our initial results reflected these methodological flaws:

┌─────────────────────────────────────────────┬─────────┬────────────────┬───────────┐
│ Library │ Samples │ Result │ Tolerance │
├─────────────────────────────────────────────┼─────────┼────────────────┼───────────┤
│ node-rdkafka │ 100 │ 68.30 op/sec │ ± 67.58 % │
│ @confluentinc/kafka-javascript (rdkafka) │ 100 │ 220.26 op/sec │ ± 1.24 % │
│ KafkaJS │ 100 │ 383.82 op/sec │ ± 3.91 % │
│ @platformatic/kafka │ 100 │ 582.59 op/sec │ ± 3.97 % │
└─────────────────────────────────────────────┴─────────┴────────────────┴───────────┘
Consider the variance on node-rdkafka: ±67.58%. Variance that high means the measurements are essentially unreliable. And with only 100 samples per run, we had little statistical confidence in any of the numbers.
We completely rewrote the benchmark suite with the following improvements:
Per-operation timing: Instead of sampling every 100 messages, we now measure timing for each individual operation. This provides significantly more granular data and much lower variance.
Proper delivery tracking: For rdkafka-based libraries, we now send a message and wait for its specific delivery report before timing the next operation. This ensures accurate per-message timing.
Substantially larger sample sizes: We increased from 100 samples to 100,000 for most tests. While this increases execution time, the results are statistically meaningful.
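To make the approach concrete, here is a minimal sketch of a per-operation timing harness in plain Node.js. This is illustrative only, not our actual benchmark suite: it awaits each operation individually (so, for a producer, each delivery acknowledgment) before timing the next, then derives throughput and a ± tolerance from the per-operation durations.

```javascript
// Minimal per-operation benchmark harness (illustrative sketch, not the real suite).
async function benchmark (name, samples, op) {
  const durations = new Float64Array(samples)

  for (let i = 0; i < samples; i++) {
    const start = process.hrtime.bigint()
    await op() // wait for each operation (e.g. a delivery ack) before timing the next
    durations[i] = Number(process.hrtime.bigint() - start) // nanoseconds
  }

  const mean = durations.reduce((a, b) => a + b, 0) / samples
  const variance = durations.reduce((a, b) => a + (b - mean) ** 2, 0) / samples
  const stddev = Math.sqrt(variance)

  return {
    name,
    opsPerSec: 1e9 / mean,
    tolerance: (stddev / mean) * 100 // the "± %" column in the tables
  }
}

// Usage with a dummy async operation standing in for producer.send()
benchmark('dummy', 1000, () => new Promise(resolve => setImmediate(resolve)))
  .then(result => {
    console.log(`${result.name}: ${result.opsPerSec.toFixed(2)} op/sec ± ${result.tolerance.toFixed(2)} %`)
  })
```

Because each sample is one awaited operation, the tolerance directly reflects per-message latency jitter rather than averaging it away over 100-message windows.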
When we re-ran the tests with the corrected methodology, the numbers improved dramatically across all libraries—particularly for the rdkafka-based ones:
| Library | Producer Single | Producer Batch | Consumer |
|---|---|---|---|
| @platformatic/kafka v1.21.0 | 92,441 op/s | 4,465 op/s | 159,828 op/s |
| @platformatic/kafka v1.16.0 | 28,596 op/s | 3,779 op/s | 146,862 op/s |
| KafkaJS | 62,450 op/s | 2,923 op/s | 120,279 op/s |
| node-rdkafka | 16,488 op/s | 701 op/s | 133,526 op/s |
| Confluent KafkaJS | 19,721 op/s | 2,311 op/s | 139,881 op/s |
| Confluent rdkafka | 21,587 op/s | 2,648 op/s | 127,146 op/s |
The libraries themselves hadn't changed—we had simply started measuring them accurately.
However, these improved benchmarks revealed performance issues in our own implementation that required attention.
Identifying and Addressing Performance Bottlenecks
With proper measurements in place, we could precisely identify where @platformatic/kafka was spending its time and where optimization opportunities existed.
Our v1.16.0 performance numbers were respectable—28,596 op/sec for single messages—but the ±34.18% variance was concerning. In production environments, variance of this magnitude translates to unpredictable latency spikes, which contradicts our design goals.
We began systematic profiling. The first bottleneck that became apparent was CRC32C computation. We were calculating checksums for every message (as required by the Kafka protocol) using a pure JavaScript implementation. While functional, it exhibited both low throughput and high variance.
We integrated @node-rs/crc32, a native Rust implementation (#126). The improvement was immediate and substantial—not just in throughput, but in consistency. The timing became significantly more predictable.
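For context on why this was a bottleneck, a table-driven pure JavaScript CRC32C looks roughly like the sketch below (illustrative, not our former implementation). The per-byte loop runs for every single message, which is exactly the kind of hot path where handing off to a native implementation such as @node-rs/crc32 pays for itself:

```javascript
// Pure-JS CRC32C (Castagnoli polynomial, reflected form 0x82F63B78), table-driven.
const CRC32C_TABLE = new Uint32Array(256)
for (let i = 0; i < 256; i++) {
  let c = i
  for (let k = 0; k < 8; k++) {
    c = c & 1 ? 0x82f63b78 ^ (c >>> 1) : c >>> 1
  }
  CRC32C_TABLE[i] = c >>> 0
}

function crc32c (buffer) {
  let crc = 0xffffffff
  // One table lookup and XOR per byte: cheap individually, hot at 90k msgs/sec
  for (let i = 0; i < buffer.length; i++) {
    crc = CRC32C_TABLE[(crc ^ buffer[i]) & 0xff] ^ (crc >>> 8)
  }
  return (crc ^ 0xffffffff) >>> 0
}

// Standard CRC-32C check value for the ASCII string "123456789"
console.log(crc32c(Buffer.from('123456789')).toString(16)) // e3069283
```

A native implementation also sidesteps the JIT warm-up and garbage-collection interactions that contributed to the timing variance we saw.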
@baac0 contributed a pull request that refactored error handling in request serialization (#154). Initially, we viewed this primarily as code cleanup. This assessment proved incorrect. By handling errors asynchronously rather than blocking the serialization path, we eliminated an entire category of event loop blockages. Throughput increased substantially.
@jmdev12 identified a subtle bug in our metadata request handling (#144). We were improperly mixing callbacks in kPerformDeduplicated, which occasionally caused requests to hang or retry unnecessarily. Resolving this issue significantly improved connection handling reliability.
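The general pattern behind deduplicated requests, sketched generically here (this is not the library's actual kPerformDeduplicated code), is to share one in-flight promise among all concurrent callers for the same key. That sharing is also why a mishandled callback is so damaging: it can hang or mis-resolve every waiter at once.

```javascript
// Generic request deduplication: concurrent callers for the same key share
// one in-flight promise instead of issuing duplicate requests.
const inflight = new Map()

function performDeduplicated (key, operation) {
  const existing = inflight.get(key)
  if (existing) return existing

  const promise = operation().finally(() => {
    // Always clear the slot, or later callers would reuse a stale settled promise
    inflight.delete(key)
  })

  inflight.set(key, promise)
  return promise
}

// Usage: three concurrent metadata fetches result in a single underlying call
let calls = 0
const fetchMetadata = () => {
  calls++
  return new Promise(resolve => setImmediate(() => resolve({ brokers: 3 })))
}

Promise.all([
  performDeduplicated('metadata', fetchMetadata),
  performDeduplicated('metadata', fetchMetadata),
  performDeduplicated('metadata', fetchMetadata)
]).then(() => console.log(`underlying calls: ${calls}`)) // prints "underlying calls: 1"
```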
We also introduced a handleBackPressure option (#127) to provide users with control over flow control behavior. While the Kafka protocol includes back-pressure mechanisms where brokers can signal clients to slow down, we weren't handling this consistently. The new option allows fine-tuning of how the client responds to back-pressure signals.
After implementing these changes, we re-ran the benchmarks.

From 28,596 to 92,441 op/sec, a 223% improvement. Just as important, variance dropped to ±1.05%.
Batch Processing Performance
Single-message performance is important for real-time event streaming, but many Kafka workloads involve bulk data pipelines sending hundreds or thousands of messages in batches.
Our batch performance was already competitive in v1.16.0—3,779 op/sec for batches of 100 messages. With the same optimizations applied, we observed improvements here as well:

This represents an 18% improvement to 4,465 op/sec. Just as notably, we now outperform KafkaJS by 53% in batch scenarios. This performance difference becomes substantial when processing millions of messages daily.
Consumer Performance Improvements
Our consumer implementation was already performing well in initial tests, but we discovered several bugs. Partition assignment logic had issues (#138), and lag computation had edge cases that could produce incorrect results (#153).
Addressing these bugs improved performance from 146,862 to 159,828 op/sec:

The 9% throughput improvement is valuable, but the ±1.75% variance is more significant. This compares favorably to node-rdkafka's ±19.16% and the Confluent clients' ±18-24% variance. Consistent performance is often more valuable than peak throughput in production environments.
Performance Architecture
We frequently receive questions about how a pure JavaScript implementation can outperform native bindings to librdkafka. The answer lies not in a single optimization, but in the cumulative effect of multiple architectural decisions:
Minimal buffer copying: Every buffer allocation and copy adds garbage collection pressure. We designed the entire protocol handling layer to work with buffer slices and views wherever possible. When processing 90,000+ messages per second, avoiding unnecessary allocations has significant impact on both throughput and latency consistency.
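The difference between a view and a copy is easy to demonstrate with plain Node.js buffers (a generic sketch, not our parser code): subarray() returns a view over the same memory with no allocation, while Buffer.from() duplicates the bytes.

```javascript
// Zero-copy parsing sketch: read fields out of a frame as views, not copies.
const frame = Buffer.from([0x00, 0x04, 0x61, 0x62, 0x63, 0x64]) // length prefix + payload

// A view shares memory with the parent buffer: no allocation, no copy
const length = frame.readUInt16BE(0)
const payload = frame.subarray(2, 2 + length)

// A copy allocates new memory and duplicates the bytes
const copied = Buffer.from(frame.subarray(2, 2 + length))

frame[2] = 0x7a // mutate the parent: 'a' -> 'z'
console.log(payload.toString()) // "zbcd": the view sees the change
console.log(copied.toString())  // "abcd": the copy does not
```

Views require discipline (mutating the parent is visible everywhere), but in a protocol parser that trade-off buys far less garbage-collection pressure.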
Direct protocol implementation: There is no abstraction layer between application code and the wire protocol. Less indirection means fewer function calls, reduced stack manipulation, and more predictable performance characteristics. This also allows us to optimize hot paths without architectural constraints.
Non-blocking event loop usage: Node.js performs optimally when used according to its design principles—specifically, with async operations that don't block. The error handling refactor was particularly impactful. We had been blocking on error serialization in several code paths, and eliminating these blocks substantially reduced latency spikes.
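The shape of that refactor can be sketched generically (this is not the library's actual serialization code): instead of constructing and propagating errors synchronously inside the serialization loop, defer the error work to a microtask so the hot path keeps moving.

```javascript
// Sketch: keep the hot path free of synchronous error handling.
function serializeMessages (messages, onError) {
  const frames = []

  for (const message of messages) {
    if (typeof message.value !== 'string') {
      // Defer the expensive part (error creation, diagnostics) off the hot path
      queueMicrotask(() => onError(new Error(`invalid value for key ${message.key}`)))
      continue
    }

    frames.push(Buffer.from(message.value))
  }

  return frames
}

const frames = serializeMessages(
  [{ key: 'a', value: 'ok' }, { key: 'b', value: 42 }],
  err => console.error(err.message)
)
console.log(frames.length) // 1: the valid message was serialized without waiting on error handling
```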
Proper stream implementation: Node.js streams provide built-in back-pressure management when used correctly. When network sockets fill up, the stream pauses writes. When consumers cannot keep up, the fetch loop pauses. This keeps memory usage predictable and prevents unbounded memory growth.
Hot path optimization: Operations like CRC32C checksums, murmur2 partition hashing, and varint encoding execute for every single message. We profiled these operations extensively, optimized them, and profiled again. The migration to native CRC32C via Rust was the largest single improvement, but numerous smaller optimizations compound significantly at scale.
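As an example of one such per-message hot path, here is what zigzag varint encoding looks like in JavaScript (an illustrative sketch for 32-bit signed integers, not our production encoder). The Kafka record format uses this encoding for signed integer fields so that small magnitudes take few bytes:

```javascript
// ZigZag maps signed to unsigned (0, -1, 1, -2 -> 0, 1, 2, 3) so small
// magnitudes stay small; varint then packs 7 bits per byte with a continuation bit.
function encodeZigZagVarint (value) {
  let n = ((value << 1) ^ (value >> 31)) >>> 0 // zigzag for 32-bit signed ints
  const bytes = []
  while (n > 0x7f) {
    bytes.push((n & 0x7f) | 0x80) // set the continuation bit
    n >>>= 7
  }
  bytes.push(n)
  return Buffer.from(bytes)
}

function decodeZigZagVarint (buffer) {
  let n = 0
  let shift = 0
  for (const byte of buffer) {
    n |= (byte & 0x7f) << shift
    shift += 7
    if ((byte & 0x80) === 0) break
  }
  n = n >>> 0
  return (n >>> 1) ^ -(n & 1) // undo zigzag
}

console.log([...encodeZigZagVarint(300)]) // [ 216, 4 ]
console.log(decodeZigZagVarint(encodeZigZagVarint(-150))) // -150
```

A few bit operations per message sounds trivial, but multiplied by tens of thousands of messages per second, shaving allocations and branches here is measurable.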
It's worth noting that librdkafka implements similar optimizations—it's exceptionally well-optimized C code. However, it must cross the Node.js/C++ boundary for every operation, and that boundary crossing carries measurable overhead. By remaining in JavaScript, we avoid that overhead entirely.
The Journey Continues
What started as a nagging doubt about our benchmark methodology turned into something far more valuable: a comprehensive understanding of our library's performance characteristics and a 223% improvement in single-message throughput.
The lessons from this experience are worth highlighting. First, measurement matters—flawed benchmarks don't just waste time, they obscure real performance issues. By fixing our methodology, we exposed bottlenecks we hadn't even known existed. Second, community contributions matter tremendously. The PRs from our contributors didn't just fix bugs—they fundamentally improved our throughput and reliability. Third, consistency matters as much as peak performance. Reducing variance from ±34% to ±1% means your p99 latencies become predictable, which is what production systems actually need.
The results speak for themselves: @platformatic/kafka v1.21.0 now delivers 92,441 op/sec for single messages and 159,828 op/sec for consumption, with variance under ±2% across all scenarios. It's a 99% pure JavaScript library, and yet it outperforms libraries built on highly optimized C code.
If you're building Node.js applications where Kafka performance matters, we encourage you to evaluate @platformatic/kafka:
npm install @platformatic/kafka
Run the benchmarks on your own infrastructure—we've published the complete test suite in BENCHMARKS.md. Test it against your workload patterns. And if you find issues or have optimization ideas, we welcome contributions at github.com/platformatic/kafka. After all, that's how we got here.
All benchmarks executed on an M2 Max MacBook Pro with Node.js 22.19.0 against a three-broker Kafka cluster. Results may vary based on hardware and network configurations, though relative performance characteristics should remain comparable.






