Durable Workflows for Kubernetes with Version-Safe Orchestration

Workflow DevKit lets you write durable, long-running workflows directly in your Next.js and Node.js apps. You define steps with 'use step', and the SDK handles persistence, retries, and replay automatically. Workflows survive server restarts, can sleep for days, and resume exactly where they left off.

On Vercel, all of this works out of the box — the platform handles deployment versioning and queue routing behind the scenes. But what happens when you deploy to your own Kubernetes cluster? Version mismatch. And it’s subtle enough to corrupt data before you notice.

We built Platformatic World to fix this. It’s a drop-in World implementation that brings the same deployment safety to any Kubernetes cluster. Every workflow run is pinned to the code version that created it. Queue messages are routed to the correct versioned pods. Old versions stay alive until all their in-flight runs are complete.

The version mismatch problem

Workflow DevKit uses deterministic replay. When a workflow resumes after a step, it runs the whole function again from the start, matching each step to its cached result by order. The correlation IDs that link steps to cached results come from a seeded random number generator tied to the run ID. If the code and seed are the same, the sequence stays the same.
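To make the "same seed, same sequence" property concrete, here is a minimal sketch. It is not the SDK's actual generator: the Park-Miller PRNG, the hash used to turn a run ID into a seed, and the `corr_` ID format are all assumptions chosen for illustration.

```typescript
// Hypothetical stand-in for the SDK's seeded generator: a Park-Miller
// (minstd) PRNG whose seed is derived from the run ID.
const MODULUS = 2147483647; // 2^31 - 1, prime

function seedFromRunId(runId: string): number {
  let hash = 7;
  for (let i = 0; i < runId.length; i++) {
    hash = (hash * 31 + runId.charCodeAt(i)) % MODULUS;
  }
  return hash === 0 ? 1 : hash; // the generator state must never be zero
}

function createIdGenerator(runId: string): () => string {
  let state = seedFromRunId(runId);
  return () => {
    state = (state * 48271) % MODULUS;
    return "corr_" + state.toString(16);
  };
}

// The first execution and a later replay of the same run draw the same IDs,
// in the same order, as long as they ask for them in the same order.
const original = createIdGenerator("run_123");
const replayed = createIdGenerator("run_123");
const originalIds = [original(), original(), original()];
const replayedIds = [replayed(), replayed(), replayed()];
```

The IDs carry no information about which step requested them. Only the position in the sequence matters, which is why the order of step calls has to stay stable.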

This works perfectly until you deploy a new version.

If a run that started on v1 replays on v2 and the step order has changed, the correlation IDs won’t match anymore. For example, the cached result from chargeCard could be handed to the new addDiscount step.
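The failure can be reproduced with a toy replay engine. This is a sketch under simplifying assumptions: correlation IDs are modeled as plain positions (`id-0`, `id-1`, ...) instead of seeded random values, the cache is an in-memory object, and the order workflow with its step bodies is hypothetical.

```typescript
type Step = <T>(stepName: string, run: () => T) => T;
type ResultCache = { [correlationId: string]: unknown };

const executed: string[] = []; // step bodies that really ran (side effects)

function makeStep(cache: ResultCache): Step {
  let position = 0;
  function step<T>(stepName: string, run: () => T): T {
    const id = "id-" + position++; // the Nth step call always gets the Nth ID
    if (id in cache) return cache[id] as T; // replay: reuse the cached result
    executed.push(stepName);
    const result = run();
    cache[id] = result;
    return result;
  }
  return step;
}

// v1 of a hypothetical order workflow
function orderV1(step: Step) {
  step("reserveInventory", () => "reserved");
  return step("chargeCard", () => ({ charged: 100 }));
}

// v2 inserts addDiscount before chargeCard
function orderV2(step: Step) {
  step("reserveInventory", () => "reserved");
  const discount = step("addDiscount", () => 10);
  const receipt = step("chargeCard", () => ({ charged: 100 - discount }));
  return { discount, receipt };
}

const cache: ResultCache = {};
orderV1(makeStep(cache)); // the run executes its first two steps on v1
const replayOnV2 = orderV2(makeStep(cache)); // the same run replays on v2
```

On the v2 replay, addDiscount silently receives the cached chargeCard result, and chargeCard finds no cached entry at its new position, so its body runs a second time. No error is thrown at any point.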

The workflow can quietly produce wrong results or fail in ways that are hard to spot. On Vercel, the Vercel World handles this for you. On self-hosted Kubernetes, you have to manage it yourself.

We already solved this for HTTP

ICC (Intelligent Command Center) is our Kubernetes controller for managing app deployments. We recently added skew protection. Here’s how it works for HTTP traffic:

When a user starts a session on version N, a cookie pins all subsequent requests to version N via Gateway API HTTPRoute rules. New visitors are routed to the latest active version.
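In the cluster this decision is made declaratively by HTTPRoute rules, not by application code. The TypeScript below only sketches the equivalent logic; the cookie name `plt-version` and the fallback for a pin to an inactive version are assumptions, not ICC's documented behavior.

```typescript
// Hypothetical cookie name; the post does not say which one ICC sets.
const VERSION_COOKIE = "plt-version";

function readCookie(cookieHeader: string, cookieName: string): string | undefined {
  for (const rawPair of cookieHeader.split(";")) {
    const pair = rawPair.trim();
    const eq = pair.indexOf("=");
    if (eq > 0 && pair.slice(0, eq) === cookieName) return pair.slice(eq + 1);
  }
  return undefined;
}

function pickVersion(cookieHeader: string, activeVersions: string[], latest: string): string {
  const pinned = readCookie(cookieHeader, VERSION_COOKIE);
  // Assumption: a pin to a version that is no longer active falls back to latest.
  if (pinned !== undefined && activeVersions.indexOf(pinned) !== -1) return pinned;
  return latest;
}

// A returning user stays on v1; a new visitor lands on v2.
const returningUser = pickVersion("theme=dark; plt-version=v1", ["v1", "v2"], "v2");
const newVisitor = pickVersion("", ["v1", "v2"], "v2");
```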

Workflow runs work the same way: a run that starts on version N must keep running on version N until it finishes. The difference is in the transport. HTTP requests go through the Gateway API, but workflow queue messages do not.

Why we couldn’t just extend the Intelligent Command Center

Our first design had pods accessing PostgreSQL directly, with ICC handling queue routing. We abandoned it because the ICC couldn’t reliably determine when a version had no in-flight runs.

The problem: workflow runs can be suspended in ways that are invisible to the infrastructure.

When a workflow registers a webhook and then suspends, the pod becomes idle. There’s no memory use, no heartbeat, and no queue message. ICC sees no activity and expires the version. If someone clicks the webhook link hours later, the run’s pods are already gone.

The only way to know if a version still has work in progress is to check the runs table. For that, you need a service that owns the data.

How Platformatic World works

Platformatic World consists of two packages:

@platformatic/workflow is a Watt application backed by PostgreSQL that manages all workflow state and queue routing. Every operation, like event creation, run queries, queue dispatch, hook registration, and encryption, goes through it.
@platformatic/world is a lightweight HTTP client that implements the Workflow DevKit’s World interface. This is what your app imports.

The service enforces multi-tenant isolation at the SQL level by scoping every query to the application_id.
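One way to picture this kind of scoping is a query helper that always emits the tenant condition first, so no caller can forget it. This is a sketch, not the service's code: the helper, the table names, and the `status` column are hypothetical, and only the application_id column comes from the description above. The `$1`, `$2` placeholders follow PostgreSQL's parameter syntax.

```typescript
interface ScopedQuery {
  text: string;
  values: unknown[];
}

// Hypothetical helper: every generated query starts with the tenant filter.
function scopedSelect(
  table: "runs" | "hooks" | "events",
  applicationId: string,
  filters: { [column: string]: unknown }
): ScopedQuery {
  const conditions = ["application_id = $1"];
  const values: unknown[] = [applicationId];
  for (const column of Object.keys(filters)) {
    // Column names are interpolated, so restrict them to a safe pattern.
    if (!/^[a-z_]+$/.test(column)) throw new Error("invalid column: " + column);
    values.push(filters[column]);
    conditions.push(column + " = $" + values.length);
  }
  return {
    text: "SELECT * FROM " + table + " WHERE " + conditions.join(" AND "),
    values,
  };
}

const runningRuns = scopedSelect("runs", "my-app", { status: "running" });
// runningRuns.text is "SELECT * FROM runs WHERE application_id = $1 AND status = $2"
```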

Version-aware queue routing

Each queue message includes a deployment_version. The router finds the registered handler for that version and sends the message to the right pod. Messages for v1 always go to v1 pods, even after v2 is deployed.
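The routing rule amounts to a lookup keyed by application and version. The sketch below assumes an in-memory registry and made-up pod URLs; the class, its method names, and the application_id field on the message are illustrative, and only deployment_version is taken from the text.

```typescript
interface QueueMessage {
  application_id: string; // assumption: messages also carry their tenant
  deployment_version: string;
  payload: unknown;
}

class VersionRouter {
  private handlers: { [key: string]: string } = {};

  private key(applicationId: string, version: string): string {
    return applicationId + "@" + version;
  }

  register(applicationId: string, version: string, podUrl: string): void {
    this.handlers[this.key(applicationId, version)] = podUrl;
  }

  deregister(applicationId: string, version: string): void {
    delete this.handlers[this.key(applicationId, version)];
  }

  // Returns the handler URL for the message's version, or undefined if
  // no handler is registered for it.
  resolve(message: QueueMessage): string | undefined {
    return this.handlers[this.key(message.application_id, message.deployment_version)];
  }
}

const router = new VersionRouter();
router.register("my-app", "v1", "http://my-app-v1.default.svc.cluster.local");
router.register("my-app", "v2", "http://my-app-v2.default.svc.cluster.local");

// A message created by a v1 run still resolves to the v1 pods after v2 is live.
const target = router.resolve({ application_id: "my-app", deployment_version: "v1", payload: {} });
```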

If a dispatch fails, it uses exponential backoff and tries up to 10 times before moving the message to the dead-letter queue.
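A retry policy of this shape can be written as a pure function of the number of failed attempts. Only the limit of 10 attempts comes from the text; the one-second base delay, the doubling factor, and the 60-second cap are assumptions for the sake of the sketch.

```typescript
const MAX_ATTEMPTS = 10; // from the text above
const BASE_DELAY_MS = 1000; // assumption
const MAX_DELAY_MS = 60000; // assumption

type Outcome = { kind: "retry"; delayMs: number } | { kind: "dead-letter" };

// failedAttempts counts the dispatch attempts that have failed so far (1-based).
function afterFailure(failedAttempts: number): Outcome {
  if (failedAttempts >= MAX_ATTEMPTS) return { kind: "dead-letter" };
  const delayMs = Math.min(BASE_DELAY_MS * Math.pow(2, failedAttempts - 1), MAX_DELAY_MS);
  return { kind: "retry", delayMs };
}

const firstFailure = afterFailure(1); // retry after 1000 ms
const thirdFailure = afterFailure(3); // retry after 4000 ms
const tenthFailure = afterFailure(10); // give up and dead-letter
```

Keeping the policy free of timers makes it easy to test, and the dispatcher that calls it only has to schedule the returned delay or move the message.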

Safe version draining

When ICC finds a new version, it checks with the workflow service to see if the old version still has any work in progress. The service looks at active runs, pending hooks, waiting sleeps, and queued messages. ICC only decommissions the old version when all these counts are zero.
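The drain decision reduces to a check over four counters. The field names and the shape of the status object below are hypothetical; the post does not show the actual response format of the workflow service.

```typescript
// Hypothetical shape of the per-version status ICC gets from the service.
interface VersionActivity {
  activeRuns: number;
  pendingHooks: number;
  waitingSleeps: number;
  queuedMessages: number;
}

function canDecommission(activity: VersionActivity): boolean {
  return (
    activity.activeRuns === 0 &&
    activity.pendingHooks === 0 &&
    activity.waitingSleeps === 0 &&
    activity.queuedMessages === 0
  );
}

// A run suspended on a webhook: the pods look idle, but the version is still needed.
const suspendedOnWebhook: VersionActivity = {
  activeRuns: 1,
  pendingHooks: 1,
  waitingSleeps: 0,
  queuedMessages: 0,
};

const fullyDrained: VersionActivity = {
  activeRuns: 0,
  pendingHooks: 0,
  waitingSleeps: 0,
  queuedMessages: 0,
};

const keepOldVersion = !canDecommission(suspendedOnWebhook);
const safeToRemove = canDecommission(fullyDrained);
```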

If a version stays alive longer than allowed, ICC can force-expire it. This cancels in-flight runs, moves queued messages to the dead-letter queue, and deregisters handlers.

Zero-config in Kubernetes

In production with ICC, you don’t need to write any configuration code. You just set two environment variables in your Dockerfile and add three pod labels in your Deployment spec:

ENV WORKFLOW_TARGET_WORLD="@platformatic/world"
ENV PLT_WORLD_SERVICE_URL="http://workflow.platformatic.svc.cluster.local"

# Pod labels in your Deployment spec
labels:
 app.kubernetes.io/name: my-app
 plt.dev/version: "v1"
 plt.dev/workflow: "true"

The Workflow DevKit discovers the world automatically. At startup, @platformatic/world (the library your app imports) resolves the app ID from the PLT_WORLD_APP_ID env var or the package.json name, detects the deployment version from the plt.dev/version label via the K8s API, and authenticates using the pod’s ServiceAccount token. On the infrastructure side, ICC sees the plt.dev/workflow label and registers queue handlers with @platformatic/workflow, so dispatched messages reach the correct versioned pod.
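The app ID lookup order described above can be sketched as a small function. This is an illustration rather than the library's code: the function name is made up, the environment and package.json contents are passed in as plain objects, and throwing when neither source is available is an assumption.

```typescript
// Hypothetical helper mirroring the described precedence:
// PLT_WORLD_APP_ID first, then the package.json name.
function resolveAppId(
  env: { [key: string]: string | undefined },
  packageJson: { name?: string }
): string {
  const fromEnv = env["PLT_WORLD_APP_ID"];
  if (fromEnv) return fromEnv;
  if (packageJson.name) return packageJson.name;
  // Assumption: with neither source available, startup fails loudly.
  throw new Error("Cannot resolve app ID: set PLT_WORLD_APP_ID or a package.json name");
}

const explicitId = resolveAppId({ PLT_WORLD_APP_ID: "checkout" }, { name: "my-app" });
const fallbackId = resolveAppId({}, { name: "my-app" });
```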

You don’t need to change your workflow code. The same 'use workflow' and 'use step' directives work just like they do on Vercel.

Local development

For local development, the workflow service runs in single-tenant mode without authentication — no K8s, no ICC. Start PostgreSQL and the workflow service:

npx @platformatic/workflow postgresql://user:pass@localhost:5432/workflow

Then configure your app to connect to it with the same two environment variables from the Dockerfile above, just pointing at localhost:

WORKFLOW_TARGET_WORLD=@platformatic/world
PLT_WORLD_SERVICE_URL=http://localhost:3042

Your app also needs to call world.start() once the server starts. This registers a queue handler so the workflow service can dispatch messages back to your app. In K8s with ICC, this is a no-op (ICC handles it). Here’s a Next.js example using instrumentation.ts:

// instrumentation.ts — Next.js calls register() once on server startup
export async function register() {
  if (process.env.PLT_WORLD_SERVICE_URL) {
    const { createWorld } = await import('@platformatic/world')
    const world = createWorld()
    await world.start?.()
  }
}

Other frameworks have different startup hooks (Fastify plugins, Express middleware, etc.) — the key is to call world.start() once before your app starts handling requests.

The service auto-provisions a default application, so no further setup is needed.

Observability in ICC

The ICC dashboard gives you full visibility into your workflow runs. The Workflows tab shows a real-time list of all runs for each application, with status, version, and duration.

Click a run to inspect it. The Trace view shows a waterfall of every step, with timing bars and status indicators. You can see exactly where time was spent and which steps ran in parallel.

The Graph tab visualizes the workflow structure as a directed graph. Sequential steps flow vertically, and parallel steps are laid out side by side. After the first completed run of a version, the graph pre-renders immediately for subsequent runs, so you see the full structure before the workflow even starts executing.

You can also replay completed runs from the dashboard (targeting the original deployment version), cancel running workflows, and inspect hooks, events, and streams.

Try it

You can find the repository at github.com/platformatic/platformatic-world. The @platformatic/world package is a drop-in replacement for Vercel World. If your workflows run on Vercel today, they’ll work on your cluster with Platformatic World.

We’d love to hear how you use it. Feel free to open an issue or contact us on Discord.
