Agents in Production: Reliable Orchestration and Security Enforcement on Kubernetes

AI agents are moving beyond demos and into real production use. In production, you need sessions that last through infrastructure changes, code that stays secure, and controls that your platform team can enforce.
Today, we’re launching two tools to meet these needs. Regina Coordinator handles stateful agent routing and recovery on Kubernetes. eBPF Sandbox enforces security policies for agent-generated code at the process level. Both run together inside Watt and answer the two main questions we hear from infrastructure teams: How can we run agents reliably at scale? And how can we do it without creating security risks?
Part 1: Regina Coordinator. Stateful Sessions on Ephemeral Infrastructure
The problem
When an AI agent manages a multi-turn conversation, it builds up state like conversation history, tool outputs, and files created during the session. This state exists in a specific process on a specific pod. Kubernetes is designed for stateless workloads and doesn’t guarantee pods will last, which doesn’t match how agents work.
If nothing manages this, a pod eviction or rolling deployment means a lost session. A crash at 2 am becomes a support ticket by morning. For agents used in product workflows, these problems aren’t acceptable.
The Regina Coordinator solves this problem. It sits in front of your Regina pods and manages the whole session lifecycle: routing, failure detection, backup, and recovery. Pod restarts, rolling deployments, and crashes are hidden from users. Sessions keep going without interruption.
Why stateful?
Some agent frameworks use a stateless approach. They store all session state in a database, make each request independent, and let any instance handle any request. This works well for simple request-response agents and is easy to manage.
Regina uses a stateful model because agents that run code build up real in-process state over time. An agent that writes and runs code, installs packages, keeps tool connections open, and uses a virtual filesystem is doing work that builds on itself, not just exchanging messages with a database. Serializing and restoring this environment on every message would be too costly or too limiting. For these agents, the virtual filesystem is the main workspace.
The tradeoff is more operational complexity: stateful sessions need routing, failure recovery, and backup. That’s what the Coordinator is for. The goal is to make stateful sessions work on stateless infrastructure, without putting the burden on application developers.
Architecture
Regina runs inside Watt, Platformatic’s Node.js application server. Watt manages multiple services in one process and handles their lifecycle and communication. Regina builds on this to support stateful agent instances.
The system has three main services:
Regina manages agent definitions, creates and manages agent instances, and handles their lifecycle, including suspension, backup, and restore.
Regina Agent is the runtime for each agent. Each instance runs in its own thread with its own AI model connection, tools, and virtual filesystem.
Regina Coordinator acts as the gateway. It routes requests to the right pod and manages failure recovery across the cluster.
Deployment
One coordinator manages a group of Regina pods. It runs as a standard Kubernetes service and is the entry point for all client traffic. Regina pods run as a headless service without a load balancer, so the coordinator connects to them directly using their pod IPs.
When a Regina pod starts, it registers itself in Redis with its address, instance count, and a 30-second TTL, and refreshes that TTL every 10 seconds. If a pod stops sending heartbeats, whether it crashed, was evicted, or lost network connectivity, its keys expire after 30 seconds and the coordinator stops routing to it. Failure detection is automatic, and the detection window is bounded at 30 seconds.
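The heartbeat-and-expiry scheme can be sketched as follows. This is a minimal illustration using an in-memory map as a stand-in for Redis; the key names and record shape are assumptions, not Regina's actual schema.

```typescript
// In-memory stand-in for the Redis-based pod registry.
// Record shape and names are illustrative, not Regina's real schema.
type PodRecord = { address: string; instances: number; expiresAt: number };

class HeartbeatRegistry {
  private pods = new Map<string, PodRecord>();
  constructor(private ttlMs = 30_000) {}

  // Called on pod start and every 10s thereafter: re-register with a fresh TTL.
  heartbeat(podId: string, address: string, instances: number, now: number): void {
    this.pods.set(podId, { address, instances, expiresAt: now + this.ttlMs });
  }

  // The coordinator only routes to pods whose TTL has not lapsed.
  healthyPods(now: number): string[] {
    return Array.from(this.pods.entries())
      .filter(([, p]) => p.expiresAt > now)
      .map(([id]) => id);
  }
}

const reg = new HeartbeatRegistry();
reg.heartbeat("pod-a", "10.1.2.3:3042", 4, 0);
reg.heartbeat("pod-b", "10.1.2.4:3042", 1, 0);
reg.heartbeat("pod-a", "10.1.2.3:3042", 4, 10_000); // pod-a keeps refreshing
// 35s in: pod-b missed its heartbeats, so its registration has expired.
console.log(reg.healthyPods(35_000)); // → [ 'pod-a' ]
```

In the real system the TTL is enforced by Redis itself, so no process has to sweep for dead pods; expiry simply makes the key disappear.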
Session lifecycle
When a user starts a new chat, the coordinator picks a pod using one of three allocation strategies:
Round-robin cycles through pods in order.
Least-loaded picks the pod with the fewest active instances.
Random picks a pod at random.
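The three strategies are simple enough to sketch directly; the pod shape below is invented for illustration.

```typescript
// Sketch of the three allocation strategies. Field names are illustrative.
type Pod = { id: string; activeInstances: number };

function leastLoaded(pods: Pod[]): Pod {
  // Pick the pod with the fewest active agent instances.
  return pods.reduce((a, b) => (b.activeInstances < a.activeInstances ? b : a));
}

function roundRobin(pods: Pod[], counter: number): Pod {
  // Cycle through pods in order using a monotonically increasing counter.
  return pods[counter % pods.length];
}

function randomPod(pods: Pod[]): Pod {
  return pods[Math.floor(Math.random() * pods.length)];
}

const pods: Pod[] = [
  { id: "pod-a", activeInstances: 3 },
  { id: "pod-b", activeInstances: 1 },
  { id: "pod-c", activeInstances: 5 },
];
console.log(leastLoaded(pods).id);   // → pod-b
console.log(roundRobin(pods, 4).id); // → pod-b (4 % 3 = 1)
```

Least-loaded is usually the right default for long-lived agent sessions, since a session stays on its pod for its whole lifetime and imbalances don't self-correct.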
On the chosen pod, Regina creates a new agent instance in its own thread with its own model connection, tools, and virtual filesystem. The instance registers itself in the session store so the coordinator can route all future messages to it.
Every message from the user goes through the coordinator, which checks the session store, forwards it to the right pod, and returns the response. For the user, it feels like one continuous conversation, no matter what happens behind the scenes.
Keeping sessions alive
An agent instance might stop because of a pod crash, a rolling deployment, or idle suspension to save resources. In every case, the conversation needs to be saved. Regina does this by backing up each instance’s virtual filesystem to shared storage. Three backends are supported:
S3 for production deployments, also compatible with MinIO and Cloudflare R2
Redis for smaller deployments, storing the filesystem as a hash entry
Filesystem for shared volumes like NFS or EFS
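A pluggable backend like this is naturally expressed as a small interface. The sketch below uses an in-memory map standing in for S3, Redis, or a shared filesystem; the interface and names are assumptions, not Regina's actual API.

```typescript
// Sketch of a pluggable backup backend. The interface is illustrative;
// an in-memory Map stands in for S3 / Redis / a shared filesystem.
interface BackupStore {
  save(sessionId: string, snapshot: Buffer): Promise<void>;
  load(sessionId: string): Promise<Buffer | undefined>;
}

class MemoryBackupStore implements BackupStore {
  private blobs = new Map<string, Buffer>();
  async save(sessionId: string, snapshot: Buffer): Promise<void> {
    this.blobs.set(sessionId, snapshot);
  }
  async load(sessionId: string): Promise<Buffer | undefined> {
    return this.blobs.get(sessionId);
  }
}

// On suspension or shutdown, the runtime would serialize the virtual
// filesystem and hand the bytes to whichever backend is configured.
(async () => {
  const store: BackupStore = new MemoryBackupStore();
  await store.save("session-42", Buffer.from("vfs snapshot"));
  console.log((await store.load("session-42"))?.toString()); // → vfs snapshot
})();
```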
Idle suspension: If no messages come in for five minutes, Regina backs up the virtual filesystem and stops the agent thread. When a new message arrives, the coordinator sends it to the same pod. Regina notices the instance is suspended, restores the backup, restarts the thread, and the conversation continues right where it left off.
Graceful shutdown: During rolling deployments or scale-downs, Regina backs up all active instances before the pod shuts down. Nothing is lost.
Crash recovery: If a pod crashes without a graceful shutdown, only the last backup is available. After 30 seconds, the pod’s keys expire in Redis, and the coordinator finds the orphaned session: the instance mapping is there, but the pod is gone. The coordinator picks a healthy pod, forwards the request, and Regina restores the virtual filesystem from shared storage and restarts the agent thread. The conversation continues on the new pod without any interruption for the user.
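The coordinator's routing decision across these three cases reduces to a small function. This is a sketch; the names and return shape are invented, and the real coordinator also triggers the backup restore as part of the recover path.

```typescript
// Sketch of the coordinator's per-message routing decision.
type Route =
  | { kind: "forward"; pod: string } // session's pod is alive: forward as usual
  | { kind: "recover"; pod: string } // orphaned session: restore backup elsewhere
  | { kind: "new" };                 // unknown session: allocate a pod

function routeMessage(
  sessionPod: string | undefined,
  healthyPods: string[],
): Route {
  if (sessionPod === undefined) return { kind: "new" };
  if (healthyPods.includes(sessionPod)) return { kind: "forward", pod: sessionPod };
  // The session mapping exists but the pod's Redis keys have expired:
  // pick a healthy pod, restore the filesystem backup, restart the thread.
  return { kind: "recover", pod: healthyPods[0] };
}

console.log(routeMessage("pod-a", ["pod-a", "pod-b"])); // forward to pod-a
console.log(routeMessage("pod-a", ["pod-b"]));          // recover on pod-b
```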
API
The coordinator provides a REST API for agent discovery, session management, and chat.
Agent discovery gathers definitions from all registered pods and returns a deduplicated list, so clients always know which agents are available across the cluster.
Session management includes creating, deleting, and listing instances. New instances are placed using the chosen allocation strategy. Listing shows instances across all pods for a given agent definition.
Chat supports two modes: synchronous, which returns a single JSON response, and streaming, which delivers tokens as NDJSON. In streaming mode, the coordinator passes the response directly from the pod without buffering.
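Consuming the streaming mode amounts to splitting the byte stream on newlines and parsing each line as JSON. The sketch below works over any async iterable of text chunks (for example, a fetch body piped through a TextDecoderStream); the event shape is an assumption, not the documented API.

```typescript
// Sketch of an NDJSON consumer for the streaming chat mode.
// Buffers partial lines, since network chunks can split mid-line.
async function* ndjson(chunks: AsyncIterable<string>): AsyncGenerator<unknown> {
  let buf = "";
  for await (const chunk of chunks) {
    buf += chunk;
    let nl: number;
    while ((nl = buf.indexOf("\n")) >= 0) {
      const line = buf.slice(0, nl).trim();
      buf = buf.slice(nl + 1);
      if (line) yield JSON.parse(line);
    }
  }
  const tail = buf.trim();
  if (tail) yield JSON.parse(tail); // flush a final line without a newline
}

// Demo with in-memory chunks split mid-line, as a network stream might be.
async function* demoChunks() {
  yield '{"token":"Hel';
  yield 'lo"}\n{"token":" world"}\n';
}

(async () => {
  for await (const ev of ndjson(demoChunks())) console.log(ev);
  // → { token: 'Hello' } then { token: ' world' }
})();
```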
Part 2: eBPF Sandbox. Security Enforcement for Agent-Generated Code
The problem
Reliability is only half of what’s needed in production. The other half is security.
An agent that can run arbitrary code, install packages, execute shell commands, and call external APIs is effectively untrusted software running on your infrastructure. The code is generated at runtime, changes with every request, and the agent decides on its own what to run. For many teams, this risk keeps agents in demo environments instead of production, because existing controls weren’t built for this: cgroups and network controls at the CNI or service-mesh layer matter, but agent workloads need a different control model. The key difference is that agents are processes with changing intent, not static applications with known behaviour at deploy time. Standard controls assume you know what a process will do when you deploy it; agents don’t have that property.
The requirements are specific: only certain outbound destinations should be allowed, only certain binaries should run after startup, resource limits should apply to each agent process, and policies need to be able to tighten while the process is running, not just at container start.
That’s exactly what eBPF Sandbox is designed to handle.
What it does
eBPF Sandbox is a Linux tool for isolating processes. It uses Linux namespaces for user, mount, and PID isolation, cgroups for resource limits, eBPF hooks for runtime policy enforcement, and seccomp to block critical syscalls. A small client/server control plane sets up and activates each sandbox.
The main reason to use eBPF is that it enforces policy directly in the kernel, not through a wrapper library, special runtime, or sidecar. Once the sandbox is active, the same boundaries apply to the whole process tree, including any child processes started by agent-generated code.
The system has two parts: a client-side launcher that prepares and starts sandboxed processes, and a server-side daemon that manages cgroups, loads eBPF programs, and activates policies.
Namespaces control what the process can see. Cgroups control what it can use. eBPF and seccomp control what it can do.
A policy example
A sandbox policy brings together process, network, and resource controls in one definition. The main question it answers is: what should this process be allowed to see, use, run, and access?
{
  "presets": ["posix-ro", "node"],
  "network": {
    "rules": [
      { "action": "deny", "destination": "169.254.169.254", "note": "block metadata service" },
      { "action": "allow", "destination": "*.anthropic.com", "port": 443 },
      { "action": "allow", "destination": "10.0.0.0/8" }
    ]
  },
  "resources": {
    "memoryLimit": "1G",
    "cpuLimit": "50000 100000"
  }
}
This policy lets the sandbox call a specific external API over HTTPS, reach internal services on the private network, block access to the instance metadata service, use an approved runtime and a minimal set of POSIX tools, and stay within set resource limits.
Presets like posix-ro and node are bundles of common permissions, so you don’t need long binary allowlists. They make policies easier to read and reuse across agent definitions.
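Preset expansion is essentially a deduplicated union of permission bundles. The sketch below illustrates the idea; the preset contents and binary paths are invented for illustration, not the actual definitions of posix-ro and node.

```typescript
// Sketch of preset expansion into a concrete binary allowlist.
// Preset contents here are invented for illustration.
const PRESETS: Record<string, string[]> = {
  "posix-ro": ["/bin/ls", "/bin/cat", "/usr/bin/env"],
  node: ["/usr/bin/node", "/usr/bin/npm"],
};

function expandPresets(presets: string[]): string[] {
  // Union of all binaries granted by the named presets, deduplicated.
  return [...new Set(presets.flatMap((p) => PRESETS[p] ?? []))];
}

console.log(expandPresets(["posix-ro", "node"]));
// → [ '/bin/ls', '/bin/cat', '/usr/bin/env', '/usr/bin/node', '/usr/bin/npm' ]
```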
No pre-enforcement execution window
The most important security feature is sequencing. A naive approach starts the process, moves it into the right cgroup, attaches enforcement, and hopes nothing happens in the gap. Even a small timing window lets a process open a socket, fork a child, or act outside the intended limits.
eBPF Sandbox removes this timing window completely. Enforcement is active before the sandboxed command runs its first instruction. The isolated filesystem is ready before the process starts, so there’s no unprotected startup phase.
The runtime view
To the process, the sandbox looks like a minimal runtime environment, not a full container image or host. It provides:
/workspace as its writable working directory
/usr, /lib, and /lib64 as read-only runtime dependencies
/bin and /sbin as read-only entry points
A minimal /etc for name resolution and basic user lookups
/proc for process introspection
/tmp as a private tmpfs
You can expose extra host paths as read-only mounts if a workload needs access to certain data or configuration. The idea is to expose only what the process really needs, and only as read-only whenever possible.
Network policy
The network model works like security group rules, but applied at the process level instead of the container or pod level.
IP and CIDR rules handle the straightforward cases: blocking the metadata service, allowing RFC1918 ranges, or denying all outbound except specific destinations.
Hostname rules are more useful for agent workloads. The kernel doesn’t see “connect to api.example.com”; it sees “connect to this IP.” For TLS traffic, the sandbox checks the hostname in the Server Name Indication (SNI) field of the TLS handshake. This makes policies like allowing *.anthropic.com and denying everything else accurate and reliable.
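The wildcard semantics can be sketched as a small matcher. This is an illustration only; it assumes the common TLS-certificate convention that a leading * matches exactly one DNS label, which may differ from the sandbox's actual rule semantics.

```typescript
// Sketch of wildcard hostname matching as applied to the SNI value.
// Assumption: "*" matches exactly one leading label, as in TLS certificates.
function matchesRule(pattern: string, hostname: string): boolean {
  if (!pattern.startsWith("*.")) return pattern === hostname; // exact match
  const suffix = pattern.slice(1); // "*.anthropic.com" → ".anthropic.com"
  if (!hostname.endsWith(suffix)) return false;
  const label = hostname.slice(0, hostname.length - suffix.length);
  return label.length > 0 && !label.includes("."); // exactly one label
}

console.log(matchesRule("*.anthropic.com", "api.anthropic.com")); // → true
console.log(matchesRule("*.anthropic.com", "a.b.anthropic.com")); // → false
console.log(matchesRule("*.anthropic.com", "anthropic.com"));     // → false
```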
This matters because many services sit behind shared infrastructure like CDNs, cloud load balancers, and anycast front doors. At the IP level, different hostnames look the same. TLS hostname inspection lets you write policies based on the services you actually want to allow, which is usually what teams mean by network policy.
Policy is treated as data, not something fixed at process startup. The daemon writes policy into the kernel, and changes can take effect without restarting the sandboxed process.
This is important for agent workflows where permissions need to change during a session. For example, you might allow a package download during setup, remove that access once the environment is ready, or temporarily allow an internal service for a specific step, then revoke it. Static policy at container start can’t handle this, but live policy updates can.
The system also supports a global policy ceiling. The platform team sets the outer boundary once, and individual sandbox policies can be more restrictive but never more permissive. Application teams can narrow the boundary for their workloads, but can’t go beyond the platform limit.
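For resource limits, "more restrictive but never more permissive" reduces to taking the minimum of each bound. The sketch below illustrates that for two invented fields; the real ceiling also covers network and process rules, where the same idea applies as set intersection.

```typescript
// Sketch of a policy ceiling for resource limits: a sandbox policy may
// tighten the platform bound but never exceed it. Field names invented.
type Limits = { memoryBytes: number; cpuQuota: number };

function applyCeiling(ceiling: Limits, requested: Limits): Limits {
  return {
    memoryBytes: Math.min(ceiling.memoryBytes, requested.memoryBytes),
    cpuQuota: Math.min(ceiling.cpuQuota, requested.cpuQuota),
  };
}

const platform: Limits = { memoryBytes: 2 * 1024 ** 3, cpuQuota: 100_000 };
const effective = applyCeiling(platform, {
  memoryBytes: 4 * 1024 ** 3, // asks for 4G: clamped to the 2G ceiling
  cpuQuota: 50_000,           // tighter than the ceiling: kept as-is
});
console.log(effective);
```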
Kubernetes deployment
In Kubernetes, the sandbox daemon runs as a DaemonSet: one daemon per node. This setup works because the daemon needs direct node access to manage host processes, create and manage cgroups, and load and maintain eBPF programs and maps. The DaemonSet pattern gives it that access without needing privileged containers for the agent workloads.
Why do these two components belong together?
Reliability and security go hand in hand. An agent system that recovers from pod failures but runs code without enforcement isn’t ready for production. A system with strong sandboxing but weak session routing creates other operational problems.
Regina Coordinator and eBPF Sandbox are built to work together in production. The coordinator keeps agent sessions running through the infrastructure events that always happen on Kubernetes. The sandbox makes sure the code those agents run is safe on your own infrastructure.
Both tools run inside Watt. There are no managed service tradeoffs, no black-box runtime, and no vendor lock-in for your execution environment. You keep control of your infrastructure, and these components help make that control practical.
For more details on how agent instances work, including definitions, tools, the AI loop, and session persistence, see the companion article. Documentation for both components is at docs.platformatic.dev.