Making GraphQL Subscriptions Resilient to WebSocket Failures

GraphQL Subscriptions let clients receive live updates directly from the server. Instead of polling with queries, a subscription keeps an open connection so the server can push new results the moment something changes.
This makes them ideal for soft real-time applications like chat systems, live dashboards, or notifications, all while keeping the benefits of GraphQL’s type safety and query flexibility.
Under the hood, subscriptions usually rely on WebSockets, which allow the client and server to maintain a continuous two-way conversation as long as the connection stays open.
GraphQL server and client example

Let’s build a simple chat with subscriptions in Node.js, using Fastify and Mercurius on the server and a plain WebSocket client.
Server
Here is our server-simple.js, complete code at https://github.com/platformatic/blog-graphql-subscription/blob/main/src/server-simple.js.
import { randomUUID } from 'node:crypto';
import Fastify from 'fastify';
import mercurius from 'mercurius';

const app = Fastify();

const schema = `
  type Message {
    id: ID!
    text: String!
    user: String!
    at: String!
  }
  type Query {
    messages: [Message]
  }
  type Mutation {
    sendMessage(text: String!, user: String!): Message
  }
  type Subscription {
    onMessage: Message
  }
`;

// In-memory storage, enough for this example
const storage = { messages: [] };

const resolvers = {
  Query: {
    messages: () => storage.messages,
  },
  Mutation: {
    sendMessage: async (_, { text, user }, { pubsub }) => {
      const message = {
        id: randomUUID(),
        text,
        user,
        at: new Date().toISOString(),
      };
      storage.messages.push(message);
      // Publish to subscription
      await pubsub.publish({
        topic: 'MESSAGE_SENT',
        payload: { onMessage: message },
      });
      return message;
    },
  },
  Subscription: {
    onMessage: {
      subscribe: async (_, __, { pubsub }) => {
        return pubsub.subscribe('MESSAGE_SENT');
      },
    },
  },
};

app.register(mercurius, {
  schema,
  resolvers,
  subscription: true,
});

// ...

await app.listen({ port: 4000, host: '0.0.0.0' });
Whenever sendMessage is called, the server publishes the new message, and every client subscribed to onMessage receives it.
Client
Our client, client.js, is straightforward as well; complete code at https://github.com/platformatic/blog-graphql-subscription/blob/main/src/client.js
import WebSocket from 'ws';

export class GraphQLClient {
  // Minimal constructor, assumed for this excerpt; see the complete code linked above
  constructor(url = 'ws://localhost:4000/graphql') {
    this.url = url;
    this.ws = null;
    this.connected = false;
    this.subscriptions = new Map(); // subscription id -> message handler
    this.nextSubscriptionId = 1;
  }

  async connect() {
    return new Promise((resolve, reject) => {
      this.ws = new WebSocket(this.url, 'graphql-ws');
      this.ws.on('open', () => {
        // Initialize connection
        const payload = {
          type: 'connection_init',
          payload: {},
        };
        this.ws.send(JSON.stringify(payload));
      });
      this.ws.on('message', (data) => {
        const msg = JSON.parse(data.toString());
        if (msg.type === 'connection_ack') {
          this.connected = true;
          resolve();
        } else if (msg.type === 'data' && msg.payload?.data?.onMessage) {
          // Dispatch to the handler registered for this subscription id
          const handler = this.subscriptions.get(msg.id);
          handler?.(msg.payload.data.onMessage);
        } else if (msg.type === 'error') {
          reject(new Error(JSON.stringify(msg.payload)));
        }
      });
      // Fail the connect() promise if the socket cannot be opened
      this.ws.on('error', reject);
    });
  }

  subscribe(onMessage) {
    const subscriptionId = String(this.nextSubscriptionId++);
    // Send subscription
    this.ws.send(
      JSON.stringify({
        id: subscriptionId,
        type: 'start',
        payload: {
          query: `subscription {
            onMessage { id, text, user, at }
          }`,
        },
      }),
    );
    this.subscriptions.set(subscriptionId, onMessage);
    return subscriptionId;
  }

  async sendMessage(user, text) {
    const mutation = `
      mutation SendMessage($text: String!, $user: String!) {
        sendMessage(text: $text, user: $user) { id, text, user, at }
      }`;
    const response = await fetch('http://localhost:4000/graphql', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        query: mutation,
        variables: { text, user },
      }),
    });
    const result = await response.json();
    if (result.errors) {
      throw new Error(result.errors[0].message);
    }
    return result.data.sendMessage;
  }
}
This setup works fine: clients receive real-time messages as soon as they’re published.
What About Reliability?
So far, so good. But what happens if the server crashes or restarts? What if the network drops, even briefly?
In those cases, the client silently misses messages, without knowing they were lost, and that’s a serious problem.
The main reason is that the WebSocket protocol has no built-in message delivery guarantees.
The possible semantics are:
At most once: fire-and-forget, no acknowledgment.
At least once: requires an ACK, but duplicates are possible.
Exactly once: guaranteed single delivery with handshakes.
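In practice, WebSockets behave like at-most-once: TCP retransmits lost packets while the connection is alive, but anything published while the connection is broken or being re-established is gone.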
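The difference between these semantics can be made concrete with a small sketch. Assuming every message carries a unique id (as the Message type above does), a receiver can turn at-least-once delivery into effectively-once processing by discarding ids it has already seen. The createDeduplicator helper below is hypothetical and not part of any library used in this article.

```javascript
// Hypothetical helper: wraps a handler so that each message id is processed once,
// even if the transport delivers the same message several times.
function createDeduplicator(handler) {
  const seen = new Set();
  return function receive(message) {
    if (seen.has(message.id)) {
      return false; // duplicate: already processed, ignore it
    }
    seen.add(message.id);
    handler(message);
    return true;
  };
}

// Usage: the second delivery of id "a" is dropped.
const processed = [];
const receive = createDeduplicator((message) => processed.push(message.text));
receive({ id: 'a', text: 'hello' });
receive({ id: 'a', text: 'hello' }); // redelivery after a retry
receive({ id: 'b', text: 'world' });
console.log(processed); // [ 'hello', 'world' ]
```

The set of seen ids grows without bound in this sketch; a long-running receiver would need to cap it, for example by keeping only a recent window of ids.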
GraphQL Subscriptions Are Stateful
With Queries and Mutations, each request is stateless: the client sends a request, the server responds, and the interaction is done. If the network drops, the client can retry without losing context.
Subscriptions, on the other hand, rely on a long-lived WebSocket connection. Because WebSockets don’t provide delivery guarantees, the stream of updates is interrupted if the connection breaks.
This lack of delivery guarantees makes subscriptions effectively stateful: both client and server must maintain a shared understanding of the connection’s state. If that state is lost, so are the updates. The server may still think the client is subscribed, while the client may assume the server has nothing new to send, and in both cases, the client silently misses messages.
Subscriptions are tricky because you can’t just “retry” the last request. You need a mechanism to detect a lost state and re-establish it.

This problem becomes even more critical in distributed systems, where write operations (Mutations) and notifications (Subscriptions) are often handled asynchronously by different services. In these cases, a message can be successfully accepted and stored while the corresponding notification is never delivered, leading to silent data loss.
Making Subscriptions Reliable
To make subscriptions reliable, we need to strengthen them on multiple fronts:
On the GraphQL service: subscriptions must be resumable, so if a client disconnects or the server restarts, the client can reconnect without losing messages. This requires introducing cursors or checkpoints that allow the stream to pick up exactly where it left off.
On the infrastructure: reliability depends on monitoring the internal connections between services and the GraphQL layer. If the server crashes, becomes unresponsive, or the network fails, the backend should detect the problem quickly, restart the subscription, and guarantee that message delivery continues seamlessly.
On the client: the responsibilities mirror those of the service. A well-behaved client monitors its own WebSocket connection, detects interruptions, and resubscribes automatically. Tracking the last message it processed allows the client to request only what it missed, ensuring consistency even through transient failures.
From Stateful to Stateless Subscriptions
To overcome GraphQL's limitation on subscriptions, we shifted subscriptions from being purely stateful to stateless and resumable by introducing the concept of a cursor. A cursor is a lightweight marker that identifies the last message a client has successfully processed. It can be as simple as a monotonically increasing message ID, a timestamp, or an offset.
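As an illustration of the cursor idea, here is a minimal sketch of an in-memory log that uses a monotonically increasing sequence number as the cursor. The createMessageLog helper and the seq field are assumptions made for this example; they do not appear in the servers shown in this article, which identify messages by id.

```javascript
// Hypothetical in-memory log that assigns a monotonically increasing sequence
// number to every message, so the sequence number can serve as the cursor.
function createMessageLog() {
  const messages = [];
  let nextSeq = 1;
  return {
    append(text, user) {
      const message = { seq: nextSeq++, text, user };
      messages.push(message);
      return message;
    },
    // Everything strictly after the cursor; a missing cursor means "from the start".
    after(cursor = 0) {
      return messages.filter((message) => message.seq > cursor);
    },
  };
}

const log = createMessageLog();
log.append('one', 'alice');
log.append('two', 'bob');
log.append('three', 'alice');

// A client that last processed seq 1 asks for what it missed.
const missed = log.after(1).map((message) => message.text);
console.log(missed); // [ 'two', 'three' ]
```

An ordered cursor like this can still be compared after the referenced message has been trimmed from the log, which a lookup by opaque id cannot do.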
server-with-resume.js is similar to server-simple.js, but its subscription is resumable; full code at https://github.com/platformatic/blog-graphql-subscription/blob/main/src/server-with-resume.js
const schema = `
  ...
  type Subscription {
    onMessage(id: String): Message
  }
`;

// ...

  Subscription: {
    onMessage: {
      subscribe: async (_, { id }, { pubsub }) => {
        // Subscribe to live messages first, so that nothing published
        // while we replay older messages is missed
        const live = await pubsub.subscribe('MESSAGE_SENT');
        if (!id) {
          return live;
        }
        // If an id is provided, send all messages after that id first.
        // Find the index of the message with the given id
        const startIndex = storage.messages.findIndex((msg) => msg.id === id);
        if (startIndex === -1) {
          // Unknown cursor: fall back to live updates only
          return live;
        }
        const messagesToSend = storage.messages.slice(startIndex + 1);
        // Create a custom async iterator that first sends existing messages,
        // then continues with the live ones
        return (async function* () {
          const replayed = new Set();
          for (const message of messagesToSend) {
            replayed.add(message.id);
            yield { onMessage: message };
          }
          for await (const message of live) {
            // Skip live messages that the replay has already delivered
            if (replayed.has(message.onMessage.id)) continue;
            yield message;
          }
        })();
      },
    },
  },
};
End-to-end guarantees with Client and Server
Reconnection and resume logic on the Client
For a system that can withstand network interruptions or server restarts, the client must reconnect gracefully and continue without losing messages.
Hardening the Client
A resilient client should:
Monitor its WebSocket connection and automatically reconnect if it drops
Track the last message that has been successfully processed
Resubscribe with that cursor after reconnecting, so that any missed messages are replayed
This shifts subscriptions away from a fragile, connection-bound state toward a more reliable, stateless flow.
Keep the Connection On
The client's first responsibility is simply to keep its WebSocket connection alive. This means actively detecting when the connection drops (whether due to a network hiccup, server restart, or idle timeout) and re-establishing it without user intervention.
A common strategy is to send periodic pings or rely on heartbeat signals to confirm the server is still responsive. If no response arrives within a given timeframe, the client closes the stale connection and immediately attempts to reconnect. By treating the WebSocket as something that must be continuously monitored, the client avoids situations where it appears connected but no longer receives any data.
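The heartbeat strategy described above can be sketched independently of any WebSocket library. In this hypothetical createHeartbeat helper, sendPing and terminate are callbacks supplied by the caller (for example, wrappers around ws.ping() and ws.terminate()); the names are assumptions for illustration.

```javascript
// Hypothetical liveness tracker. `sendPing` and `terminate` are callbacks supplied
// by the caller (for example wrapping ws.ping() and ws.terminate()).
function createHeartbeat({ sendPing, terminate }) {
  let alive = true;
  return {
    // Call whenever a pong (or any data) arrives from the server.
    markAlive() {
      alive = true;
    },
    // Call on a timer, e.g. setInterval(heartbeat.tick, 30_000).
    tick() {
      if (!alive) {
        terminate(); // no response since the previous ping: connection is stale
        return false;
      }
      alive = false; // set back to true by the next pong
      sendPing();
      return true;
    },
  };
}

// Simulated run: the server answers the first ping but not the second.
const events = [];
const heartbeat = createHeartbeat({
  sendPing: () => events.push('ping'),
  terminate: () => events.push('terminate'),
});
heartbeat.tick();      // sends ping #1
heartbeat.markAlive(); // pong received
heartbeat.tick();      // sends ping #2
heartbeat.tick();      // no pong in between: terminate
console.log(events); // [ 'ping', 'ping', 'terminate' ]
```

With this scheme a dead connection is detected after at most two timer intervals, so the interval is a trade-off between detection latency and ping traffic.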
Replay Missed Messages
Reconnecting alone is not enough. To ensure continuity, the client needs to restore its subscription state:
Each message includes a cursor, such as an ID, timestamp, or offset
The client remembers the last cursor it processed
On reconnect, it resubscribes with that cursor (for example: “send me messages after ID 42”)
The server replays the missed messages first, then seamlessly switches back to live updates
With this approach, subscriptions become stateless and resumable. Even in the face of crashes or network drops, the client receives continuous, uninterrupted messages without data loss.
Here is a simplified example of a client that handles reconnection and resumes its subscription; full implementation at https://github.com/platformatic/blog-graphql-subscription/blob/main/src/client.js
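import WebSocket from 'ws';

class GraphQLClient {
  // Minimal constructor, assumed for this excerpt; see the full implementation linked above
  constructor(url = 'ws://localhost:4000/graphql') {
    this.url = url;
    this.connected = false;
    this.reconnecting = false;
    this.lastMessageId = null;
  }

  async connect() {
    return new Promise((resolve, reject) => {
      this.ws = new WebSocket(this.url, 'graphql-ws');
      this.ws.on('open', () => {
        this.ws.send(JSON.stringify({
          type: 'connection_init',
          payload: {},
        }));
      });
      this.ws.on('message', (data) => {
        const msg = JSON.parse(data.toString());
        if (msg.type === 'connection_ack') {
          this.connected = true;
          this.isAlive = true;
          this.startPing();
          // Resume from the last processed message, if there is one
          this.subscribe(this.lastMessageId);
          resolve();
        } else if (msg.type === 'data' && msg.payload?.data?.onMessage) {
          const message = msg.payload.data.onMessage;
          this.lastMessageId = message.id;
          console.log('Received message:', message);
        }
      });
      this.ws.on('pong', () => {
        this.isAlive = true;
      });
      // Without an 'error' listener, a failed connection attempt would throw
      this.ws.on('error', reject);
      this.ws.on('close', () => {
        this.connected = false;
        this.stopPing();
        // No-op if the promise has already been settled
        reject(new Error('Connection closed'));
        if (!this.reconnecting) {
          this.reconnect();
        }
      });
    });
  }

  startPing() {
    this.pingInterval = setInterval(() => {
      if (!this.ws || !this.connected) return;
      if (!this.isAlive) {
        // No pong since the last ping: drop the stale connection
        this.ws.terminate();
        return;
      }
      this.isAlive = false;
      this.ws.ping();
    }, 30_000);
  }

  stopPing() {
    clearInterval(this.pingInterval);
    this.pingInterval = null;
  }

  async reconnect() {
    if (this.reconnecting) return;
    this.reconnecting = true;
    try {
      // Keep retrying until the server is reachable again
      while (!this.connected) {
        await new Promise((resolve) => setTimeout(resolve, 1000));
        try {
          await this.connect();
        } catch {
          // Server still unavailable: try again on the next iteration
        }
      }
    } finally {
      this.reconnecting = false;
    }
  }

  subscribe(fromMessageId = null) {
    const query = `subscription ${fromMessageId ? 'OnMessageFromId($id: String)' : ''} {
      onMessage${fromMessageId ? '(id: $id)' : ''} {
        id
        text
        user
        at
      }
    }`;
    this.ws.send(JSON.stringify({
      id: '1', // operation id, required by the protocol
      type: 'start',
      payload: {
        query,
        ...(fromMessageId && { variables: { id: fromMessageId } }),
      },
    }));
  }
}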
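The client above waits a fixed 1000 ms before reconnecting. When many clients lose the same server at once, a common refinement is exponential backoff with jitter, so that they do not all reconnect at the same instant. The following is a sketch of that idea only; the reconnectDelay helper and its baseMs and maxMs defaults are assumptions, not part of the linked implementation.

```javascript
// Hypothetical helper: delay before reconnect attempt number `attempt` (0-based).
// `random` is injectable so the function can be tested deterministically.
function reconnectDelay(attempt, { baseMs = 1000, maxMs = 30_000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  // "Full jitter": pick a point between 0 and the exponential ceiling.
  return Math.floor(random() * ceiling);
}

// With the random source pinned to 1, the exponential ceilings become visible:
const ceilings = [0, 1, 2, 3, 10].map((attempt) =>
  reconnectDelay(attempt, { random: () => 1 }),
);
console.log(ceilings); // [ 1000, 2000, 4000, 8000, 30000 ]
```

In the reconnect loop, the fixed timeout could be replaced by a call such as reconnectDelay(attempt), with the attempt counter reset once a connection succeeds.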
Can we harden against Server crashes and shutdowns, too?
Relying on clients to reconnect is often problematic: some will never implement the resume flow. Yet we must still be able to restart and horizontally scale the WebSocket servers without impacting clients. How can we handle those situations? By adding a proxy.

Why Do We Need a Proxy?
An intermediate proxy service takes over reliability concerns from the GraphQL server. By sitting between clients and the backend, the proxy ensures connections are monitored, recovered, and replayed, enabling scalability even in the face of failures.
The proxy monitors WebSocket connections for signs of trouble, such as server crashes, network drops, or idle timeouts. If the connection is lost, it automatically reconnects to the GraphQL server. On reconnect, the proxy transparently resends the subscription operation so that the state is re-established and the client continues receiving updates as if nothing happened.
This approach ensures that transient issues don’t interrupt the subscription stream. Clients remain connected to the proxy, unaware of backend failures, while the proxy handles the heavy lifting of detecting, repairing, and restoring the connection.
Centralized Reconnection for All Clients
The proxy keeps the connection to the GraphQL service active even when the service crashes, restarts, becomes unresponsive, or suffers from network interruptions. Without the proxy, these failures would immediately affect every connected client. Instead, the proxy absorbs the problem and maintains the client connection, so the disruption never reaches the end user.
The reconnection logic is the same as what a well-designed client would implement, but here the proxy applies it on behalf of all connected clients and subscriptions. This centralizes the complexity, ensuring consistent reliability without requiring every client to manage its own recovery logic.
Resume Subscriptions and Track State
Reconnecting to the server is only half the solution. To avoid losing messages during downtime, the proxy must also track the state of each subscription. It does this by recording the last message successfully delivered to every client, using a cursor such as an ID, offset, or timestamp.
When the proxy reconnects to the GraphQL service, it resubscribes with that cursor and requests all messages that came after it. The server replays the missed events, and once the stream is caught up, the proxy switches back to live updates.
From the client’s perspective, the feed is seamless: no duplicates, no gaps, and no interruptions, even if the backend was down in the middle of message delivery.
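To make the bookkeeping concrete, here is a simplified model of per-client subscription tracking. It is a hypothetical sketch, not the implementation of any library: createSubscriptionTracker and its method names are invented for illustration, and it assumes the tracked query already accepts the cursor as a variable.

```javascript
// Hypothetical, simplified model of per-client subscription tracking.
// `field` is the subscription field name, `key` the cursor property of its payload.
function createSubscriptionTracker({ field, key }) {
  const clients = new Map(); // clientId -> Map(subscriptionId -> state)

  return {
    add(clientId, subscriptionId, query, variables = {}) {
      if (!clients.has(clientId)) clients.set(clientId, new Map());
      clients.get(clientId).set(subscriptionId, { query, variables, cursor: null });
    },
    // Record the cursor of the last message forwarded to the client.
    update(clientId, subscriptionId, data) {
      const state = clients.get(clientId)?.get(subscriptionId);
      if (state && data?.[field]) state.cursor = data[field][key];
    },
    // Build the "start" frames to resend upstream after a reconnect.
    restore(clientId) {
      const frames = [];
      for (const [id, state] of clients.get(clientId) ?? []) {
        const variables = state.cursor === null
          ? state.variables
          : { ...state.variables, [key]: state.cursor };
        frames.push({ id, type: 'start', payload: { query: state.query, variables } });
      }
      return frames;
    },
    remove(clientId) {
      clients.delete(clientId);
    },
  };
}

const tracker = createSubscriptionTracker({ field: 'onMessage', key: 'id' });
const query = 'subscription ($id: String) { onMessage(id: $id) { id text user at } }';
tracker.add('client-1', '1', query);
tracker.update('client-1', '1', { onMessage: { id: '42', text: 'hi' } });

const [frame] = tracker.restore('client-1');
console.log(frame.payload.variables); // { id: '42' }
```

A real proxy also has to handle queries that were written without a cursor argument, which this sketch sidesteps by assuming the variable is already declared.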
Extending Reliability with Kafka
In systems that rely on Apache Kafka, reliability can be enhanced by integrating GraphQL subscriptions directly with Kafka streams.
At Platformatic, we lead the Node.js and Kafka bridge with the @platformatic/kafka driver, ensuring stability and scalability across both worlds.
The service attaches to a Kafka topic and consumes messages in two complementary ways:
Main consumer: runs in latest mode and streams new messages to clients as they arrive
Cursor consumer: runs in cursor mode and is used to resume a subscription from the last known message cursor
When a client reconnects and requests a resume, the service creates a temporary stream for the topic starting from the requested cursor. Once this resuming stream has caught up with the main stream, the temporary stream is closed, and the client is seamlessly switched back to the main stream for live updates.
This pattern ensures that clients never miss messages, even across disconnects, while keeping the live stream efficient and lean.
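The hand-off from the resuming stream to the main stream can be sketched with plain iterables. This is an illustration under stated assumptions: records are objects with a numeric offset from a single partition, and both streams are finite and synchronous here, whereas real consumers are asynchronous and unbounded. The resumeThenLive function is hypothetical and does not use the @platformatic/kafka API.

```javascript
// Hypothetical sketch of the hand-off. `replay` yields historical records starting
// after the client's cursor; `live` yields records from the main consumer.
function* resumeThenLive(replay, live, cursor) {
  let lastOffset = cursor;
  for (const record of replay) {
    if (record.offset <= lastOffset) continue; // already seen by the client
    lastOffset = record.offset;
    yield record;
  }
  // The replay is exhausted, i.e. caught up: switch to the live stream and
  // skip whatever the replay has already delivered.
  for (const record of live) {
    if (record.offset <= lastOffset) continue;
    lastOffset = record.offset;
    yield record;
  }
}

const replay = [{ offset: 3 }, { offset: 4 }, { offset: 5 }];
const live = [{ offset: 5 }, { offset: 6 }]; // offset 5 arrived on both streams
const delivered = [...resumeThenLive(replay, live, 2)].map((record) => record.offset);
console.log(delivered); // [ 3, 4, 5, 6 ]
```

Tracking the last delivered offset is what keeps the switch free of both gaps and duplicates: the overlap between the two streams is simply skipped.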
Building the Proxy
To make our solution as agnostic as possible to the underlying GraphQL server, we introduced an intermediate proxy service using @fastify/http-proxy.
The proxy sits between the client and the GraphQL server. Its job is to:
Monitor WebSocket connections
Detect disconnections or unresponsive targets
Track subscription state and message cursors
Automatically reconnect and resume subscriptions

How it Works
The client connects to the proxy and subscribes to a GraphQL subscription
The proxy forwards the subscription to the actual GraphQL server
If the server crashes, shuts down, or becomes unresponsive, the proxy detects it using ping/pong messages
By default, the connection is pinged every 30 seconds
If no pong or data arrives in time, the connection is marked as broken
The proxy closes and reopens the WebSocket to the server
During recovery, the client remains connected to the proxy without noticing any disruption
Once the server is back online, the proxy restores the subscription from the last known message cursor, ensuring no messages are lost
This makes the whole mechanism transparent to the client: the proxy automatically absorbs the failures and resumes the stream.
The proxy itself is a Fastify service with proxy and WebSocket capabilities. It forwards subscriptions to the GraphQL server, tracks active subscriptions, and stores message identifiers to replay missed messages when the backend recovers.
Implementation Example
Here is a minimal example of how we use the proxy together with @platformatic/graphql-subscriptions-resume to track subscriptions and recover from failures:
proxy.js, complete code at https://github.com/platformatic/blog-graphql-subscription/blob/main/src/proxy.js
import fastify from 'fastify';
import fastifyHttpProxy from '@fastify/http-proxy';
import { StatefulSubscriptions } from '@platformatic/graphql-subscriptions-resume';

const state = new StatefulSubscriptions({
  subscriptions: [{ name: 'onMessage', key: 'id' }],
});

const wsHooks = {
  // Remove all the subscriptions for the source/client
  onDisconnect: (_context, source, _target) => {
    state.removeAllSubscriptions(source.id);
  },
  // Restore the subscription state
  onReconnect: (_context, source, target) => {
    state.restoreSubscriptions(source.id, target);
  },
  // Add subscription to the tracking state
  onIncomingMessage: (_context, source, _target, message) => {
    const m = JSON.parse(message.data.toString('utf-8'));
    if (m.type !== 'start') { return; }
    state.addSubscription(source.id, m.payload.query, m.payload.variables);
  },
  // Update the subscription state when a message is forwarded to the client
  onOutgoingMessage: (_context, source, _target, message) => {
    const m = JSON.parse(message.data.toString('utf-8'));
    if (m.type === 'data') {
      state.updateSubscriptionState(source.id, m.payload.data);
    }
  },
};

export async function start() {
  const wsReconnect = {
    logs: true,
    pingInterval: 30_000,
    reconnectOnClose: true,
  };
  // Declared with const: an undeclared assignment throws in an ES module
  const app = fastify();
  app.register(fastifyHttpProxy, {
    upstream: 'http://localhost:4000/graphql',
    prefix: '/graphql',
    websocket: true,
    wsUpstream: 'ws://localhost:4000/graphql',
    wsReconnect,
    wsHooks,
  });
}
Simulation
To prove the approach under stress, we ran a simulation that generates massive traffic while an unstable GraphQL service injects failures in the notification path.
The server-unstable.js works much like server-with-resume.js in logic, but its implementation deliberately injects failures. It can simulate an unresponsive server, a dropped connection, or a graceful shutdown (such as during a restart), giving us a controlled way to test how the proxy handles real-world instability.
Full script here https://github.com/platformatic/blog-graphql-subscription/blob/main/src/demo.js
What the simulation does
Floods the system with concurrent mutations and subscription events
Introduces failures on Subscriptions
Keeps the client connected to the proxy while the proxy reconnects upstream and replays missed messages from the last cursor
Tracks end-to-end delivery and checks for gaps
No message loss!
Result
📊 ======================================
📊 DETAILED MESSAGE DELIVERY STATISTICS
📊 ======================================
👥 PER-CLIENT STATISTICS:
═══════════════════════════════════════
🔹 Client 1:
📤 Sent: 203
📥 Received: 2000 ✅
❌ Lost Messages: 0 ✅
📊 Delivery Rate: 100.00% ✅
...
🔹 Client 10:
📤 Sent: 203
📥 Received: 2000 ✅
❌ Lost Messages: 0 ✅
📊 Delivery Rate: 100.00% ✅
👥 OVERALL STATISTICS:
═══════════════════════════════════════
🕐 Runtime: 32.8s
📊 Sent Messages: 2000
📊 Delivery Rate: 100.00% ✅
📊 Duplicated Messages: 0 ✅
⚡ Messages/Second: 61.00
📊 ======================================
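Statistics like the ones above can be derived by comparing the ids that were sent with the ids each client received. The following is a minimal sketch of such a check; the deliveryStats function is hypothetical and is not taken from demo.js.

```javascript
// Hypothetical checker: compares the ids that were sent with the ids one client received.
function deliveryStats(sentIds, receivedIds) {
  const counts = new Map();
  for (const id of receivedIds) {
    counts.set(id, (counts.get(id) ?? 0) + 1);
  }
  const lost = sentIds.filter((id) => !counts.has(id));
  const duplicated = [...counts.values()].filter((n) => n > 1).length;
  const delivered = sentIds.length - lost.length;
  const deliveryRate = sentIds.length === 0 ? 100 : (delivered / sentIds.length) * 100;
  return { sent: sentIds.length, delivered, lost, duplicated, deliveryRate };
}

// Example: "c" never arrived and "b" arrived twice.
const stats = deliveryStats(['a', 'b', 'c', 'd'], ['a', 'b', 'b', 'd']);
console.log(stats.lost);                    // [ 'c' ]
console.log(stats.duplicated);              // 1
console.log(stats.deliveryRate.toFixed(2)); // 75.00
```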
Takeaways
With the proxy in place, subscriptions become stable and resilient:
No more silent message loss: missed updates are detected and replayed
Automatic recovery from server crashes or restarts
Subscriptions stay real-time, but gain the reliability of a stateless request-response cycle
Despite failures in the Subscriptions, every message the system accepted reached the client in order. The proxy handled reconnection and resume, and the cursor logic filled any gaps before switching back to the live stream.
By shifting subscription state management into the proxy, we make GraphQL Subscriptions reliable enough for mission-critical systems without requiring any changes on the client side.
And when this is combined with client-side connection control and resume logic, you can guarantee that no messages are ever lost, achieving true end-to-end reliability.






