iToverDose/Software · 16 MAY 2026 · 20:01

Why WebSocket connections lie—and how to stop silent data failures

A WebSocket can report 'connected' while failing to deliver data for hours, crashing applications without raising alarms. Discover why TCP keepalive misses this silent killer and how to build resilient monitoring.

DEV Community · 4 min read

Two weeks ago, a critical crypto signal API ran uninterrupted in production—until a database check revealed it had missed 22 hours of price updates. The service logs showed green lights. The deployment dashboard flashed all-clear. Even the TCP socket appeared healthy. Yet the WebSocket feed from Binance had gone dormant without a single error. This is the silent staleness problem: a connection that looks alive at every layer except the one that matters.

The issue isn’t rare. Any system consuming long-lived WebSocket streams—price feeds, chat platforms, IoT sensors, or log pipelines—risks this exact failure mode. TCP keepalive can’t save you here. It only verifies the network route, not whether your application data is still flowing.

The myth of "connected"

When a WebSocket connection opens, a TCP handshake establishes the socket. From that moment, "connected" means the network path exists—not that the remote service is actively sending data.

TCP keepalive periodically sends probes to confirm the route remains viable. If the path breaks, the OS eventually closes the socket and raises an error. But TCP keepalive has blind spots:

  • It cannot detect whether the remote application stopped pushing messages
  • It cannot see if a proxy or load balancer dropped your subscription
  • It cannot catch backend bugs that halt event emission while keeping the connection open
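For reference, this is everything TCP keepalive actually configures: OS-level probes on an idle socket. A minimal sketch in Python — the option names are Linux-specific and the intervals are illustrative, not recommendations:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # enable probing
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before reset
# None of these options know anything about WebSocket frames or subscriptions.
```

Every one of these knobs operates below the application layer, which is exactly why a dead subscription sails through them.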

In my case, Binance’s WebSocket gateway accepted the connection, accepted the subscriptions, and then simply stopped sending ticker updates. The TCP socket was perfect. The operating system was fine. The code was correct. The data was gone.

Why standard recovery tactics fail

Developers often try quick fixes that don’t address the root cause:

  • Reconnect on error: Fails because WebSockets can stay open even when no data arrives.
  • Ping the server: Useless if the service doesn’t respond to pings—exchanges often ignore client pings on data streams.
  • Check subscription confirmation: Catches startup failures but not mid-stream stalls.

These approaches treat symptoms, not the silent staleness itself. The real solution requires application-level logic.

Building staleness detection at the message layer

The fix is simple: measure the time since the last meaningful message. If it exceeds a threshold, the stream is stale—regardless of what TCP believes.

Here’s a Python implementation using asyncio and websockets:

import asyncio
import json
import time

import websockets

STALENESS_TIMEOUT_SECONDS = 60  # Adjust based on feed frequency


class StaleStreamError(Exception):
    pass


async def handle_message(message):
    # Application-specific processing goes here.
    print(message)


async def consume_stream(url, subscribe_message):
    while True:
        try:
            async with websockets.connect(url) as ws:
                await ws.send(json.dumps(subscribe_message))
                # time.monotonic() is immune to system clock adjustments
                last_message_at = time.monotonic()

                async def read_messages():
                    nonlocal last_message_at
                    async for message in ws:
                        last_message_at = time.monotonic()
                        await handle_message(message)

                async def monitor_staleness():
                    while True:
                        # Check twice per timeout so worst-case detection
                        # latency stays close to the threshold
                        await asyncio.sleep(STALENESS_TIMEOUT_SECONDS / 2)
                        age = time.monotonic() - last_message_at
                        if age > STALENESS_TIMEOUT_SECONDS:
                            raise StaleStreamError(
                                f"No message for {age:.1f}s "
                                f"(threshold: {STALENESS_TIMEOUT_SECONDS}s)"
                            )

                reader = asyncio.create_task(read_messages())
                monitor = asyncio.create_task(monitor_staleness())
                done, pending = await asyncio.wait(
                    {reader, monitor}, return_when=asyncio.FIRST_COMPLETED
                )
                for task in pending:
                    task.cancel()
                for task in done:
                    # Re-raises StaleStreamError or connection errors here,
                    # where the except clauses below can actually catch them
                    task.result()
        except websockets.exceptions.ConnectionClosed:
            print("Connection closed, reconnecting...")
        except StaleStreamError as e:
            print(f"Staleness detected: {e}, reconnecting...")
        await asyncio.sleep(1)  # Backoff delay; see reconnect strategies below

The key principle: define "alive" at the application layer, not the OS layer.

Tune the timeout based on your feed’s natural gaps. A 60-second threshold may be too aggressive for IoT telemetry; five minutes might be too lenient for high-frequency trading data. A common heuristic sets the timeout at three to five times the longest expected message gap during off-peak periods.
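That heuristic is straightforward to encode. A sketch — the function name and the 4x default multiplier are illustrative choices, not a standard:

```python
def staleness_threshold(observed_gaps_seconds, multiplier=4):
    """Derive a staleness timeout from observed inter-message gaps.

    A multiplier of 3-5x the worst off-peak gap avoids false positives
    from natural lulls while still catching real stalls quickly.
    """
    return multiplier * max(observed_gaps_seconds)
```

Measure the gaps from a day or two of production logs rather than guessing; the off-peak maximum is usually far larger than the average.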

The limits of exchange-provided heartbeats

Some WebSocket protocols include heartbeats—small pings exchanged every few minutes to confirm both ends are operational. Binance Futures, for example, sends pings you must respond to with pongs.

Heartbeats help, but they don’t eliminate the staleness problem because:

  • Heartbeat mechanisms might remain active even after data subscriptions fail
  • Not all feeds implement heartbeats
  • Heartbeats only confirm liveness, not data flow

Treat heartbeats as one data point among many. The true signal is: "Am I receiving the specific messages I subscribed to?"
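One way to keep those signals separate is to timestamp heartbeats and data messages independently, and let only the data timestamp decide staleness. A hypothetical sketch (the class and method names are mine):

```python
import time

class FeedHealth:
    """Track heartbeat and data timestamps separately.

    A fresh heartbeat must never mask a stalled data subscription.
    """

    def __init__(self):
        now = time.monotonic()
        self.last_heartbeat = now
        self.last_data = now

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def on_data(self):
        self.last_data = time.monotonic()

    def data_is_stale(self, threshold_seconds):
        # Only the data timestamp decides staleness; heartbeats don't count.
        return time.monotonic() - self.last_data > threshold_seconds
```

With this split, a feed that pings every 30 seconds but sends no ticks still trips `data_is_stale`, which is the failure mode that went unnoticed for 22 hours.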

Reconnect strategies that prevent cascading failures

When staleness triggers a reconnect, avoid naive retries:

  • Use exponential backoff to avoid overwhelming a struggling server
  • Add jitter to prevent synchronized reconnection storms (e.g., 1000 clients hitting the server at once after an outage)
  • Implement state recovery for feeds requiring synchronization (e.g., order books, subscription channels)
  • Set alert thresholds—if reconnections exceed N times in M minutes, escalate to on-call
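The first two points combine into a few lines. A sketch of full-jitter exponential backoff — the function name and defaults are illustrative:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter exponential backoff.

    Each client sleeps a random duration in [0, min(cap, base * 2**attempt)],
    so a fleet of reconnecting clients spreads out instead of stampeding
    the server the instant it comes back.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Replace the fixed `asyncio.sleep(1)` in the reconnect loop with `asyncio.sleep(backoff_delay(attempt))` and reset `attempt` to zero after a healthy stretch of messages.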

The costly lesson in resilience

The bug that took down my service wasn’t subtle—yet it remained invisible until a manual audit. The assumption that "WebSocket connected" means "data flowing" failed silently the moment it stopped being true.

The fix required three layers:

  • Message-level staleness detection to catch silent gaps
  • External health monitoring exposing last_signal_age_seconds for tools like UptimeRobot
  • Backoff-and-retry logic resilient to transient failures
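The second layer is just a payload an HTTP health endpoint can serve. A sketch — the `last_signal_age_seconds` field name comes from above, while the `ok` rule and threshold are my assumptions:

```python
import time

def health_payload(last_message_at, threshold_seconds=60.0):
    """Body for an external health check (UptimeRobot and similar tools).

    An external monitor alerting on this field catches staleness even if
    the process itself is wedged enough to stop noticing.
    """
    age = time.monotonic() - last_message_at
    return {
        "last_signal_age_seconds": round(age, 1),
        "ok": age <= threshold_seconds,
    }
```

Expose it from any web framework you already run; the point is that a party outside the process verifies data flow, not the process grading its own homework.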

In streaming systems, visibility is survival. TCP keepalive is a safety net, not a compass. Build your monitoring around the data you expect to receive—not the connection that claims to be alive.

AI summary

The WebSocket says "connected" but no data is arriving? Then the problem isn't TCP keepalive. How to detect staleness at the application layer, explained with a practical Python example.
