Two weeks ago, a critical crypto signal API ran uninterrupted in production—until a database check revealed it had missed 22 hours of price updates. The service logs showed green lights. The deployment dashboard flashed all-clear. Even the TCP socket appeared healthy. Yet the WebSocket feed from Binance had gone dormant without a single error. This is the silent staleness problem: a connection that looks alive at every layer except the one that matters.
The issue isn’t rare. Any system consuming long-lived WebSocket streams—price feeds, chat platforms, IoT sensors, or log pipelines—risks this exact failure mode. TCP keepalive can’t save you here. It only verifies the network route, not whether your application data is still flowing.
The myth of "connected"
When a WebSocket connection opens, a TCP handshake establishes the socket. From that moment, "connected" means the network path exists—not that the remote service is actively sending data.
TCP keepalive periodically sends probes to confirm the route remains viable. If the path breaks, the OS eventually closes the socket and raises an error. But TCP keepalive has blind spots:
- It cannot detect whether the remote application stopped pushing messages
- It cannot see if a proxy or load balancer dropped your subscription
- It cannot catch backend bugs that halt event emission while keeping the connection open
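For context, here is roughly what TCP keepalive configures at the socket level. A minimal sketch, with the caveat that `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are Linux-specific constants:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Enable keepalive probes on this socket.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning: first probe after 60s of idle time,
# then every 10s, and drop the connection after 3 failed probes.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
```

Every one of these probes can succeed while the application on the other end sends nothing at all.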
In my case, Binance’s WebSocket gateway accepted the connection, accepted the subscriptions, and then simply stopped sending ticker updates. The TCP socket was perfect. The operating system was fine. The code was correct. The data was gone.
Why standard recovery tactics fail
Developers often try quick fixes that don’t address the root cause:
- Reconnect on error: Fails because WebSockets can stay open even when no data arrives.
- Ping the server: Useless if the service doesn’t respond to pings—exchanges often ignore client pings on data streams.
- Check subscription confirmation: Catches startup failures but not mid-stream stalls.
These approaches treat symptoms, not the silent staleness itself. The real solution requires application-level logic.
Building staleness detection at the message layer
The fix is simple: measure the time since the last meaningful message. If it exceeds a threshold, the stream is stale—regardless of what TCP believes.
Here’s a Python implementation using asyncio and websockets:
```python
import asyncio
import json
import time

import websockets

STALENESS_TIMEOUT_SECONDS = 60  # Adjust based on feed frequency


class StaleStreamError(Exception):
    pass


async def consume_stream(url, subscribe_message):
    while True:
        stale_error = None
        try:
            async with websockets.connect(url) as ws:
                await ws.send(json.dumps(subscribe_message))
                last_message_at = time.time()

                async def monitor_staleness():
                    # An exception raised inside a background task never
                    # reaches the except clauses below, so record the error
                    # and close the socket instead; closing the socket ends
                    # the async-for loop in the main coroutine.
                    nonlocal stale_error
                    while True:
                        await asyncio.sleep(STALENESS_TIMEOUT_SECONDS)
                        age = time.time() - last_message_at
                        if age > STALENESS_TIMEOUT_SECONDS:
                            stale_error = StaleStreamError(
                                f"No message for {age:.1f}s "
                                f"(threshold: {STALENESS_TIMEOUT_SECONDS}s)"
                            )
                            await ws.close()
                            return

                staleness_task = asyncio.create_task(monitor_staleness())
                try:
                    async for message in ws:
                        last_message_at = time.time()
                        await handle_message(message)  # your app-level handler (not shown)
                finally:
                    staleness_task.cancel()
                # Surface the staleness error recorded by the monitor task.
                if stale_error is not None:
                    raise stale_error
        except websockets.exceptions.ConnectionClosed:
            print("Connection closed, reconnecting...")
        except StaleStreamError as e:
            print(f"Staleness detected: {e}, reconnecting...")
        await asyncio.sleep(1)  # Backoff delay before reconnecting
```

The key principle: define "alive" at the application layer, not the OS layer.
Tune the timeout based on your feed’s natural gaps. A 60-second threshold may be too aggressive for IoT telemetry; five minutes might be too lenient for high-frequency trading data. A common heuristic sets the timeout at three to five times the longest expected message gap during off-peak periods.
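As a hypothetical helper (the function name and the multiplier default are illustrative), the heuristic translates to something like:

```python
def staleness_threshold(observed_gaps_seconds, multiplier=4):
    """Size the staleness timeout from historical message gaps.

    observed_gaps_seconds: inter-message gaps sampled during off-peak hours.
    multiplier: three to five times the worst observed gap, per the heuristic.
    """
    return multiplier * max(observed_gaps_seconds)

# e.g. gaps of up to 12s off-peak -> a 48s timeout
timeout = staleness_threshold([0.5, 2.1, 12.0])
```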
The limits of exchange-provided heartbeats
Some WebSocket protocols include heartbeats—small pings exchanged every few minutes to confirm both ends are operational. Binance Futures, for example, sends pings you must respond to with pongs.
Heartbeats help, but they don’t eliminate the staleness problem because:
- Heartbeat mechanisms might remain active even after data subscriptions fail
- Not all feeds implement heartbeats
- Heartbeats only confirm liveness, not data flow
Treat heartbeats as one data point among many. The true signal is: "Am I receiving the specific messages I subscribed to?"
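One way to operationalize that is to track heartbeat liveness and data flow as separate timestamps. A minimal sketch (the class and method names are illustrative, not from any exchange SDK):

```python
import time


class StreamHealth:
    """Track connection liveness and data flow as independent signals."""

    def __init__(self):
        self.last_ping_at = 0.0  # heartbeat: connection-level liveness
        self.last_data_at = 0.0  # subscribed messages: what we actually need

    def on_ping(self):
        self.last_ping_at = time.time()

    def on_data(self):
        self.last_data_at = time.time()

    def data_is_stale(self, threshold_seconds):
        # The connection can look alive (recent pings) while data is stale.
        return time.time() - self.last_data_at > threshold_seconds
```

Alerting on `data_is_stale()` rather than on missed pings is exactly what catches the Binance failure mode described above.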
Reconnect strategies that prevent cascading failures
When staleness triggers a reconnect, avoid naive retries (a backoff-with-jitter sketch follows this list):
- Use exponential backoff to avoid overwhelming a struggling server
- Add jitter to prevent synchronized reconnection storms (e.g., 1000 clients hitting the server at once after an outage)
- Implement state recovery for feeds requiring synchronization (e.g., order books, subscription channels)
- Set alert thresholds—if reconnections exceed N times in M minutes, escalate to on-call
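Exponential backoff with full jitter fits in a few lines; a minimal sketch, where the base and cap defaults are assumptions to tune for your service:

```python
import random


def backoff_delay(attempt, base_seconds=1.0, cap_seconds=60.0):
    """Exponential backoff with full jitter."""
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return random.uniform(0, ceiling)

# attempt 0 -> up to 1s, attempt 3 -> up to 8s, attempt 10+ -> up to 60s
for attempt in range(5):
    print(f"attempt {attempt}: sleep {backoff_delay(attempt):.2f}s")
```

Full jitter spreads clients uniformly across the retry window, which is what breaks up the synchronized reconnection storms described above.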
The costly lesson in resilience
The bug that took down my service wasn't subtle, yet it stayed invisible until a manual audit. The assumption that "WebSocket connected = data flowing" had quietly stopped holding, and nothing in the stack was built to notice.
The fix required three layers:
- Message-level staleness detection to catch silent gaps
- External health monitoring exposing `last_signal_age_seconds` for tools like UptimeRobot (a minimal endpoint sketch follows this list)
- Backoff-and-retry logic resilient to transient failures
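A health endpoint for that second layer can be tiny. This sketch assumes aiohttp and a module-level `last_message_at` timestamp updated by the consumer loop; both choices are illustrative:

```python
import time

from aiohttp import web

STALENESS_TIMEOUT_SECONDS = 60
last_message_at = time.time()  # updated by the stream consumer


async def health(request):
    # Report data age, not process liveness: 503 when the stream is stale.
    age = time.time() - last_message_at
    status = 200 if age < STALENESS_TIMEOUT_SECONDS else 503
    return web.json_response(
        {"last_signal_age_seconds": round(age, 1)}, status=status
    )

app = web.Application()
app.router.add_get("/health", health)
# web.run_app(app, port=8080)
```

An uptime checker pointed at `/health` now alerts on stale data, not just on a dead process.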
In streaming systems, visibility is survival. TCP keepalive is a safety net, not a compass. Build your monitoring around the data you expect to receive—not the connection that claims to be alive.
AI summary
If the WebSocket says "connected" but no data is arriving, the problem isn't TCP keepalive. How do you detect staleness at the application layer? Explained with a practical Python example.