How We Fixed a Black Friday Outage in 3 Hours with Kafka Streams

When Black Friday traffic surged to 2,000 concurrent users, our event platform’s leaderboard system collapsed under the load. What started as a simple Rails application with a PostgreSQL counter column for tracking treasure hunt rewards became a critical failure point. Within hours, our error rate skyrocketed from 0.2% to 18%, and we lost $47,000 in unredeemed rewards before the system could even scale up.

The Root Cause: A Locked Counter in PostgreSQL

Our initial design was straightforward: a PostgreSQL table with a counter column for each hunt. Under normal conditions, this worked fine. But when Black Friday traffic spiked, the simplicity became a bottleneck. PostgreSQL never escalates row-level locks to table-level locks, but it didn’t need to: every leaderboard update for a hunt had to wait for the lock on that hunt’s single counter row, so concurrent writers queued up behind one another. The system didn’t just slow down; it began failing writes entirely, throwing "could not serialize access due to concurrent update" and "deadlock detected" errors. The issue wasn’t just performance. It was a complete breakdown of the reward mechanism.

First Fix: Sharding the Counter (And the Problems It Created)

Our first attempt to resolve the issue was sharding the PostgreSQL counter by hunt ID, splitting the hot row into 1,024 partitions. This reduced lock contention, but it introduced a new set of challenges. Each hunt now required its own sequence, forcing our Rails code to route writes to the correct shard. The result? A 400ms latency increase on leaderboard queries, as the system had to union results across 1,024 tables.
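The read-side cost of sharding can be illustrated with a minimal in-memory sketch. This is not our Rails or PostgreSQL code: the `ShardedCounter` class and its `reads_touched` field are hypothetical names used only to show why writes get cheaper while reads get more expensive.

```python
import random
from typing import List

NUM_SHARDS = 1024  # shard count mentioned in the article


class ShardedCounter:
    """One logical counter split across many slots to spread write contention."""

    def __init__(self, num_shards: int = NUM_SHARDS) -> None:
        self.shards: List[int] = [0] * num_shards
        self.reads_touched = 0  # how many shards the last read had to visit

    def increment(self, amount: int = 1) -> None:
        # A write touches exactly one randomly chosen shard.
        index = random.randrange(len(self.shards))
        self.shards[index] += amount

    def total(self) -> int:
        # A read has to visit every shard and add the pieces back together.
        self.reads_touched = len(self.shards)
        return sum(self.shards)


counter = ShardedCounter()
for _ in range(10_000):
    counter.increment()

print(counter.total())        # 10000
print(counter.reads_touched)  # 1024
```

Each write contends with roughly 1/1,024 of the traffic, but every read pays for all 1,024 shards, which mirrors the union across tables described above.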

Worse, PostgreSQL sequences left gaps of up to 1,024 when nodes restarted, causing our reward payouts to be off by thousands during high-traffic hunts. Redis caching didn’t help either—leaderboard queries involved point lookups across 1,024 tables, and Redis couldn’t efficiently pipeline those requests. The sharding approach had traded one failure for another, without solving the core scalability problem.

The Architecture Overhaul: Kafka Streams and Event Sourcing

Frustrated by the limitations of our PostgreSQL-based approach, we decided to rebuild the system from scratch. The new architecture, dubbed HuntStream, replaced the PostgreSQL counter with an event-sourced system built on Kafka Streams. Every action in the treasure hunt (earning points, claiming rewards) was recorded as an immutable event in a Kafka topic. We then built a materialized view, backed by RocksDB, that consumed these events and maintained the current leaderboard state. This view was partitioned by hunt ID, meaning a leaderboard query only needed to access a single RocksDB partition per hunt.
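The shape of this design can be sketched in plain Python. This is an in-memory stand-in, not the Kafka Streams API and not the HuntStream code: `PointsEarned`, `PartitionedLog`, `LeaderboardView`, the CRC32 partitioner, and the partition count of 4 are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

NUM_PARTITIONS = 4  # hypothetical; the article does not state a partition count


@dataclass(frozen=True)
class PointsEarned:
    """Immutable event: a player earned points in a hunt."""
    hunt_id: str
    player_id: str
    points: int


def partition_for(hunt_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stable hash so every event for a hunt lands in the same partition.
    return zlib.crc32(hunt_id.encode("utf-8")) % num_partitions


class PartitionedLog:
    """Append-only log, one list per partition (stand-in for a topic)."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS) -> None:
        self.partitions: List[List[PointsEarned]] = [[] for _ in range(num_partitions)]

    def append(self, event: PointsEarned) -> None:
        index = partition_for(event.hunt_id, len(self.partitions))
        self.partitions[index].append(event)


class LeaderboardView:
    """Materialized view for one partition (stand-in for a local state store)."""

    def __init__(self) -> None:
        self.scores: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def apply(self, event: PointsEarned) -> None:
        self.scores[event.hunt_id][event.player_id] += event.points

    def top(self, hunt_id: str, n: int = 3) -> List[Tuple[str, int]]:
        # Highest score first; ties broken by player ID for a stable order.
        players = self.scores.get(hunt_id, {})
        return sorted(players.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


log = PartitionedLog()
for event in [
    PointsEarned("hunt-1", "ana", 50),
    PointsEarned("hunt-1", "bo", 30),
    PointsEarned("hunt-2", "cy", 10),
    PointsEarned("hunt-1", "bo", 40),
]:
    log.append(event)

# Build one view per partition by folding over that partition's events.
views = [LeaderboardView() for _ in log.partitions]
for view, partition in zip(views, log.partitions):
    for event in partition:
        view.apply(event)

# A query for one hunt reads exactly one partition's view.
hunt_view = views[partition_for("hunt-1")]
print(hunt_view.top("hunt-1"))  # [('bo', 70), ('ana', 50)]
```

Because the partition key is the hunt ID, all events for a hunt are folded by the same view, so a read never has to combine results across partitions.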

RocksDB’s built-in caching kept hot leaderboards in memory, while cold ones fell back to disk. The tradeoff was operational complexity—we now ran a Kafka cluster, three Streams applications, and had to monitor RocksDB compaction pauses. But the benefits were undeniable: exactly-once semantics, horizontal scalability, and the ability to replay events in case of corruption.

The Results: Zero Errors and Lightning-Fast Leaderboards

After deploying HuntStream to 100% of traffic, the improvements were immediate and dramatic. Under the same load of 2,000 concurrent users, our error rate plummeted from 18% to just 0.02%. Leaderboard latency dropped from 400ms to a mere 12ms at the 99th percentile. Our Kafka brokers handled 45,000 events per second, with 90% of those events processed in under 5ms end-to-end. The RocksDB materialized views used 1.8GB of RAM per hunt instance, and we scaled horizontally by adding more containers during traffic spikes.

The most surprising win was in reward-payout correctness. The event log allowed us to replay events and reconcile payouts exactly, eliminating the gaps caused by PostgreSQL sequence restarts. No more surprises when reconciling rewards—just accurate, auditable results.
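Reconciliation by replay can be sketched as a fold over the full log followed by a comparison against the live state. The event tuple shape and the `replay` and `reconcile` helpers below are hypothetical, written only to illustrate the idea rather than to reproduce our payout code.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str, int]  # (hunt_id, player_id, points); hypothetical shape
Key = Tuple[str, str]         # (hunt_id, player_id)


def replay(events: Iterable[Event]) -> Dict[Key, int]:
    """Rebuild per-player totals from scratch by folding over the full log."""
    totals: Counter = Counter()
    for hunt_id, player_id, points in events:
        totals[(hunt_id, player_id)] += points
    return dict(totals)


def reconcile(live: Dict[Key, int], events: Iterable[Event]) -> List[Tuple[Key, int, int]]:
    """Return (key, live_value, replayed_value) for every mismatch."""
    expected = replay(events)
    keys = sorted(set(live) | set(expected))
    return [
        (key, live.get(key, 0), expected.get(key, 0))
        for key in keys
        if live.get(key, 0) != expected.get(key, 0)
    ]


event_log = [
    ("hunt-1", "ana", 50),
    ("hunt-1", "bo", 30),
    ("hunt-1", "ana", 20),
]
# Simulated drift: the live state is missing 5 of bo's points.
live_state = {("hunt-1", "ana"): 70, ("hunt-1", "bo"): 25}

mismatches = reconcile(live_state, event_log)
print(mismatches)  # [(('hunt-1', 'bo'), 25, 30)]
```

The log is the source of truth here, so any disagreement is resolved in favor of the replayed value.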

Lessons Learned: What We’d Do Differently Next Time

Hindsight is 20/20, but some lessons are worth repeating. First, we would skip sharding the PostgreSQL counter altogether. Sharding added as much complexity as our eventual solution, but without the scalability benefits. We also severely underestimated the cost of RocksDB compaction. During our first major traffic spike, one hunt’s RocksDB instance stalled for 8 seconds during compaction, causing temporary leaderboard staleness. We had to tune compaction settings and increase disk IOPS to prevent a recurrence.

For future projects, we’d seriously consider using a managed stream processing platform like Confluent Cloud or Redpanda instead of self-hosting Kafka, unless on-prem control was absolutely necessary. Finally, we’d implement an integration test that simulates 5,000 concurrent writes and verifies leaderboard correctness before every deployment. Our post-incident test suite only covered latency, not correctness—and we paid the price for that oversight.
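A correctness-focused test of this kind might look roughly like the sketch below. `InMemoryLeaderboard` is a hypothetical stand-in for the real service, and the player count, points per write, and pool of 64 worker threads are assumptions; a real test would drive the deployed system rather than a local object.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

NUM_WRITES = 5_000     # write count suggested in the article
NUM_PLAYERS = 50       # hypothetical
POINTS_PER_WRITE = 10  # hypothetical


class InMemoryLeaderboard:
    """Hypothetical stand-in for the leaderboard service under test."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._scores: Dict[str, int] = {}

    def add_points(self, player_id: str, points: int) -> None:
        # The lock makes the read-modify-write atomic across threads.
        with self._lock:
            self._scores[player_id] = self._scores.get(player_id, 0) + points

    def snapshot(self) -> Dict[str, int]:
        with self._lock:
            return dict(self._scores)


def run_concurrent_writes(board: InMemoryLeaderboard) -> Dict[str, int]:
    def write(i: int) -> None:
        board.add_points(f"player-{i % NUM_PLAYERS}", POINTS_PER_WRITE)

    with ThreadPoolExecutor(max_workers=64) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(write, range(NUM_WRITES)))
    return board.snapshot()


def test_leaderboard_correct_under_concurrency() -> None:
    scores = run_concurrent_writes(InMemoryLeaderboard())
    expected_per_player = (NUM_WRITES // NUM_PLAYERS) * POINTS_PER_WRITE
    # Correctness, not latency: no write may be lost or double-counted.
    assert sum(scores.values()) == NUM_WRITES * POINTS_PER_WRITE
    assert len(scores) == NUM_PLAYERS
    assert all(value == expected_per_player for value in scores.values())


test_leaderboard_correct_under_concurrency()
```

The assertions check totals and per-player scores rather than timing, which is the gap our earlier suite left open.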

Looking back, the Black Friday outage was a brutal lesson in scalability and reliability. But by embracing event sourcing and stream processing, we not only fixed the problem—we built a system that can handle traffic spikes with ease. The next time our platform faces a surge, it won’t just survive—it’ll thrive.

AI summary

How did Veltrix, which lost $47,000 when its leaderboard system collapsed under Black Friday traffic, move from PostgreSQL to a Kafka-based event-sourcing system? Performance data and architectural details.
