iToverDose/Software· 28 MAY 2026 · 00:02

Why Keeping Events Close to Aggregates Cuts System Latency by 75%

When a CQRS setup split events and aggregates across Kafka and PostgreSQL, latency and scaling issues forced costly fixes. Merging them into one database cut write time from 48 ms to 12 ms and read time from 900 ms to 45 ms. The approach saved $5k monthly while eliminating lag spikes.

DEV Community3 min read0 Comments

A microservices architecture built on CQRS promised eventual consistency and zero data loss. Instead, it delivered a 40-millisecond write path, a 200-millisecond read path, and a cascade of scaling challenges. The root cause wasn’t the pattern itself—it was the decision to store events in a separate Kafka cluster while keeping aggregates in PostgreSQL. Every network hop, every consumer group, every offset commit cycle added friction that compounded under load. At 800 requests per second, materialized views lagged 2.3 seconds behind reality. At 250,000 requests per second, the system accumulated 4.2 million unprocessed events, triggering nightly pages and Zookeeper timeouts that forced consumer restarts every 15 minutes.

Why the Outbox and Dual-Database Model Failed Under Load

The team initially relied on the outbox pattern: Debezium captured writes from PostgreSQL and streamed them to Kafka. Downstream services consumed these streams to build materialized views. The promise was clear—eventual consistency with durability—but the reality was far from it. Each attempt to fix the lag introduced new bottlenecks. Upgrading Kafka to version 3.5 with transactional producers reduced unprocessed events by only 12%, leaving 3.7 million in the queue. Scaling consumers to 12 Kubernetes pods across three availability zones pushed CPU steal rates to 45% and raised 99th-percentile read latency to 900 milliseconds. Switching from Debezium to Kafka Connect with JDBC source added schema validation pauses that stretched to 20 seconds, pushing the lag to 5.1 million events.

  • Transactional writes improved consistency but didn’t address network latency.
  • Horizontal scaling increased CPU overhead and memory pressure without reducing lag.
  • Schema evolution efforts exposed a deeper issue: the dual-database design was the bottleneck.

Every fix treated a symptom, not the cause. The architecture had become a Rube Goldberg machine where one optimization broke another, and the total cost of ownership kept rising.

Merging Events and Aggregates: One Database, Zero Latency Spikes

The turning point came when the team abandoned the separation. They moved events from Kafka into the same PostgreSQL cluster that housed their aggregate roots, storing them in jsonb columns with a composite GIN index on {aggregate_id, event_sequence}. They replaced Kafka with PostgreSQL’s logical replication slots, feeding events into a single Go service that emitted a compact binary format over internal gRPC streams for downstream consumers. The result was a single round-trip: client to PostgreSQL to replication slot to gRPC.

On the read side, materialized views refreshed in 50 milliseconds instead of 2.3 seconds because they now read events from the same cluster. The replication slot lag vanished from the dashboard entirely. The tradeoffs were manageable: they sacrificed Kafka’s disk spill-over capacity and had to shard PostgreSQL at 5 terabyte nodes. But the latency improvements were immediate. P99 write latency dropped from 48 milliseconds to 12 milliseconds, and p99 read latency for materialized views fell from 900 milliseconds to 45 milliseconds.

  • Eliminated two managed services (Kafka and Debezium).
  • Reduced monitoring overhead by six Prometheus exporters.
  • Cut monthly cloud spend from $18,000 to $13,000.
  • Reduced fleet size from 24 Kafka brokers to 12 PostgreSQL nodes.

The only new failure mode appeared during primary failover, where logical replication lag briefly spiked. The team mitigated this by designating a hot standby replica with pg_rewind, adding only 20 milliseconds to failover time while keeping lag at zero during switchover.

Lessons from the Fire Drill

The most expensive lesson wasn’t technical—it was organizational. The team spent six weeks and $42,000 chasing optimizations before realizing the boundary between events and aggregates was unsustainable. If they had started with a single database and used logical replication as the event bus from day one, they could have avoided the fire drill entirely. They still keep raw event streams in an object store for replay, but they no longer pay the network and latency tax of a second database.

The takeaway is simple: service boundaries must be data-driven, not cargo-cult architectural choices. Kafka isn’t inherently flawed; the flaw was treating it as a default solution without validating whether the latency and cost overheads were justified. Next time, the team will do the math before the system breaks—not after.

Event sourcing and CQRS can coexist with low latency and high availability. The key is keeping related data as close as the laws of physics allow.

AI summary

CQRS mimarisinde olayları ve verileri ayrı tutmak performans krizine yol açabilir. Bir ekip, olayları PostgreSQL'de saklayarak nasıl %28 maliyet düşürdü ve sistem gecikmelerini sıfırladı.

Comments

00
LEAVE A COMMENT
ID #OMUKIE

0 / 1200 CHARACTERS

Human check

3 + 9 = ?

Will appear after editor review

Moderation · Spam protection active

No approved comments yet. Be first.