How Discord optimized its tech stack for over 500 million users

Discord’s early years followed the classic startup playbook: launch fast, scale later. But as user numbers exploded, so did the technical debt. The company didn’t rewrite its stack out of preference—it did so out of necessity, responding to real-world constraints that threatened performance and reliability.

The foundation: Building for today, not tomorrow

When Discord launched in 2015, the team chose a stack optimized for speed over perfection. For real-time messaging, they turned to Elixir, leveraging the BEAM virtual machine’s ability to handle millions of concurrent connections. Python powered the API layer, while Go managed microservices. MongoDB stored user data, and Electron delivered a cross-platform desktop experience.

This approach allowed Discord to ship quickly and iterate. The trade-off? Future scalability challenges. As user growth accelerated, some components struggled to keep pace, forcing the team to confront hard limits.

MongoDB’s scaling ceiling: When 5 million users weren’t enough

By 2017, Discord had crossed the 5 million user mark. MongoDB, though reliable in the early days, couldn’t handle the load. The team migrated to Cassandra, a distributed database designed for high write throughput. The shift worked—for a while.

But growth is relentless. By 2022, the cluster had grown from the 12 Cassandra nodes Discord ran in 2017 to 177. Maintenance became a nightmare, operational costs spiraled, and performance lagged. The search for a better solution led Discord to ScyllaDB, a Cassandra-compatible database optimized for low latency and high throughput. The migration wasn’t about chasing trends; it was about solving concrete problems.

Elixir’s data crunch: When milliseconds mattered

Discord’s real-time messaging system relied heavily on Elixir, which excelled at handling concurrent connections. But as the platform grew, certain operations—like sorting large datasets—started taking 170 milliseconds per request. At Discord’s scale, that latency compounded into noticeable delays for users.

The solution? Rewrite the critical path in Rust. Rust’s zero-cost abstractions and memory safety guarantees slashed latency to just 1 millisecond. The performance improvement wasn’t incremental—it was transformative. This wasn’t a case of premature optimization; it was a response to measured, real-world pain points.
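The article does not show the code involved, so the following is only a rough illustration of why a purpose-built sorted structure can beat re-sorting a large collection on every request. It is a minimal sketch in Python, chosen for brevity, and is not Discord’s Rust code. The `SortedSet` class and its method names are hypothetical. The idea is that each insert uses a binary search to find its slot, so the data stays ordered and reads never pay for a full sort.

```python
import bisect


class SortedSet:
    """Keeps unique items in ascending order so reads never need a full sort."""

    def __init__(self):
        self._items = []

    def add(self, item):
        # Binary search finds the insertion point in O(log n) comparisons.
        index = bisect.bisect_left(self._items, item)
        if index < len(self._items) and self._items[index] == item:
            return False  # already present, nothing to do
        self._items.insert(index, item)
        return True

    def remove(self, item):
        index = bisect.bisect_left(self._items, item)
        if index < len(self._items) and self._items[index] == item:
            del self._items[index]
            return True
        return False

    def index_of(self, item):
        """Return the item's position in sorted order, or None if absent."""
        index = bisect.bisect_left(self._items, item)
        if index < len(self._items) and self._items[index] == item:
            return index
        return None

    def to_list(self):
        return list(self._items)


# Hypothetical member names; the duplicate "alice" is ignored.
members = SortedSet()
for name in ["mallory", "alice", "trent", "bob", "alice"]:
    members.add(name)
```

A list-backed set like this still shifts elements on insert, so it is a teaching aid rather than a production design. What a lower-level language adds is tighter control over memory layout and the cost of each comparison.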

Go’s garbage collector: The silent performance killer

Discord’s Read States service tracks every message a user has read, updating in real time whenever someone opens the app or sends a message. The service was originally written in Go, whose runtime forces a garbage collection at least once every two minutes, even when little new garbage has been produced.

With millions of users cached in memory, those garbage collection cycles introduced unpredictable latency spikes. Tuning Go’s garbage collector or upgrading to newer versions didn’t solve the problem. The only viable fix? Rewriting the service in Rust, which offers deterministic memory deallocation. The result? Latency dropped from milliseconds to microseconds—critical when user numbers surged to 100 million monthly during the pandemic.
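To make the shape of such a service concrete, here is a minimal sketch of an in-memory cache of read states. The article says only that millions of users were cached in memory. The least-recently-used eviction policy, the write-back to a backing store on eviction, and every name below (`ReadStateCache`, `mark_read`, `last_read`, `database`) are assumptions made for illustration. Python manages memory differently from both Go and Rust, so this shows the cache structure only, not the garbage collection behavior.

```python
from collections import OrderedDict


class ReadStateCache:
    """LRU cache mapping (user_id, channel_id) to the last-read message id."""

    def __init__(self, capacity, store):
        self.capacity = capacity
        self.store = store  # hypothetical stand-in for the backing database
        self._entries = OrderedDict()

    def mark_read(self, user_id, channel_id, message_id):
        key = (user_id, channel_id)
        self._entries[key] = message_id
        self._entries.move_to_end(key)  # mark as most recently used
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry and persist it right away.
            old_key, old_value = self._entries.popitem(last=False)
            self.store[old_key] = old_value

    def last_read(self, user_id, channel_id):
        key = (user_id, channel_id)
        if key in self._entries:
            self._entries.move_to_end(key)
            return self._entries[key]
        return self.store.get(key)  # fall back to the backing store


database = {}
cache = ReadStateCache(capacity=2, store=database)
cache.mark_read("u1", "general", 101)
cache.mark_read("u2", "general", 102)
cache.mark_read("u3", "general", 103)  # evicts u1's entry into `database`
```

In a garbage-collected runtime, a large cache like this is a big live heap that the collector has to examine on each cycle. In Rust, an evicted entry is freed as soon as it is dropped, which is the deterministic deallocation referred to above.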

What stayed the same—and why

Not every component needed a rewrite. Elixir remained the backbone of real-time messaging, thanks to the BEAM VM’s ability to handle millions of concurrent processes with near-instant recovery. Python continued to power APIs, and Electron kept the desktop client running smoothly. React Native eventually unified iOS and Android development, replacing separate codebases.

The lesson? Stability matters. When a system works, leave it alone. Discord’s stack evolved through necessity, not experimentation.

The real lesson: Build for today, scale for tomorrow

Discord’s story isn’t about which language or database is superior. It’s about recognizing limits—whether in latency, scalability, or operational overhead—and responding with data-driven decisions. Every migration had a clear trigger: a metric that couldn’t be tuned away, a threshold crossed, a bottleneck exposed.
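One way to turn “a threshold crossed” into something checkable is to compare a tail-latency percentile against a budget. The sketch below is a hedged example, not a description of Discord’s monitoring. The function names, the sample data, and the 10 ms budget are all hypothetical. It uses the nearest-rank definition of a percentile.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def exceeds_budget(samples_ms, budget_ms, pct=99):
    """True when the chosen percentile latency is over the budget."""
    return percentile(samples_ms, pct) > budget_ms


# Hypothetical samples: mostly fast requests plus a few periodic spikes.
latencies_ms = [2] * 97 + [40, 45, 50]
mean_ms = sum(latencies_ms) / len(latencies_ms)
over_budget = exceeds_budget(latencies_ms, budget_ms=10)
```

With these made-up samples the mean stays well under the budget while the 99th percentile does not. This is why periodic spikes tend to show up in tail metrics before they show up in averages.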

The company didn’t aim for 500 million users on day one. It built for the present while preparing for the future. The key takeaway? Know your why before you change your stack. Optimize when the numbers demand it, not when the hype arrives.

The next time someone asks which tech stack to choose, the answer might be simpler than you think: Start with what works. Then, be ready to evolve.
