How AI gateways affect latency during real-world outages

When a leading cloud provider’s API stumbles, your AI application doesn’t care about the model—it cares about getting a response. A recent experiment at Nexus Labs exposed the gap between lab benchmarks and real-world failure recovery. For 30 days, the team instrumented three AI gateways under identical traffic, measuring how each handled provider outages and latency spikes. The results challenge assumptions about routing overhead and highlight why failover performance deserves more attention than model selection.

Why outage recovery matters more than model choice

Nexus Labs processes 2.4 million LLM requests daily, with half routed to OpenAI and the rest distributed across Anthropic, AWS Bedrock, and Google Vertex. On April 23, OpenAI experienced a four-hour incident that blocked traffic for 38 minutes before Nexus’s homegrown retry logic rerouted requests. That downtime exposed a critical flaw: most gateway benchmarks focus on cold-start throughput, not recovery speed under real failures.

The team replaced the retry layer with three open-source gateways—Bifrost, LiteLLM, and Portkey—each configured to handle the same fallback chain: OpenAI as primary, Anthropic as secondary, and AWS Bedrock as tertiary. Cache was disabled, rate limits mirrored production, and all tests ran on identical hardware (c6i.4xlarge instances behind a network load balancer).
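The fallback chain described above can be illustrated with a short Python sketch. It is not how Bifrost, LiteLLM, or Portkey implement routing. The `Provider` type, the `ProviderError` exception, the set of retryable status codes, and the placeholder model names are all assumptions made for illustration, and the provider calls are stubs rather than real SDK calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Status codes that trigger a move to the next provider (assumed for this sketch).
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class ProviderError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


@dataclass
class Provider:
    name: str                        # label only, e.g. "openai"
    model: str                       # placeholder for the equivalent model
    call: Callable[[str, str], str]  # stub: (model, prompt) -> completion text


def complete_with_fallback(chain: list, prompt: str) -> tuple:
    """Try providers in order and return (provider_name, completion)."""
    last_error: Optional[ProviderError] = None
    for provider in chain:
        try:
            return provider.name, provider.call(provider.model, prompt)
        except ProviderError as err:
            if err.status not in RETRYABLE_STATUS:
                raise  # a non-retryable error is surfaced to the caller
            last_error = err
    raise RuntimeError("every provider in the chain failed") from last_error


def rate_limited(model: str, prompt: str) -> str:
    """Stub that simulates a provider answering with HTTP 429."""
    raise ProviderError(429)


def echo(model: str, prompt: str) -> str:
    """Stub that simulates a healthy provider."""
    return f"[{model}] {prompt}"


# Same order as the article: OpenAI, then Anthropic, then AWS Bedrock.
chain = [
    Provider("openai", "primary-model", rate_limited),
    Provider("anthropic", "secondary-model", echo),
    Provider("bedrock", "tertiary-model", echo),
]
served_by, text = complete_with_fallback(chain, "ping")
print(served_by, text)
```

Keeping the model name on each `Provider` entry is one way to express the model mapping: the caller passes only a prompt, and the chain decides which provider and model actually serve it.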

Benchmarking failover: latency, memory, and recovery time

After 720 hours of mirrored traffic, the gateways showed stark differences in performance. Bifrost, written in Go, achieved the lowest overhead and fastest failover, while LiteLLM’s Python-based implementation lagged behind in both metrics. Portkey’s self-hosted version performed better than LiteLLM but still required 340ms to recover from a downed provider. Memory usage also varied significantly, with LiteLLM consuming over twice the RAM of Bifrost at 1,000 requests per second.

| Gateway | p50 Overhead | p99 Overhead | Failover Time (ms) | Memory at 1k RPS |
|---------|--------------|--------------|--------------------|------------------|
| Bifrost | 3ms | 11ms | 180 | 412 MB |
| LiteLLM | 8ms | 41ms | 620 | 890 MB |
| Portkey | 6ms | 29ms | 340 | 650 MB |
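For readers reproducing the p50 and p99 columns, the following is a minimal sketch of a nearest-rank percentile over per-request overhead samples. The article does not say which percentile method Nexus Labs used, so the nearest-rank choice is an assumption, and the sample values are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up gateway overhead samples in milliseconds, not Nexus Labs data.
overhead_ms = [2, 3, 4, 5, 20]

p50 = percentile(overhead_ms, 50)
p99 = percentile(overhead_ms, 99)
print(p50, p99)
```

A single slow request dominates the p99 of a small sample, which is why tail figures are only meaningful over a large number of requests.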

Bifrost’s edge came from its synchronous fallback evaluation, which avoids re-queuing requests and reduces latency on retries. LiteLLM’s strength, however, lay in its extensibility: custom cost-tracking callbacks proved invaluable for financial reporting. Portkey’s managed offering closed the gap in features, but its self-hosted version lacked full feature parity with the managed one.

Practical use cases beyond routing

Beyond failover, the team leveraged Bifrost for three key workflows:

Automatic provider switching: When OpenAI returned a 429 error, requests seamlessly redirected to Anthropic with an equivalent model, without changes to the agent code. The gateway handled retries, model mapping, and response formatting internally.

Semantic caching: For an evaluation suite that repeats 18,000 prompts nightly, Bifrost’s cache achieved a 73% hit rate, saving roughly 13,000 API calls per night. The feature proved especially useful for regression testing against new model versions.

Observability integration: Native Prometheus metrics required just five minutes to integrate with an existing stack. While the default dashboards needed tweaking, the raw metrics enabled granular performance tracking.
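To make the semantic caching workflow above more concrete, here is a minimal sketch of the idea: reuse a stored response when a new prompt is close enough to one already seen. This is not Bifrost’s implementation. The word-count `embed` function is a toy stand-in for a real embedding model, and the `SemanticCache` class, its `threshold`, and the linear scan are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache that serves a stored response for similar prompts."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (vector, response) pairs
        self.hits = 0
        self.misses = 0

    def get_or_call(self, prompt, call):
        vector = embed(prompt)
        for cached_vector, response in self.entries:
            if cosine(vector, cached_vector) >= self.threshold:
                self.hits += 1
                return response
        self.misses += 1
        response = call(prompt)  # only cache misses reach the provider
        self.entries.append((vector, response))
        return response

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


provider_calls = []


def fake_provider(prompt):
    """Stub provider that records each call it receives."""
    provider_calls.append(prompt)
    return f"response to: {prompt}"


cache = SemanticCache(threshold=0.9)
cache.get_or_call("Summarize the release notes", fake_provider)
cache.get_or_call("summarize the release notes", fake_provider)
cache.get_or_call("Translate this to French", fake_provider)
print(cache.hits, cache.misses, len(provider_calls))
```

The savings quoted above follow directly from the hit rate: 18,000 prompts at a 73% hit rate is 13,140 requests served from cache, consistent with the "roughly 13,000" figure.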

Portkey and LiteLLM offered similar capabilities, but with different trade-offs. LiteLLM’s plugin ecosystem supported advanced cost tracking, while Portkey’s managed control plane appealed to teams unwilling to self-host.

Key considerations before adopting a gateway

Bifrost’s younger codebase supports 23 providers, but niche LLMs may require custom configuration. The plugin interface is straightforward for developers, yet still demands engineering time. LiteLLM, by contrast, boasts broader community support and a mature callback system—ideal for teams already invested in its ecosystem. Portkey suits organizations prioritizing a managed solution over operational overhead.

The team chose not to use features such as the MCP gateway, governance tools, or SSO, relying on external auth solutions instead. For teams with similar needs, skipping built-in governance may simplify adoption, though it could become a limitation in larger deployments.

Final takeaway: test before you commit

These metrics reflect Nexus Labs’ specific workload. Traffic patterns, request sizes, and provider dependencies will influence results. Before migrating, run your own benchmarks under realistic failure scenarios. The model is only as reliable as the infrastructure routing it—and in production, that infrastructure must handle outages gracefully.
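One way to approach such a benchmark is to inject an outage and measure the time until the first successful response. The sketch below does this against a simulated clock and a fake gateway, so it runs without any network access. The function names and the fake gateway are hypothetical; in a real test, `send` would issue a request through the gateway under test and report whether it succeeded.

```python
def measure_recovery_ms(send, outage_start_ms, duration_ms, interval_ms):
    """Probe at a fixed interval and return the milliseconds between the
    outage start and the first successful probe, or None if none succeeds."""
    now_ms = 0
    while now_ms <= duration_ms:
        ok = send(now_ms)
        if ok and now_ms >= outage_start_ms:
            return now_ms - outage_start_ms
        now_ms += interval_ms
    return None


def make_fake_gateway(outage_start_ms, failover_ms):
    """Simulated gateway: requests fail from the outage start until the
    (assumed) failover delay has elapsed, then succeed again."""
    def send(now_ms):
        in_gap = outage_start_ms <= now_ms < outage_start_ms + failover_ms
        return not in_gap
    return send


send = make_fake_gateway(outage_start_ms=1000, failover_ms=180)
recovery = measure_recovery_ms(
    send, outage_start_ms=1000, duration_ms=5000, interval_ms=10
)
print(recovery)
```

The probe interval bounds the resolution of the result: with a 50 ms interval the same fake gateway would be reported as recovering after 200 ms, so the interval should be small relative to the failover times being compared.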

AI summary

A comparative analysis of Bifrost, LiteLLM, and Portkey based on 30 days of production data. Which AI gateway offers the fastest failover time? A detailed review with performance and latency data.

