Performance testing for large language models (LLMs) often misses critical issues until users complain. A recent experiment using NVIDIA’s AIPerf tool demonstrates how misleading single-user benchmarks can be—and why realistic concurrency testing is essential for production readiness.
LLM deployments frequently rely on superficial metrics that create false confidence. When one developer ran three targeted tests on a locally hosted model, the results exposed a gaping flaw in standard performance evaluation. The experiment, conducted with the Granite 4 350M model via Ollama, revealed that what appeared efficient under ideal conditions became catastrophically slow under load.
Why single-user benchmarks are dangerously incomplete
Most performance tests for LLM endpoints begin with a single-user scenario, simulating an isolated request to gauge responsiveness. While this baseline can produce impressive numbers, it fails to account for the realities of shared infrastructure. In this case, the developer used NVIDIA AIPerf—an open-source successor to GenAI-Perf—to evaluate the Granite 4 350M model running locally on a development machine.
The initial test configuration included (a command sketch follows this list):
- A local Ollama endpoint
- Streaming chat interactions
- 50 total requests with concurrency set to 1
- Built-in tokenizer integration
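Reconstructed from that configuration, the baseline invocation looks roughly like the sketch below. This is a hedged sketch, not a verbatim reproduction of the developer's command: the flag names follow the GenAI-Perf conventions that AIPerf inherits, the URL assumes Ollama's default OpenAI-compatible endpoint, and the model tag is a hypothetical stand-in for however Granite 4 350M is named locally. Run `aiperf profile --help` for the authoritative spellings.

```bash
# Baseline: 50 streaming chat requests, one at a time (concurrency 1).
# Assumptions: flag names follow the GenAI-Perf conventions AIPerf inherits;
# the URL is Ollama's default OpenAI-compatible endpoint; the model tag
# "granite4:350m" is a stand-in for the local model name.
aiperf profile \
  --model granite4:350m \
  --url http://localhost:11434/v1 \
  --endpoint-type chat \
  --streaming \
  --request-count 50 \
  --concurrency 1
```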
The output suggested flawless performance:
| Metric | avg | p50 | p99 |
| --- | --- | --- | --- |
| TTFT (ms) | 223.11 | 217.60 | 317.61 |
| TTST (ms) | 10.94 | 9.99 | 18.00 |
| ITL (ms) | 10.67 | 10.51 | 12.35 |
| Request latency (ms) | 1,309.30 | 1,043.95 | 3,251.73 |

Request throughput: 0.76 req/sec

With an average time-to-first-token (TTFT) of 223 milliseconds and stable inter-token latency (ITL), the system appeared production-ready. But this narrative changes dramatically when concurrency increases.
The reality under load: when TTFT explodes
The second test introduced realistic conditions by increasing concurrency to 50 simultaneous users and adding a 10-request warmup phase. Running for 60 seconds, this configuration better reflects actual deployment scenarios where multiple users interact with the system simultaneously.
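A hedged sketch of that second run, under the same flag-name assumptions as the baseline command:

```bash
# Load test: 50 concurrent users, 10 warmup requests, roughly 60 seconds.
# The duration flag in particular is an assumption -- AIPerf's exact name
# for a timed run may differ; check `aiperf profile --help`.
aiperf profile \
  --model granite4:350m \
  --url http://localhost:11434/v1 \
  --endpoint-type chat \
  --streaming \
  --concurrency 50 \
  --warmup-request-count 10 \
  --benchmark-duration 60
```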
The results were alarming:
| Metric | avg | p50 | p99 |
| --- | --- | --- | --- |
| TTFT (ms) | 41,660.92 | 50,870.37 | 64,201.68 |
| TTST (ms) | 10.21 | 10.11 | 13.10 |
| ITL (ms) | 10.38 | 10.18 | 13.29 |

Output throughput: 4.86 tokens/sec/user
Request throughput: 0.88 req/sec

The average TTFT skyrocketed to over 41 seconds, a 186-fold increase from the baseline. At the 99th percentile, users waited more than 64 seconds just to receive the first token. While the monitoring dashboard might still display green indicators, real users would experience a blank screen for over a minute before any response appeared.
This discrepancy between isolated and multi-user testing highlights a critical oversight in many LLM deployment strategies. What works for one user can collapse under shared load, especially in environments without dedicated high-performance GPUs.
Goodput: the metric that reveals true user impact
The third test introduced a concept often overlooked in performance evaluation: goodput. By setting a strict service-level objective (SLO) of 500 milliseconds for TTFT and using AIPerf's `--goodput` flag, the developer measured only requests that met the performance target.
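Expressed as a command, the goodput run might look like the sketch below. The `metric:threshold` syntax (thresholds in milliseconds) mirrors GenAI-Perf's `--goodput` usage, and the metric key `time_to_first_token` is an assumption to verify against your AIPerf version.

```bash
# Goodput run: same load profile, but count only requests whose TTFT
# stays under the 500 ms SLO. The metric:threshold syntax mirrors
# GenAI-Perf; verify the exact key names for your AIPerf version.
aiperf profile \
  --model granite4:350m \
  --url http://localhost:11434/v1 \
  --endpoint-type chat \
  --streaming \
  --concurrency 50 \
  --warmup-request-count 10 \
  --benchmark-duration 60 \
  --goodput time_to_first_token:500
```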
| Metric | Value |
| --- | --- |
| Request throughput | 0.91 req/sec |
| Goodput | 0.01 req/sec |
| TTFT avg | 37,380.20 ms |
| TTFT p99 | 55,777.69 ms |

While request throughput remained seemingly acceptable at 0.91 requests per second, goodput revealed the harsh truth: only 1% of requests met the SLO. The system was processing requests, but it wasn't serving users effectively. This distinction separates systems that function from systems that deliver meaningful user experiences.
The hidden truth: ITL remains stable while TTFT collapses
One of the most surprising findings was the behavior of inter-token latency (ITL) across all three tests:
| Run | TTFT avg (ms) | ITL avg (ms) |
| --- | --- | --- |
| Single-user | 223.11 | 10.67 |
| Concurrency 50 | 41,660.92 | 10.38 |
| Goodput + SLO | 37,380.20 | 9.71 |

While TTFT degraded catastrophically under load, ITL remained remarkably consistent, hovering around 10 milliseconds in every scenario. This stability indicates that once the model begins processing a request, token generation proceeds efficiently regardless of system load. The bottleneck lies entirely in the prefill phase, where requests queue up waiting for the model to initiate processing.
This insight reframes capacity planning entirely. If ITL were degrading, the solution might involve upgrading hardware or switching to a faster model. But since only TTFT is affected, the issue is fundamentally architectural—rooted in queue management, request routing, or horizontal scaling of the inference server.
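One practical way to act on this insight is a concurrency sweep: rerun the same benchmark at increasing concurrency levels and find the load at which TTFT leaves the SLO while ITL stays flat. A minimal sketch, under the same flag-name assumptions as the earlier commands:

```bash
# Concurrency sweep: locate the load level where TTFT blows past the SLO
# while ITL stays flat. Same flag-name assumptions as the earlier sketches.
for c in 1 5 10 25 50; do
  aiperf profile \
    --model granite4:350m \
    --url http://localhost:11434/v1 \
    --endpoint-type chat \
    --streaming \
    --concurrency "$c" \
    --request-count 100 \
    --goodput time_to_first_token:500
done
```

The knee of that curve marks the queueing capacity of the current serving setup rather than a limit of the model itself, which is exactly the architectural signal capacity planning needs.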
Three lessons for LLM performance engineering
The experiment distilled into three actionable takeaways:
- Never trust single-user baselines. They create false confidence by ignoring realistic load patterns.
- Always test with realistic concurrency. Multi-user scenarios expose TTFT degradation that single-user tests miss.
- Measure goodput against SLOs. Tracking raw throughput tells only part of the story; goodput reveals actual user impact.
The developer’s simple workflow—three commands, three minutes—upended conventional wisdom about LLM performance evaluation. What appeared efficient in isolation failed dramatically under load, demonstrating that architectural decisions often matter more than model selection or hardware specifications.
Beyond the experiment: a call for better benchmarking standards
This test underscores a growing need for standardized performance evaluation in LLM deployments. Tools like NVIDIA AIPerf provide the granularity required to separate TTFT from ITL, measure goodput, and set realistic SLOs. Without these capabilities, teams risk deploying systems that look healthy in dashboards but deliver poor user experiences.
As LLM adoption accelerates across industries, the gap between perceived and actual performance will only widen. The solution lies not in more sophisticated models, but in more rigorous, realistic testing methodologies that account for the complexities of multi-user environments.
The next time you evaluate an LLM deployment, ask yourself: what are you missing in your performance tests?