Speed in LLM Evaluation

Measuring and optimizing latency, throughput, and time-to-result

What is Speed?

Speed in LLM evaluation encompasses two interconnected dimensions that drive production deployment decisions: inference speed (how quickly a model produces a response) and evaluation speed (how quickly you can run a full benchmark suite against a model).

Both dimensions matter. A model that generates tokens in 100ms but takes 2 weeks to evaluate is inefficient for development. Conversely, a model that evaluates instantly but produces responses in 5 seconds creates poor user experience. Speed evaluation requires measuring both.

Why Speed Matters

Inference speed shapes user experience and serving cost: users perceive high time-to-first-token as sluggish, and throughput determines how much hardware a given load requires. Evaluation speed shapes research velocity: every hour shaved off a benchmark run multiplies across a team's experiments. Both feed directly into deployment decisions.

Key Metrics

Speed evaluation requires multiple metrics across the inference and evaluation dimensions:

Time to First Token (TTFT)

Latency from when a request is submitted to when the model returns the first token of the response.

Example: 250ms on consumer hardware, 50ms on optimized inference servers
Why it matters: Perceived responsiveness. Users feel 500ms+ TTFT as slow. Critical for interactive applications (chat, code completion).
Caveat: TTFT alone doesn't reflect full latency. A model with 50ms TTFT but 10 seconds end-to-end latency is still slow.
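A minimal sketch of how TTFT and end-to-end latency can be measured from a streaming response; `fake_stream` is a stand-in for a real token stream from an inference API:

```python
import time

def measure_ttft_and_latency(stream):
    """Measure time-to-first-token and end-to-end latency for a
    token stream (any iterable that yields tokens as they arrive)."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        n_tokens += 1
    total = time.perf_counter() - start
    return ttft, total, n_tokens

# Simulated stream: ~50 ms to first token, then fast generation.
def fake_stream():
    time.sleep(0.05)
    for _ in range(20):
        yield "tok"

ttft, total, n = measure_ttft_and_latency(fake_stream())
print(f"TTFT: {ttft*1000:.0f} ms, end-to-end: {total*1000:.0f} ms, tokens: {n}")
```

The same wrapper works for both metrics at once, which avoids the caveat above: TTFT and end-to-end latency come from a single run.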

Tokens per Second (TPS)

Generation throughput: how many tokens the model produces per second of wall-clock time.

Example: 100 TPS on A100 GPU, 10 TPS on CPU, 5 TPS on edge devices
Why it matters: Understanding generation speed and compute requirements. Used to estimate request-to-response time for different response lengths.
Caveat: TPS varies with batch size, sequence length, and hardware. Needs to be measured under realistic conditions.
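The estimate mentioned above (request-to-response time from TTFT plus decode time at a steady TPS) can be written out; the figures below reuse example numbers from this section:

```python
def tokens_per_second(n_tokens, gen_seconds):
    """Generation throughput over a measured interval."""
    return n_tokens / gen_seconds

def estimate_latency_ms(ttft_ms, response_tokens, tps):
    """Rough end-to-end estimate: TTFT plus decode time at a steady TPS."""
    return ttft_ms + 1000 * response_tokens / tps

# A 500-token response at 100 TPS (GPU) vs 10 TPS (CPU),
# assuming a 250 ms TTFT in both cases.
print(estimate_latency_ms(250, 500, 100))  # 5250.0 ms
print(estimate_latency_ms(250, 500, 10))   # 50250.0 ms
```

This is a back-of-the-envelope model only: real TPS shifts with batch size and sequence length, as the caveat notes.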

End-to-End Latency

Total time from request submission to complete response delivery, including all overhead (tokenization, routing, post-processing).

Example: 2000ms for a 500-token response on inference API
Why it matters: Real-world performance metric. What users and applications actually measure. Most important for production systems.
Caveat: Varies significantly with request complexity, response length, and system load. Measure under production-like conditions.

P50/P95/P99 Latency

Percentile-based latency distributions. P50 is median, P95 is 95th percentile, P99 is 99th percentile.

Example: P50: 1500ms, P95: 3000ms, P99: 5500ms
Why it matters: Understanding tail latency and SLO requirements. P99 is often the constraint for production systems ("99% of requests must complete in 3 seconds").
Caveat: Easy to miss if only reporting mean latency. A system with mean 1s but P99 10s violates most SLOs despite good average performance.
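Percentile latencies are straightforward to compute from raw samples; this sketch uses Python's `statistics.quantiles` and synthetic data to show how a heavy tail inflates P99 while the median stays put:

```python
import random
import statistics

def latency_percentiles(samples_ms):
    """P50/P95/P99 from raw latency samples (linear interpolation)."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

# 985 "normal" requests around 1 s, plus 15 pathological 10 s requests.
random.seed(0)
samples = [random.gauss(1000, 100) for _ in range(985)] + [10_000] * 15
stats = latency_percentiles(samples)
print(f"mean: {statistics.mean(samples):.0f} ms")
print(f"p50: {stats['p50']:.0f} ms, p95: {stats['p95']:.0f} ms, p99: {stats['p99']:.0f} ms")
```

The mean barely moves, but P99 lands on the outliers, which is exactly why mean-only reporting hides SLO violations.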

Throughput (Requests per Second)

How many concurrent requests a system can handle per second under realistic load.

Example: 100 RPS on batched inference, 10 RPS on single-request setup
Why it matters: System capacity planning. Determines infrastructure cost and user concurrency support.
Caveat: Depends heavily on batching strategy. Unbatched serving often shows 1-10 RPS; batched can reach 100s. Not comparable across different serving strategies.
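A rough way to measure RPS under a fixed concurrency level, with `time.sleep` standing in for a real inference call (an assumption; a real load test would issue actual API requests):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_rps(send_request, n_requests=100, concurrency=8):
    """Issue n_requests at a fixed concurrency and report
    completed requests per second of wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Drain all futures so the timer covers every request.
        list(pool.map(lambda _: send_request(), range(n_requests)))
    elapsed = time.perf_counter() - start
    return n_requests / elapsed

# Stand-in for an inference call: 50 ms of I/O-bound latency.
rps = measure_rps(lambda: time.sleep(0.05), n_requests=40, concurrency=8)
print(f"{rps:.0f} RPS at concurrency 8")
```

Rerunning with different `concurrency` values exposes the caveat above: reported RPS is a property of the serving configuration, not the model alone.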

Evaluation Pipeline Time

Wall-clock time to run complete benchmark suite: generate predictions, score them, and produce final results.

Example: 12 hours for full MMLU-Pro on single GPU, 2 hours on 6-GPU cluster
Why it matters: Development iteration speed. Direct enabler of research velocity. Every hour saved multiplies across team experiments.
Caveat: Parallelization can hide true per-model latency. Report both single-instance and parallelized times.
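A sketch of a parallelized evaluation loop; `fake_item_eval` is a stand-in for one generate-and-score call on a benchmark item. Reporting wall-clock time alongside the item count keeps the parallel speedup explicit:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_eval(items, eval_item, workers=6):
    """Evaluate benchmark items across `workers` parallel workers.
    Returns per-item results and the wall-clock pipeline time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(eval_item, items))
    wall = time.perf_counter() - start
    return results, wall

def fake_item_eval(i):
    time.sleep(0.02)  # simulate one generate + score call
    return i

items = list(range(12))
results, wall = run_eval(items, fake_item_eval, workers=6)
print(f"{len(results)} items in {wall:.2f}s wall-clock")
```

Per the caveat above, the serial time (items × per-item latency) should be reported alongside `wall`, since parallelization hides true per-model latency.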

Speed Benchmarks

Understanding speed requires context. How does a model compare against standard baselines and other models in its class?

Chatbot Arena Latency Rankings

Community-powered rankings of model latency in real chat scenarios. Shows wall-clock response time under live serving conditions. Essential real-world reference for deployment planning.

Artificial Analysis Speed Index

Comprehensive speed benchmarks across major models. Measures inference speed (tokens/second) on standardized hardware. Allows apples-to-apples model comparison.

LMSYS Throughput Comparisons

Measures throughput (requests per second) for different models and batch configurations. Shows scaling behavior and batching impact. Critical for understanding production serving.

Model Size vs. Speed Tradeoff

Typical patterns: 7B models (30-50 TPS on single GPU), 13B models (15-25 TPS), 70B models (3-7 TPS), 400B+ models (1-3 TPS). Quantization can improve speed 2-4x at accuracy cost.

Quantization Impact

FP16 (baseline): 1x speed. INT8: 1.5-2x faster. INT4: 2-4x faster. GGUF quantization: 1.5-3x faster on CPU. Speed gains come with accuracy loss—requires evaluation on your benchmarks.

Hardware-Specific Benchmarks

The same model's speed varies dramatically across hardware. A100, L40S, RTX 4090, M3 Max, TPU—all have different speed profiles. Benchmark on your target hardware.

Optimization Strategies

Speed can be improved across multiple dimensions. Each has tradeoffs:

Model Quantization

Reduce weight precision (INT8, INT4) to cut memory traffic and speed up decoding, typically 1.5-4x, at some accuracy cost.

Speculative Decoding

Use a small draft model to propose several tokens that the large model verifies in one pass, speeding up generation without changing output quality.

KV-Cache Optimization

Reuse and compress attention key/value caches (paging, prefix sharing, cache quantization) to cut per-token compute and memory for long contexts.

Batching Strategies

Group concurrent requests (static or continuous batching) to raise throughput, typically at the cost of some per-request latency.

Parallel Evaluation

Shard benchmark items across GPUs or processes so wall-clock pipeline time drops roughly linearly with worker count.

Streaming Evaluation Results

Score and report results as predictions arrive instead of waiting for the full run, so regressions surface early.

Speed-Accuracy Tradeoff

Speed improvements almost always come with accuracy costs. The key is finding the right balance for your use case:

Pareto Frontier Concept: Not all speed-accuracy combinations are useful. A model that's both slower AND less accurate than another is dominated. Focus on the Pareto frontier—the set of models where you can't improve speed without losing accuracy (or vice versa). Your choice depends on your specific SLOs and requirements.
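The dominance check can be made concrete. This sketch finds the Pareto frontier of hypothetical (speed, accuracy) points; the names and numbers are illustrative only, loosely following the size/speed patterns above:

```python
def pareto_frontier(models):
    """models: list of (name, tps, accuracy). Return the names of models
    not dominated by any other (i.e., no other model is at least as fast
    AND at least as accurate, with a strict improvement in one)."""
    frontier = []
    for name, tps, acc in models:
        dominated = any(
            t >= tps and a >= acc and (t > tps or a > acc)
            for _, t, a in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Illustrative points: (name, tokens/sec, benchmark accuracy).
models = [
    ("70B",       5, 0.78),
    ("13B",      20, 0.70),
    ("7B",       40, 0.65),
    ("7B-int4", 120, 0.60),
    ("13B-int8", 35, 0.64),  # dominated: 7B is both faster and more accurate
]
print(pareto_frontier(models))  # → ['70B', '13B', '7B', '7B-int4']
```

Everything off the frontier can be discarded outright; choosing among the frontier models is where SLOs and accuracy requirements come in.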

Related Resources