Speed in LLM Evaluation
Measuring and optimizing latency, throughput, and time-to-result
What is Speed?
Speed in LLM evaluation encompasses two interconnected dimensions that drive production deployment decisions:
- Model Inference Speed: How fast a model generates tokens. Measured in latency (milliseconds to first token, total request time), throughput (tokens generated per second), and percentile response times (P50, P95, P99). Determines user experience and real-time capability.
- Evaluation Pipeline Speed: How quickly you can run complete benchmarks and get results. Evaluation that takes 48 hours becomes a bottleneck for iteration. Fast evaluation enables rapid experimentation and model improvement cycles.
Both dimensions matter. A model that generates tokens in 100ms but takes 2 weeks to evaluate is inefficient for development. Conversely, a model that evaluates instantly but produces responses in 5 seconds creates poor user experience. Speed evaluation requires measuring both.
Why Speed Matters
- User Experience: Response latency budgets exist in production. Chatbots must respond in under 2 seconds for acceptable UX. Code completion in IDEs must respond in under 500ms. Speed directly determines whether users perceive the system as responsive or sluggish.
- Real-Time Applications: Certain applications require low latency to be viable. Live transcription, real-time translation, code completion, and interactive tutoring systems cannot function with slow models. Speed isn't optional—it's a requirement.
- Evaluation Iteration Velocity: Development organizations measure iteration speed in weeks or days. A full evaluation suite that takes 48 hours becomes a bottleneck. Faster evaluation enables more experiments, faster feedback loops, and quicker model improvement cycles. Speed here is multiplied across your entire research team.
- Cost Implications: Slower models require more compute to serve. A model generating 50 tokens/second needs 2x the infrastructure of a 100 token/second model to serve the same throughput. Speed improvements directly reduce infrastructure costs at scale.
- Competitive Advantage: In deployed systems, speed becomes a feature. Users choose faster tools. First-mover advantage often goes to the system that's both accurate AND fast.
Key Metrics
Speed evaluation requires multiple metrics across the inference and evaluation dimensions:
Time to First Token (TTFT)
Latency from when a request is submitted to when the model returns the first token of the response.
Tokens per Second (TPS)
Generation throughput: how many tokens the model produces per second of wall-clock time.
End-to-End Latency
Total time from request submission to complete response delivery, including all overhead (tokenization, routing, post-processing).
P50/P95/P99 Latency
Percentile-based latency distributions. P50 is the median; P95 and P99 capture tail latency, the slowest 5% and 1% of requests, which often dominate perceived reliability.
Throughput (Requests per Second)
How many concurrent requests a system can handle per second under realistic load.
Evaluation Pipeline Time
Wall-clock time to run complete benchmark suite: generate predictions, score them, and produce final results.
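The metrics above can be measured directly from a streaming response. The sketch below uses a simulated token stream as a stand-in for a real serving API; in practice you would wrap your provider's streaming iterator. The delay values and function names are illustrative, not tied to any particular API.

```python
import time

def simulated_stream(n_tokens=20, first_delay=0.02, per_token=0.002):
    """Stand-in for a streaming LLM API: yields tokens with artificial delays."""
    time.sleep(first_delay)          # prefill/queueing before the first token
    for i in range(n_tokens):
        if i:
            time.sleep(per_token)    # decode time for each subsequent token
        yield f"tok{i}"

def measure_request(stream):
    """Return (ttft_s, total_s, tokens_per_s) for one streamed response."""
    start = time.perf_counter()
    ttft = None
    n = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n += 1
    total = time.perf_counter() - start         # end-to-end latency
    return ttft, total, n / total

def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) over a list of latencies."""
    ranked = sorted(samples)
    idx = max(0, round(p / 100 * len(ranked)) - 1)
    return ranked[idx]

latencies = []
for _ in range(10):
    ttft, total, tps = measure_request(simulated_stream())
    latencies.append(total)

print(f"P50={percentile(latencies, 50):.3f}s  P95={percentile(latencies, 95):.3f}s")
```

Collecting per-request totals and reporting percentiles rather than means is what surfaces the tail behavior that P95/P99 are meant to capture.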
Speed Benchmarks
Understanding speed requires context. How does a model compare against standard baselines and other models in its class?
- Live chat leaderboards: Community-powered rankings of model latency in real chat scenarios. They show wall-clock response time under live serving conditions, an essential real-world reference for deployment planning.
- Standardized inference benchmarks: Comprehensive speed benchmarks across major models, measuring inference speed (tokens/second) on standardized hardware for apples-to-apples comparison.
- Serving throughput benchmarks: Measure throughput (requests per second) for different models and batch configurations, showing scaling behavior and the impact of batching. Critical for understanding production serving.
- Typical generation speeds: 7B models (30-50 TPS on a single GPU), 13B models (15-25 TPS), 70B models (3-7 TPS), 400B+ models (1-3 TPS). Quantization can improve speed 2-4x at some accuracy cost.
- Quantization speedups: FP16 (baseline): 1x. INT8: 1.5-2x faster. INT4: 2-4x faster. GGUF quantization: 1.5-3x faster on CPU. Speed gains come with accuracy loss, so evaluate on your own benchmarks.
- Hardware variation: The same model varies dramatically across hardware. A100, L40S, RTX 4090, M3 Max, and TPU all have different speed profiles. Benchmark on your target hardware.
Optimization Strategies
Speed can be improved across multiple dimensions. Each has tradeoffs:
Model Quantization
- GPTQ: Post-training weight quantization (commonly INT4, with other bit widths supported). Can be up to 4x faster with minimal accuracy loss. Requires calibration on sample data. Well suited for GPU deployment.
- AWQ (Activation-aware Weight Quantization): Similar goal to GPTQ, but protects the weights most important to activations. Often a better accuracy-speed tradeoff; newer, with less hardware support.
- GGUF: The quantized model file format used by llama.cpp. Enables inference on consumer and CPU hardware. 1-3x speedup over unquantized CPU inference, but lower absolute speed than GPU inference.
- Accuracy Tradeoff: INT4 quantization typically costs 1-3% accuracy. Benchmark before deploying. Some models are more sensitive than others.
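Benchmarking before deploying a quantized variant reduces to running both variants on the same eval set and comparing. A minimal sketch, where the prediction lists are placeholders for outputs from your FP16 and INT4 variants:

```python
def compare_variants(labels, preds_fp16, preds_int4):
    """Compare a full-precision and a quantized variant on the same eval set."""
    assert len(labels) == len(preds_fp16) == len(preds_int4)
    n = len(labels)
    acc_fp16 = sum(p == y for p, y in zip(preds_fp16, labels)) / n
    acc_int4 = sum(p == y for p, y in zip(preds_int4, labels)) / n
    agreement = sum(a == b for a, b in zip(preds_fp16, preds_int4)) / n
    return {
        "acc_fp16": acc_fp16,
        "acc_int4": acc_int4,
        "acc_delta": acc_fp16 - acc_int4,   # the 1-3% cost cited above
        "agreement": agreement,             # how often the variants match exactly
    }
```

Tracking per-example agreement as well as aggregate accuracy matters: two variants can have similar overall scores while disagreeing on many individual examples.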
Speculative Decoding
- Use fast, small model to draft tokens. Large model verifies in batches. Can achieve 1.5-3x speedup without accuracy loss. Requires two models, adds complexity.
- Most effective when the draft model is fast and its proposals are frequently accepted by the target model. Works well for predictable text generation, less so for reasoning-heavy tasks.
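The draft-then-verify loop can be sketched with toy deterministic "models" (functions mapping a token sequence to its greedy next token); this is a simplified greedy version, and the mock models are purely illustrative. The key property is that the output is identical to decoding with the target model alone:

```python
def speculative_decode(target_next, draft_next, prompt, max_new=20, k=5):
    """Greedy speculative decoding sketch.

    draft_next / target_next map a token sequence to the (deterministic)
    next token under the small and large models, respectively.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft model cheaply proposes up to k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. Target model verifies each proposal; in a real system these k
        #    checks run as a single batched forward pass.
        for i in range(k):
            expected = target_next(seq + draft[:i])
            if draft[i] != expected:
                seq.extend(draft[:i])      # keep the accepted prefix
                seq.append(expected)       # target's token replaces the miss
                break
        else:
            seq.extend(draft)              # all k proposals accepted
    return seq[:len(prompt) + max_new]

def target_next(seq):          # toy stand-in for the large model
    return (len(seq) * 2) % 10

def draft_next(seq):           # toy small model: wrong about 1/4 of the time
    return (len(seq) * 2) % 10 if len(seq) % 4 else 0

out = speculative_decode(target_next, draft_next, [1, 2])
```

Every accepted token equals the target model's own prediction, which is why the method trades extra compute for speed without changing outputs.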
KV-Cache Optimization
- KV-cache stores key-value tensors for all previous tokens. Becomes bottleneck for long sequences. Optimization strategies: KV-cache quantization (int8/int4), selective attention (drop low-importance cache), hierarchical cache.
- Can improve speed 1.5-2x for long context. Critical for real-world applications where context grows (chat history, documents).
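The cache's footprint, and why it dominates at long context, follows from simple arithmetic: two tensors (K and V) per layer, each of shape heads x head_dim per token. A sketch, using an illustrative 7B-class configuration (models with grouped-query attention have fewer KV heads and scale better):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Memory for the KV-cache: 2 tensors (K and V) per layer.

    bytes_per_elem: 2 for FP16 (baseline), 1 for INT8, 0.5 for INT4
    cache quantization.
    """
    return int(2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem)

# Illustrative 7B-class config (32 layers, 32 KV heads, head_dim 128) at 4k context:
fp16 = kv_cache_bytes(32, 32, 128, 4096)                       # 2 GiB per sequence
int4 = kv_cache_bytes(32, 32, 128, 4096, bytes_per_elem=0.5)   # 4x smaller
```

At 2 GiB per 4k-token sequence in FP16, a modest batch of concurrent long-context requests exhausts GPU memory before compute becomes the limit, which is what makes cache quantization and selective-attention strategies worthwhile.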
Batching Strategies
- Static Batching: Wait for fixed batch size before processing. Simple, predictable, but high latency (waiting for batch to fill).
- Dynamic/Continuous Batching: Process requests as they arrive, batch dynamically. Better latency and throughput trade-off. Standard in production systems.
- Token-Level Batching: Interleave requests at the token level. Maximum efficiency but complex implementation. Used in state-of-the-art inference servers.
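The core decision in dynamic batching, dispatch when the batch is full or when the oldest request has waited too long, can be sketched as a pure function over arrival times. This is a deliberate simplification: real continuous batching also admits and evicts requests mid-generation, token by token.

```python
def form_batches(arrival_times, max_batch, max_wait):
    """Group requests (sorted arrival times, seconds) into dynamic batches.

    A batch is dispatched when it reaches max_batch requests, or when
    more than max_wait has elapsed since its first request arrived.
    """
    batches, current = [], []
    for t in arrival_times:
        if current and (len(current) >= max_batch or t - current[0] > max_wait):
            batches.append(current)   # dispatch: full, or oldest waited too long
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches
```

Tuning max_wait trades latency (requests wait for the batch) against throughput (larger batches use the GPU more efficiently), which is exactly the static-vs-dynamic tradeoff described above.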
Parallel Evaluation
- Run inference on multiple GPUs/instances in parallel. Linear scaling if no I/O bottleneck. Can reduce 48-hour evaluation to 2-4 hours on 12+ GPU cluster.
- Requires proper parallelization: split dataset, avoid duplicates, aggregate results correctly.
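The sharding pattern is straightforward with a worker pool; an order-preserving map avoids duplicate work and makes aggregation trivial. In this sketch `score_example` is a placeholder for your model call plus grader; threads suffice for API-bound evaluation (which is I/O-bound), while local GPU inference would shard across processes or devices instead.

```python
from concurrent.futures import ThreadPoolExecutor

def score_example(example):
    """Placeholder: call your model and grade the output here."""
    prediction = example["input"] * 2          # stand-in for model output
    return 1.0 if prediction == example["target"] else 0.0

def parallel_eval(dataset, n_workers=8):
    """Fan the dataset out across workers; pool.map preserves input order,
    so each example is scored exactly once and aggregation is a simple mean."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(score_example, dataset))
    return sum(scores) / len(scores)

dataset = [{"input": i, "target": i * 2 if i % 3 else -1} for i in range(9)]
accuracy = parallel_eval(dataset)
```

Scaling is near-linear until the scoring step or result I/O becomes the bottleneck, which is the caveat noted above.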
Streaming Evaluation Results
- Start reporting partial results as they complete. Enables early stopping, hypothesis testing before full evaluation. Requires streaming infrastructure but saves time for iterative development.
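A minimal version of streaming reporting with early stopping: report a running score as results arrive, and stop once a normal-approximation confidence interval is tight enough. The reporting cadence and stopping threshold here are arbitrary choices for illustration.

```python
import math

def stream_report(scores_iter, report_every=100, stop_half_width=0.01):
    """Report running accuracy as 0/1 scores stream in; stop early once the
    95% normal-approximation CI half-width falls below stop_half_width."""
    n = correct = 0
    for score in scores_iter:
        n += 1
        correct += score
        if n % report_every == 0:
            acc = correct / n
            half = 1.96 * math.sqrt(acc * (1 - acc) / n)  # 95% CI half-width
            print(f"n={n}  acc={acc:.3f} +/- {half:.3f}")
            if half < stop_half_width:
                break                                     # estimate is tight enough
    return correct / n, n
```

Early stopping like this is what lets you reject a clearly worse model after a fraction of the full benchmark, instead of waiting for the complete run.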
Speed-Accuracy Tradeoff
Speed improvements almost always come with accuracy costs. The key is finding the right balance for your use case:
- Smaller Models: 7B model is 3-5x faster than 70B but typically 10-15% less accurate. Good for cost-sensitive or latency-critical applications.
- Quantization: INT4 quantization is 3-4x faster but costs 1-3% accuracy. Often worth it for deployment. INT8 is safer with minimal accuracy loss.
- Speculative Decoding: Minimal accuracy cost, pure speed gain. But requires two models and is more complex to implement.
- Context Reduction: Using less context is faster but may reduce accuracy. Requires testing on your task.
Pareto Frontier Concept: Not all speed-accuracy combinations are useful. A model that's both slower AND less accurate than another is dominated. Focus on the Pareto frontier—the set of models where you can't improve speed without losing accuracy (or vice versa). Your choice depends on your specific SLOs and requirements.
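Finding the frontier from measured (speed, accuracy) pairs is a small filter: keep every model that no other model beats on both axes. The model names and numbers below are invented for illustration.

```python
def pareto_frontier(models):
    """models: list of (name, tokens_per_sec, accuracy).
    Keep models not dominated by any other on both axes."""
    frontier = []
    for name, tps, acc in models:
        dominated = any(
            tps2 >= tps and acc2 >= acc and (tps2 > tps or acc2 > acc)
            for _, tps2, acc2 in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

models = [
    ("large",      5, 0.90),
    ("medium",    20, 0.85),
    ("small",     45, 0.78),
    ("bad-quant", 18, 0.70),  # dominated by "medium": slower AND less accurate
]
```

Everything on the frontier is a defensible choice; which point you pick depends on your latency SLOs and accuracy requirements.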
Related Resources
Return to the main LLM Evaluation Framework
- Core Benchmarks: Complete list of evaluation benchmarks with speed references
- Tools & Labs: Practical tools for measuring and optimizing inference speed
- Cost Pillar: Speed and cost are deeply linked. Explore the tradeoff.
- Accuracy Pillar: Understanding accuracy-speed tradeoffs in practice
- Other Pillars: Explore other evaluation dimensions: Accuracy, Cost, Safety, and more