Speed in LLM Evaluation
Measuring and optimizing latency, throughput, and time-to-result
What is Speed?
Speed in LLM evaluation encompasses two interconnected dimensions that drive production deployment decisions:
- Model Inference Speed: How fast a model generates tokens. Measured in latency (milliseconds to first token, total request time), throughput (tokens generated per second), and percentile response times (P50, P95, P99). Determines user experience and real-time capability.
- Evaluation Pipeline Speed: How quickly you can run complete benchmarks and get results. Evaluation that takes 48 hours becomes a bottleneck for iteration. Fast evaluation enables rapid experimentation and model improvement cycles.
Both dimensions matter. A model that generates tokens in 100ms but takes 2 weeks to evaluate is inefficient for development. Conversely, a model that evaluates instantly but produces responses in 5 seconds creates poor user experience. Speed evaluation requires measuring both.
Why Speed Matters
- User Experience: Response latency budgets exist in production. Chatbots must respond in under 2 seconds for acceptable UX. Code completion in IDEs must respond in under 500ms. Speed directly determines whether users perceive the system as responsive or sluggish.
- Real-Time Applications: Certain applications require low latency to be viable. Live transcription, real-time translation, code completion, and interactive tutoring systems cannot function with slow models. Speed isn't optional—it's a requirement.
- Evaluation Iteration Velocity: Development organizations measure iteration speed in weeks or days. A full evaluation suite that takes 48 hours becomes a bottleneck. Faster evaluation enables more experiments, faster feedback loops, and quicker model improvement cycles. Speed here is multiplied across your entire research team.
- Cost Implications: Slower models require more compute to serve. A model generating 50 tokens/second needs 2x the infrastructure of a 100 token/second model to serve the same throughput. Speed improvements directly reduce infrastructure costs at scale.
- Competitive Advantage: In deployed systems, speed becomes a feature. Users choose faster tools. First-mover advantage often goes to the system that's both accurate AND fast.
Key Metrics
Speed evaluation requires multiple metrics across the inference and evaluation dimensions:
Time to First Token (TTFT)
Latency from when a request is submitted to when the model returns the first token of the response.
Tokens per Second (TPS)
Generation throughput: how many tokens the model produces per second of wall-clock time.
End-to-End Latency
Total time from request submission to complete response delivery, including all overhead (tokenization, routing, post-processing).
P50/P95/P99 Latency
Percentile-based latency distributions. P50 is the median; P95 and P99 capture tail latency, the slowest 5% and 1% of requests, which often dominate perceived reliability.
Throughput (Requests per Second)
How many concurrent requests a system can handle per second under realistic load.
Evaluation Pipeline Time
Wall-clock time to run complete benchmark suite: generate predictions, score them, and produce final results.
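The metrics above can be measured directly from a streaming response. The sketch below uses a simulated token stream as a stand-in for a real serving API; in practice you would wrap your provider's streaming iterator. The delay values and function names are illustrative, not tied to any particular API.

```python
import time

def simulated_stream(n_tokens=20, first_delay=0.02, per_token=0.002):
    """Stand-in for a streaming LLM API: yields tokens with artificial delays."""
    time.sleep(first_delay)          # prefill/queueing before the first token
    for i in range(n_tokens):
        if i:
            time.sleep(per_token)    # decode time for each subsequent token
        yield f"tok{i}"

def measure_request(stream):
    """Return (ttft_s, total_s, tokens_per_s) for one streamed response."""
    start = time.perf_counter()
    ttft = None
    n = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n += 1
    total = time.perf_counter() - start         # end-to-end latency
    return ttft, total, n / total

def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) over a list of latencies."""
    ranked = sorted(samples)
    idx = max(0, round(p / 100 * len(ranked)) - 1)
    return ranked[idx]

latencies = []
for _ in range(10):
    ttft, total, tps = measure_request(simulated_stream())
    latencies.append(total)

print(f"P50={percentile(latencies, 50):.3f}s  P95={percentile(latencies, 95):.3f}s")
```

Collecting per-request totals and reporting percentiles rather than means is what surfaces the tail behavior that P95/P99 are meant to capture.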
Speed Benchmarks
Understanding speed requires context. How does a model compare against standard baselines and other models in its class?
- Live chat leaderboards: Community-powered rankings of model latency in real chat scenarios. They show wall-clock response time under live serving conditions, an essential real-world reference for deployment planning.
- Standardized inference benchmarks: Comprehensive speed benchmarks across major models, measuring inference speed (tokens/second) on standardized hardware for apples-to-apples comparison.
- Serving throughput benchmarks: Measure throughput (requests per second) for different models and batch configurations, showing scaling behavior and the impact of batching. Critical for understanding production serving.
- Typical generation speeds: 7B models (30-50 TPS on a single GPU), 13B models (15-25 TPS), 70B models (3-7 TPS), 400B+ models (1-3 TPS). Quantization can improve speed 2-4x at some accuracy cost.
- Quantization speedups: FP16 (baseline): 1x. INT8: 1.5-2x faster. INT4: 2-4x faster. GGUF quantization: 1.5-3x faster on CPU. Speed gains come with accuracy loss, so evaluate on your own benchmarks.
- Hardware variation: The same model varies dramatically across hardware. A100, L40S, RTX 4090, M3 Max, and TPU all have different speed profiles. Benchmark on your target hardware.
Optimization Strategies
Speed can be improved across multiple dimensions. Each has tradeoffs:
Model Quantization
- GPTQ: Post-training weight quantization (commonly INT4, with other bit widths supported). Can be up to 4x faster with minimal accuracy loss. Requires calibration on sample data. Well suited for GPU deployment.
- AWQ (Activation-aware Weight Quantization): Similar goal to GPTQ, but protects the weights most important to activations. Often a better accuracy-speed tradeoff; newer, with less hardware support.
- GGUF: The quantized model file format used by llama.cpp. Enables inference on consumer and CPU hardware. 1-3x speedup over unquantized CPU inference, but lower absolute speed than GPU inference.
- Accuracy Tradeoff: INT4 quantization typically costs 1-3% accuracy. Benchmark before deploying. Some models are more sensitive than others.
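Benchmarking before deploying a quantized variant reduces to running both variants on the same eval set and comparing. A minimal sketch, where the prediction lists are placeholders for outputs from your FP16 and INT4 variants:

```python
def compare_variants(labels, preds_fp16, preds_int4):
    """Compare a full-precision and a quantized variant on the same eval set."""
    assert len(labels) == len(preds_fp16) == len(preds_int4)
    n = len(labels)
    acc_fp16 = sum(p == y for p, y in zip(preds_fp16, labels)) / n
    acc_int4 = sum(p == y for p, y in zip(preds_int4, labels)) / n
    agreement = sum(a == b for a, b in zip(preds_fp16, preds_int4)) / n
    return {
        "acc_fp16": acc_fp16,
        "acc_int4": acc_int4,
        "acc_delta": acc_fp16 - acc_int4,   # the 1-3% cost cited above
        "agreement": agreement,             # how often the variants match exactly
    }
```

Tracking per-example agreement as well as aggregate accuracy matters: two variants can have similar overall scores while disagreeing on many individual examples.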
Speculative Decoding
- Use fast, small model to draft tokens. Large model verifies in batches. Can achieve 1.5-3x speedup without accuracy loss. Requires two models, adds complexity.
- Most effective when the draft model is fast and its proposals are frequently accepted by the target model. Works well for predictable text generation, less so for reasoning-heavy tasks.
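The draft-then-verify loop can be sketched with toy deterministic "models" (functions mapping a token sequence to its greedy next token); this is a simplified greedy version, and the mock models are purely illustrative. The key property is that the output is identical to decoding with the target model alone:

```python
def speculative_decode(target_next, draft_next, prompt, max_new=20, k=5):
    """Greedy speculative decoding sketch.

    draft_next / target_next map a token sequence to the (deterministic)
    next token under the small and large models, respectively.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft model cheaply proposes up to k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. Target model verifies each proposal; in a real system these k
        #    checks run as a single batched forward pass.
        for i in range(k):
            expected = target_next(seq + draft[:i])
            if draft[i] != expected:
                seq.extend(draft[:i])      # keep the accepted prefix
                seq.append(expected)       # target's token replaces the miss
                break
        else:
            seq.extend(draft)              # all k proposals accepted
    return seq[:len(prompt) + max_new]

def target_next(seq):          # toy stand-in for the large model
    return (len(seq) * 2) % 10

def draft_next(seq):           # toy small model: wrong about 1/4 of the time
    return (len(seq) * 2) % 10 if len(seq) % 4 else 0

out = speculative_decode(target_next, draft_next, [1, 2])
```

Every accepted token equals the target model's own prediction, which is why the method trades extra compute for speed without changing outputs.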
KV-Cache Optimization
- KV-cache stores key-value tensors for all previous tokens. Becomes bottleneck for long sequences. Optimization strategies: KV-cache quantization (int8/int4), selective attention (drop low-importance cache), hierarchical cache.
- Can improve speed 1.5-2x for long context. Critical for real-world applications where context grows (chat history, documents).
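The cache's footprint, and why it dominates at long context, follows from simple arithmetic: two tensors (K and V) per layer, each of shape heads x head_dim per token. A sketch, using an illustrative 7B-class configuration (models with grouped-query attention have fewer KV heads and scale better):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Memory for the KV-cache: 2 tensors (K and V) per layer.

    bytes_per_elem: 2 for FP16 (baseline), 1 for INT8, 0.5 for INT4
    cache quantization.
    """
    return int(2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem)

# Illustrative 7B-class config (32 layers, 32 KV heads, head_dim 128) at 4k context:
fp16 = kv_cache_bytes(32, 32, 128, 4096)                       # 2 GiB per sequence
int4 = kv_cache_bytes(32, 32, 128, 4096, bytes_per_elem=0.5)   # 4x smaller
```

At 2 GiB per 4k-token sequence in FP16, a modest batch of concurrent long-context requests exhausts GPU memory before compute becomes the limit, which is what makes cache quantization and selective-attention strategies worthwhile.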
Batching Strategies
- Static Batching: Wait for fixed batch size before processing. Simple, predictable, but high latency (waiting for batch to fill).
- Dynamic/Continuous Batching: Process requests as they arrive, batch dynamically. Better latency and throughput trade-off. Standard in production systems.
- Token-Level Batching: Interleave requests at the token level. Maximum efficiency but complex implementation. Used in state-of-the-art inference servers.
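The core decision in dynamic batching, dispatch when the batch is full or when the oldest request has waited too long, can be sketched as a pure function over arrival times. This is a deliberate simplification: real continuous batching also admits and evicts requests mid-generation, token by token.

```python
def form_batches(arrival_times, max_batch, max_wait):
    """Group requests (sorted arrival times, seconds) into dynamic batches.

    A batch is dispatched when it reaches max_batch requests, or when
    more than max_wait has elapsed since its first request arrived.
    """
    batches, current = [], []
    for t in arrival_times:
        if current and (len(current) >= max_batch or t - current[0] > max_wait):
            batches.append(current)   # dispatch: full, or oldest waited too long
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches
```

Tuning max_wait trades latency (requests wait for the batch) against throughput (larger batches use the GPU more efficiently), which is exactly the static-vs-dynamic tradeoff described above.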
Parallel Evaluation
- Run inference on multiple GPUs/instances in parallel. Linear scaling if no I/O bottleneck. Can reduce 48-hour evaluation to 2-4 hours on 12+ GPU cluster.
- Requires proper parallelization: split dataset, avoid duplicates, aggregate results correctly.
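The sharding pattern is straightforward with a worker pool; an order-preserving map avoids duplicate work and makes aggregation trivial. In this sketch `score_example` is a placeholder for your model call plus grader; threads suffice for API-bound evaluation (which is I/O-bound), while local GPU inference would shard across processes or devices instead.

```python
from concurrent.futures import ThreadPoolExecutor

def score_example(example):
    """Placeholder: call your model and grade the output here."""
    prediction = example["input"] * 2          # stand-in for model output
    return 1.0 if prediction == example["target"] else 0.0

def parallel_eval(dataset, n_workers=8):
    """Fan the dataset out across workers; pool.map preserves input order,
    so each example is scored exactly once and aggregation is a simple mean."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(score_example, dataset))
    return sum(scores) / len(scores)

dataset = [{"input": i, "target": i * 2 if i % 3 else -1} for i in range(9)]
accuracy = parallel_eval(dataset)
```

Scaling is near-linear until the scoring step or result I/O becomes the bottleneck, which is the caveat noted above.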
Streaming Evaluation Results
- Start reporting partial results as they complete. Enables early stopping, hypothesis testing before full evaluation. Requires streaming infrastructure but saves time for iterative development.
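A minimal version of streaming reporting with early stopping: report a running score as results arrive, and stop once a normal-approximation confidence interval is tight enough. The reporting cadence and stopping threshold here are arbitrary choices for illustration.

```python
import math

def stream_report(scores_iter, report_every=100, stop_half_width=0.01):
    """Report running accuracy as 0/1 scores stream in; stop early once the
    95% normal-approximation CI half-width falls below stop_half_width."""
    n = correct = 0
    for score in scores_iter:
        n += 1
        correct += score
        if n % report_every == 0:
            acc = correct / n
            half = 1.96 * math.sqrt(acc * (1 - acc) / n)  # 95% CI half-width
            print(f"n={n}  acc={acc:.3f} +/- {half:.3f}")
            if half < stop_half_width:
                break                                     # estimate is tight enough
    return correct / n, n
```

Early stopping like this is what lets you reject a clearly worse model after a fraction of the full benchmark, instead of waiting for the complete run.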
Speed-Accuracy Tradeoff
Speed improvements almost always come with accuracy costs. The key is finding the right balance for your use case:
- Smaller Models: 7B model is 3-5x faster than 70B but typically 10-15% less accurate. Good for cost-sensitive or latency-critical applications.
- Quantization: INT4 quantization is 3-4x faster but costs 1-3% accuracy. Often worth it for deployment. INT8 is safer with minimal accuracy loss.
- Speculative Decoding: Minimal accuracy cost, pure speed gain. But requires two models and is more complex to implement.
- Context Reduction: Using less context is faster but may reduce accuracy. Requires testing on your task.
Pareto Frontier Concept: Not all speed-accuracy combinations are useful. A model that's both slower AND less accurate than another is dominated. Focus on the Pareto frontier—the set of models where you can't improve speed without losing accuracy (or vice versa). Your choice depends on your specific SLOs and requirements.
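Finding the frontier from measured (speed, accuracy) pairs is a small filter: keep every model that no other model beats on both axes. The model names and numbers below are invented for illustration.

```python
def pareto_frontier(models):
    """models: list of (name, tokens_per_sec, accuracy).
    Keep models not dominated by any other on both axes."""
    frontier = []
    for name, tps, acc in models:
        dominated = any(
            tps2 >= tps and acc2 >= acc and (tps2 > tps or acc2 > acc)
            for _, tps2, acc2 in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

models = [
    ("large",      5, 0.90),
    ("medium",    20, 0.85),
    ("small",     45, 0.78),
    ("bad-quant", 18, 0.70),  # dominated by "medium": slower AND less accurate
]
```

Everything on the frontier is a defensible choice; which point you pick depends on your latency SLOs and accuracy requirements.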
Related Resources
Return to the main LLM Evaluation Framework
- Core Benchmarks: Complete list of evaluation benchmarks with speed references
- Tools & Labs: Practical tools for measuring and optimizing inference speed
- Cost Pillar: Speed and cost are deeply linked. Explore the tradeoff.
- Accuracy Pillar: Understanding accuracy-speed tradeoffs in practice
- Other Pillars: Explore other evaluation dimensions: Accuracy, Cost, Safety, and more