Foundations of LLM Evaluation

Core evaluation theory, metrics design, statistical rigor, and first principles

Why Evaluation Matters

Evaluation is not optional—it is the foundational practice that separates rigorous AI deployment from wishful thinking. Without systematic evaluation, organizations face compounded risks across business, technical, and safety dimensions.

Business Impact

Model selection decisions carry significant economic implications. A difference of 16% in model capability can reduce operational costs per resolved task by 40% when accounting for escalations and failures. Beyond cost, inadequate evaluation creates substantial risk: a single production failure due to insufficient evaluation testing can cost upwards of $500K in direct costs plus millions in reputational damage and remediation.

Technical Correctness

Benchmark performance and real-world performance diverge significantly. Frontier models score 85-95% on standard benchmarks, making differentiation difficult. Distribution mismatch, benchmark saturation, and the contamination problem (where benchmarks appear in training data) all mean that evaluation must extend beyond published benchmarks to task-specific, real-world data to inform deployment decisions.

Safety and Compliance

Without explicit safety evaluation, harmful capabilities can remain undetected in production systems. The EU AI Act, US Executive Order on AI, and sector-specific regulations now mandate rigorous evaluation documentation and performance guarantees. Evaluation is essential for demonstrating reasonable care, due diligence, and compliance with emerging regulatory frameworks.

Taxonomy of Evaluations

Evaluation varies along multiple critical dimensions. Understanding these dimensions is essential for designing evaluations that match your specific needs and inform the right decisions.

Intrinsic vs. Extrinsic

Intrinsic evaluation measures model capabilities in isolation using benchmarks and standardized tests. It's fast, reproducible, and excellent for early-stage capability assessment and model screening. Extrinsic evaluation measures performance in real-world applications, capturing task-specific success and business outcomes. It's slower and more expensive but provides the strongest signal for deployment decisions. Best practice: combine both—use intrinsic for rapid iteration, extrinsic for final validation.

Automated vs. Human Evaluation

Automated metrics like BLEU, ROUGE, and BERTScore are fast and scalable but often misaligned with human judgment. Human evaluation has the best correlation with user experience but is expensive and slow. Hybrid approaches combine both: use automated metrics for screening, then human annotators for high-uncertainty or high-stakes cases.

Static vs. Dynamic vs. Live Benchmarks

Static benchmarks (MMLU, HumanEval) provide reproducible, comparable results but are vulnerable to contamination and saturation. Dynamic benchmarks generate fresh evaluation tasks, reducing contamination risk. Live benchmarks (Chatbot Arena) evaluate on real user queries, providing the strongest signal, though annotation quality varies from query to query.

Pre-Deployment vs. Continuous

Pre-deployment evaluation is a comprehensive gate for production—it should cover capability, safety, alignment, and task-specific dimensions. Continuous evaluation monitors performance in production, detecting degradation and data drift. Both are essential; pre-deployment evaluation without continuous monitoring leaves organizations blind to post-launch failures.

Core Metrics

Different metrics capture different dimensions of model quality. None is perfect alone—combine them for comprehensive understanding:

Exact Match (EM)

Binary exact string match between model output and reference answer, typically computed after normalization (lowercasing, stripping punctuation and articles).

When to Use
Factoid QA, standardized test questions, short unambiguous answers
Formula
EM = (# exact matches) / (# total samples)
Pitfalls
Too strict for open-ended answers; misses semantically correct paraphrases
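
As a minimal sketch of the formula above, the following computes EM with SQuAD-style normalization (the normalization steps shown are a common convention, not a universal standard):

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(predictions, references) -> float:
    """Fraction of predictions that exactly match their reference after normalization."""
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(predictions)

preds = ["The Eiffel Tower", "paris", "1889"]
refs  = ["Eiffel Tower",     "Paris", "1890"]
print(exact_match(preds, refs))  # 2 of 3 match after normalization
```

Note how normalization rescues "The Eiffel Tower" vs. "Eiffel Tower", but no normalization can rescue a semantically correct paraphrase.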

F1 Score

Token-level overlap balancing precision and recall of matched tokens.

When to Use
Extractive QA, span-based answers, evaluating partial correctness
Formula
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Pitfalls
Doesn't capture word order or semantic meaning; treats all tokens equally
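
A minimal token-level F1, using multiset overlap as in the SQuAD evaluation script (whitespace tokenization is an assumption; real evaluators normalize first):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over overlapping tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "the cat is on the mat"))  # 5/6 tokens overlap
```

Note the pitfall in action: swapping word order leaves this score unchanged.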

BLEU Score

N-gram overlap between generated and reference text, widely used for translation.

When to Use
Machine translation, summarization, text generation tasks
Formula
BLEU = BP × exp(Σ log(pₙ)) where pₙ = matched n-grams / total n-grams
Pitfalls
Poor human correlation; punishes valid paraphrases; tokenization-sensitive
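
A sentence-level sketch of the formula above with uniform weights and no smoothing (production work should prefer an established implementation such as sacreBLEU, which standardizes tokenization and smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence BLEU: brevity penalty x geometric mean of clipped n-gram precisions."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        # Clipping stops a candidate from farming credit by repeating a matched n-gram.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(len(cand) - n + 1, 0)
        if clipped == 0 or total == 0:
            return 0.0  # real implementations apply smoothing instead of zeroing out
        log_precisions.append(math.log(clipped / total))
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # identical -> 1.0
```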

ROUGE Score

Recall-focused n-gram overlap, standard for summarization evaluation.

When to Use
Abstractive summarization, document generation evaluation
Formula
ROUGE-N = Σ(matched N-grams) / Σ(total reference N-grams)
Pitfalls
Length-biased; doesn't penalize factual errors if n-grams overlap
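
The recall form of the formula above can be sketched in a few lines (whitespace tokenization assumed; ROUGE-L and stemming variants are omitted):

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 2) -> float:
    """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
    def grams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum((cand & ref).values())  # multiset intersection clips repeats
    return matched / total

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1))  # 5/6 unigrams recalled
```

The length bias is visible in the formula: a long candidate can only gain recall, never lose it.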

BERTScore

Semantic similarity via contextual embeddings, capturing meaning beyond n-grams.

When to Use
Open-ended generation, paraphrase detection, semantic evaluation
Formula
Recall = avg over reference tokens of max cosine_similarity to any candidate token; precision is the symmetric quantity; F1 combines both
Pitfalls
Doesn't guarantee factual accuracy; sensitive to embedding model quality
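
The greedy-matching step can be illustrated with toy 2-d vectors standing in for contextual BERT embeddings (real BERTScore runs a pretrained model per token; the vectors below are invented purely for illustration):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bertscore_f1(pred_embs, ref_embs):
    """Greedy matching: each token is scored against its most similar counterpart."""
    recall = sum(max(cosine(r, p) for p in pred_embs) for r in ref_embs) / len(ref_embs)
    precision = sum(max(cosine(p, r) for r in ref_embs) for p in pred_embs) / len(pred_embs)
    return 2 * precision * recall / (precision + recall)

# Toy per-token "embeddings"; in practice these come from a contextual encoder.
pred = [[1.0, 0.0], [0.6, 0.8]]
ref  = [[1.0, 0.0], [0.0, 1.0]]
print(bertscore_f1(pred, ref))
```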

Perplexity

How surprised the model is by the true sequence; standard for language model evaluation.

When to Use
Language model pre-training, fine-tuning, next-token prediction quality
Formula
PPL = exp(-1/N Σ log p(wᵢ)) where p(wᵢ) is token probability
Pitfalls
Small perplexity gaps can hide large per-token probability differences; tokenization-dependent, so not comparable across models with different vocabularies
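
Given the per-token log-probabilities a model assigns to the true sequence, the formula above is a one-liner:

```python
import math

def perplexity(token_log_probs):
    """PPL = exp(-1/N * sum of log-probabilities of the true tokens)."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# If the model assigns every true token probability 0.25, perplexity is 4:
# the model is as "surprised" as a uniform guess over 4 options.
log_probs = [math.log(0.25)] * 10
print(perplexity(log_probs))
```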

Statistical Rigor

A metric score alone is meaningless without uncertainty quantification and statistical justification. "Model A achieves 87.3% accuracy" hides critical information about variance, significance, and reliability.

Confidence Intervals

Always report confidence intervals, not just point estimates. The bootstrap method is the workhorse: resample your evaluation set with replacement 1,000-10,000 times, compute the metric each time, and use the 2.5% and 97.5% percentiles as your 95% confidence interval. This works for any metric and requires no distributional assumptions.
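
The percentile bootstrap described above fits in a dozen lines; the per-sample scores below are hypothetical:

```python
import random
import statistics

def bootstrap_ci(scores, metric=statistics.mean, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, recompute the metric each time."""
    rng = random.Random(seed)
    n = len(scores)
    stats = sorted(metric(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return metric(scores), (lo, hi)

# Per-sample correctness (1 = correct) from a hypothetical 100-item eval set.
scores = [1] * 87 + [0] * 13
point, (lo, hi) = bootstrap_ci(scores)
print(f"accuracy = {point:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

Because the resampling works on per-sample scores, the same function handles accuracy, F1, or any metric computable from a resampled set.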

Effect Sizes

When comparing two models, report effect sizes (Cohen's d or Cliff's delta) alongside p-values. An effect size quantifies the magnitude of difference: small effects require many samples to detect, large effects are detectable with few samples. A statistically significant difference is not necessarily practically meaningful.
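
Cohen's d is just the mean difference scaled by the pooled standard deviation; a minimal sketch (the three-point inputs are contrived to give a clean value):

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d: mean difference in units of pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

print(cohens_d([1, 2, 3], [0, 1, 2]))  # 1.0, conventionally a "large" effect
```

Rules of thumb: d ≈ 0.2 small, 0.5 medium, 0.8 large; for heavily non-normal score distributions, Cliff's delta is the safer choice.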

Paired vs. Unpaired Tests

If both models are evaluated on the same samples, use paired tests (paired t-test, Wilcoxon signed-rank). Paired tests eliminate sample-to-sample variance and have more statistical power. If different samples are used, use unpaired tests. This choice fundamentally affects statistical conclusions.
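
A paired permutation test is one distribution-free way to exploit the pairing: under the null, each per-sample difference is equally likely to have either sign. This is a sketch with hypothetical per-sample correctness vectors, not a replacement for a library routine:

```python
import random

def paired_permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided paired test: randomly flip the sign of each per-sample difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_permutations):
        permuted = abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
        hits += permuted >= observed
    return hits / n_permutations  # approximate two-sided p-value

# Per-sample correctness of two models on the SAME eval items (hypothetical).
a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
b = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1]
print(f"p = {paired_permutation_test(a, b):.3f}")
```

Only the items where the models disagree contribute, which is exactly the variance reduction pairing buys you.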

Multiple Comparisons Correction

Comparing N models means N(N-1)/2 pairwise tests, each with a 5% false positive rate at α = 0.05. Comparing 10 models yields 45 tests and roughly a 90% chance (1 − 0.95⁴⁵) of at least one false positive by chance alone. Correct using Bonferroni (conservative; divides α by the number of tests M), Holm (uniformly more powerful than Bonferroni), or Benjamini-Hochberg FDR control (for many tests).
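
The Holm step-down procedure is simple enough to sketch directly (the p-values below are illustrative):

```python
def holm_correction(p_values, alpha=0.05):
    """Holm step-down: test the i-th smallest p-value against alpha / (m - i)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Four hypothetical pairwise comparisons; only the first survives correction.
pvals = [0.001, 0.02, 0.04, 0.30]
print(holm_correction(pvals))  # [True, False, False, False]
```

Note that 0.02 and 0.04 would pass uncorrected at α = 0.05 but do not survive Holm, which is precisely the protection against multiple-comparison false positives.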

Prompt Sensitivity

Evaluate models across multiple prompt variations. Report mean performance and standard deviation across prompts. A model with 87.3% on one prompt might be 83% or 91% on another. Stable performance across prompts is stronger evidence of genuine capability.
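
Reporting across prompt variants needs nothing beyond the stdlib; the accuracies below are hypothetical:

```python
import statistics

# Accuracy of one model under five paraphrased prompt templates (hypothetical numbers).
accuracies = {
    "prompt_v1": 0.873, "prompt_v2": 0.861, "prompt_v3": 0.892,
    "prompt_v4": 0.845, "prompt_v5": 0.880,
}
mean = statistics.mean(accuracies.values())
std = statistics.stdev(accuracies.values())
print(f"accuracy = {mean:.3f} ± {std:.3f} across {len(accuracies)} prompts")
```

Reporting 0.870 ± 0.018 tells a far more honest story than quoting the best single prompt's 0.892.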

Designing Your Evaluation Strategy

No single evaluation approach is best. Your strategy depends on your specific question, constraints, and use case. Consider these dimensions:

Capability Assessment

Use intrinsic benchmarks for initial capability screening (fast, standardized), then extrinsic task-specific evaluation on your actual use case data (slower, more informative). Measure accuracy, F1, semantic similarity, and task-specific metrics. Always disaggregate results by subgroup (task type, difficulty level, domain) to identify weaknesses.

Safety Evaluation

Pre-deployment: conduct red teaming and adversarial testing. Test hallucination (using tools like SelfCheckGPT), jailbreak attempts, bias and toxicity, and alignment with intended constraints. Continuous: maintain automated safety monitoring (toxicity classifiers, refusal consistency checks, factuality monitoring). Both are mandatory for production systems.

Statistical Design

Before evaluating, use power analysis to determine the required sample size: specify the smallest effect size you care about and the desired power (80-90% is standard). For example, detecting a medium effect (d=0.5) with 80% power requires ~64 samples per group. Report confidence intervals, effect sizes, and multiple comparison corrections for all results.
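
The normal-approximation version of that sample-size calculation can be done with the stdlib alone:

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation n per group for a two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for a two-sided 5% test
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(sample_size_per_group(0.5))  # 63; the exact t-test correction nudges this to ~64
print(sample_size_per_group(0.2))  # small effects need far larger samples
```

The quadratic dependence on effect size is the key lesson: halving the detectable effect quadruples the required samples.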

Practical Tips

  • Use Multiple Metrics: No single metric captures quality fully. For generation tasks, combine n-gram metrics (BLEU/ROUGE) with semantic metrics (BERTScore). For classification, use accuracy, precision, recall, and F1.
  • Always Include Human Evaluation: Automated metrics are proxies, not ground truth. Have humans evaluate 5-10% of the data in a stratified sample to calibrate automated scores and catch metric blind spots. Report inter-annotator agreement (Cohen's kappa or Krippendorff's alpha).
  • Test Adversarial Inputs: Evaluate on edge cases, out-of-distribution examples, and adversarial inputs. A model with 95% on clean data might collapse to 60% on adversarial examples.
  • Monitor in Production: Set up continuous evaluation pipelines. Accuracy degrades over time as user queries shift and knowledge becomes outdated. Monitor weekly or monthly using lightweight automated metrics.
  • Set Minimum Thresholds: Define what "good enough" looks like before deployment. For healthcare: 98%+. For customer support: 90%+. For creative writing: 70%+. Make requirements explicit.
  • Disaggregate Results: Report accuracy not overall, but by task type, difficulty level, topic, or demographic group. A model might be 92% accurate overall but only 70% on non-English inputs.
  • Evaluate Early and Often: Run evaluation during development to identify high-error patterns. Use results to prioritize improvements or data collection. Avoid surprises at deployment time.
  • Report Methodology in Detail: Document prompt templates, metric computation, answer normalization, sample sizes, confidence intervals, and assumptions. Reproducibility depends on methodological clarity.
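
For the inter-annotator agreement mentioned in the tips above, Cohen's kappa for two annotators is short enough to sketch (the labels below are hypothetical quality judgments):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled independently at their own rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["good", "good", "bad", "good", "bad", "good"]
b = ["good", "bad",  "bad", "good", "bad", "good"]
print(round(cohens_kappa(a, b), 3))
```

Kappa discounts the agreement two annotators would reach by chance alone; for more than two annotators or missing labels, Krippendorff's alpha generalizes the idea.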

Related Resources