Production Evaluation
Continuous evaluation, monitoring, governance, and EvalOps at scale
From Development to Production
Pre-deployment evaluations provide confidence that a model meets requirements before launch. However, production behavior diverges from lab testing due to real-world complexity, shifting data distributions, and unexpected edge cases. Evaluation therefore doesn't stop at deployment, for several reasons:
- Distribution Shift: Real-world data differs from curated test sets. Seasonal patterns, domain shifts, and data quality variations emerge in production that weren't visible during development.
- Time-Varying Performance: Model behavior drifts over time as user queries evolve, knowledge becomes outdated, and feedback loops introduce new patterns. Performance that meets requirements today may degrade tomorrow.
- Latency & Scale Effects: Lab evaluations often run on small batches with artificial conditions. Production reveals bottlenecks, timeout issues, and performance degradation under real load that offline testing misses.
- User Satisfaction vs. Metrics: Automated metrics are proxies for user satisfaction. Production feedback (ratings, complaints, support tickets) reveals gaps between metric scores and actual user experience.
- Edge Cases & Adversarial Inputs: Real users find corner cases and adversarial inputs that curated test sets never include. Continuous evaluation catches these failure modes early.
Continuous Evaluation
Production monitoring requires three complementary strategies: drift detection to identify distribution shifts, regression monitoring to catch quality degradation, and scheduled evaluation runs to continuously validate model performance:
Drift Detection
Detect when production data distribution diverges from training data. Use statistical tests (Kolmogorov-Smirnov, Jensen-Shannon divergence, or Population Stability Index) to identify feature drift, label shift, and concept drift. Set thresholds that trigger investigation and re-evaluation.
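As a minimal sketch, drift between a reference (training-time) sample and a production sample can be scored with a two-sample Kolmogorov-Smirnov statistic and the Population Stability Index. Function names and thresholds here are illustrative; a library such as SciPy also provides the KS test with a p-value.

```python
import numpy as np

def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    reference, current = np.sort(reference), np.sort(current)
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / len(reference)
    cdf_cur = np.searchsorted(current, grid, side="right") / len(current)
    return float(np.max(np.abs(cdf_ref - cdf_cur)))

def psi(reference, current, bins=10):
    """Population Stability Index over histogram bins of the reference sample."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # floor tiny proportions to avoid log(0)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

def check_drift(reference, current, psi_threshold=0.2, ks_threshold=0.1):
    """Flag drift when either score crosses its (illustrative) threshold."""
    scores = {"psi": psi(reference, current),
              "ks": ks_statistic(reference, current)}
    scores["drifted"] = scores["psi"] > psi_threshold or scores["ks"] > ks_threshold
    return scores
```

A PSI above ~0.2 is a common rule of thumb for "significant shift", but thresholds should be calibrated per feature rather than taken as universal.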
Regression Monitoring
Track quality metrics continuously. Define baseline performance and alert when metrics drop below thresholds. Implement confidence intervals around metrics to account for sample size variation. Monitor subgroup performance separately to catch disparate impact.
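A regression check with a confidence interval might look like the following sketch: alert only when even the upper bound of the interval sits below the baseline, so small-sample noise doesn't page anyone. The normal approximation and the alert rule are illustrative choices.

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Normal-approximation 95% confidence interval for an accuracy estimate."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

def regression_alert(correct, total, baseline):
    """Alert only when the entire interval falls below baseline, so
    sample-size variation alone doesn't trigger notifications."""
    _, upper = accuracy_ci(correct, total)
    return upper < baseline
```

Running the same check per subgroup (with each subgroup's own sample size) is a direct way to catch disparate impact that the aggregate metric hides.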
Scheduled Evaluation Runs
Run comprehensive evaluations on a regular schedule (weekly, bi-weekly, monthly). Sample production data to evaluate on realistic inputs. Compare results against baseline to quantify performance changes. Use evaluation results to trigger model retraining, hyperparameter tuning, or rollback decisions.
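The compare-and-decide step of a scheduled run can be as simple as thresholding the score delta against the baseline. The tolerances below are illustrative, not prescriptive.

```python
def decide(current_score, baseline_score,
           investigate_tol=0.02, rollback_tol=0.05):
    """Map a scheduled run's score delta to an action."""
    delta = current_score - baseline_score
    if delta <= -rollback_tol:
        return "rollback"        # severe regression: revert or retrain now
    if delta <= -investigate_tol:
        return "investigate"     # meaningful drop: human review, maybe retrain
    return "ok"                  # within normal variation
```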
EvalOps Architecture
EvalOps is the operational infrastructure for running evaluations at scale. A production-grade evaluation system consists of integrated components handling data collection, evaluation execution, result storage, and decision support:
Data Pipeline
Samples production requests and ground truth labels. Manages data quality checks, deduplication, and privacy-preserving data collection. Routes clean, labeled data to the evaluation runner.
Eval Runner
Orchestrates evaluation job execution. Manages parallelization across distributed workers. Handles retries, timeouts, and resource allocation. Produces standardized evaluation result formats.
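A minimal eval-runner sketch using a thread pool, with per-case retries and a standardized result dict. The task function and result format are placeholders; per-task timeouts and resource allocation are omitted for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

def run_with_retries(fn, case, retries=2):
    """Run one evaluation case, retrying on failure; always return a result dict."""
    last_err = None
    for _ in range(retries + 1):
        try:
            return {"case": case, "ok": True, "result": fn(case)}
        except Exception as err:
            last_err = err
    return {"case": case, "ok": False, "error": str(last_err)}

def run_eval(fn, cases, workers=8):
    """Fan evaluation cases out across workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_with_retries, fn, c) for c in cases]
        return [f.result() for f in futures]
```

Returning a uniform record even on failure (rather than raising) keeps downstream aggregation simple and makes failure rates themselves a queryable metric.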
Metrics Store
Centralized database for evaluation results. Stores raw scores, aggregated metrics, and historical trends. Enables time-series analysis and anomaly detection. Supports efficient querying by metric, time range, and model version.
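As a sketch of the storage layer, a single SQLite table keyed by metric, model version, and timestamp already supports the time-series queries described above. The schema and column names are illustrative.

```python
import sqlite3

def init_store(conn):
    """Create the (illustrative) metrics table if it doesn't exist."""
    conn.execute("""CREATE TABLE IF NOT EXISTS eval_metrics (
        metric TEXT, model_version TEXT, run_at TEXT, value REAL)""")

def record(conn, metric, version, run_at, value):
    """Append one evaluation result."""
    conn.execute("INSERT INTO eval_metrics VALUES (?, ?, ?, ?)",
                 (metric, version, run_at, value))

def trend(conn, metric, version):
    """Time-ordered history of one metric for one model version."""
    return conn.execute(
        "SELECT run_at, value FROM eval_metrics "
        "WHERE metric = ? AND model_version = ? ORDER BY run_at",
        (metric, version)).fetchall()
```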
Alerting
Detects quality regressions and anomalies. Compares current metrics against baselines and thresholds. Triggers notifications to engineering teams. Provides context (severity, affected subgroups, recommended actions).
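A sketch of alert construction with context: compare each subgroup against the baseline and attach a severity band. The tolerance and severity thresholds are illustrative.

```python
def subgroup_alerts(metric_by_group, baseline, tolerance=0.02):
    """Build alerts carrying severity and the affected subgroup as context."""
    alerts = []
    for group, value in metric_by_group.items():
        gap = baseline - value
        if gap > tolerance:
            alerts.append({
                "group": group,
                "value": value,
                "severity": "critical" if gap > 2 * tolerance else "warning",
                "action": "investigate subgroup regression",
            })
    return alerts
```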
Dashboard
Real-time visualization of evaluation metrics and trends. Allows stakeholders to explore performance by metric, model version, time period, and subgroup. Supports drill-down into individual evaluation results.
Version Control
Tracks evaluation methodology, test data versions, and metric definitions. Enables reproducibility of results. Documents changes to evaluation criteria over time. Supports rollback to previous configurations.
Model Selection
Production model selection requires balancing multiple dimensions: accuracy, latency, cost, and fairness. Rather than optimizing a single metric, use Pareto-optimal analysis to identify models that maximize value:
- Pareto Optimality: A model is Pareto-optimal if no other model dominates it across all important dimensions. Build the frontier of tradeoff models and let stakeholders choose based on their priorities.
- Cost-Performance Tradeoffs: Plot accuracy vs. inference cost to identify efficient models. A smaller model with 91% accuracy at $0.001/request may be preferable to a larger model with 94% accuracy at $0.005/request.
- Fairness Constraints: Define minimum fairness thresholds (e.g., performance gap < 5% across demographic groups). Filter to models meeting constraints before evaluating accuracy-cost tradeoffs.
- Production Constraints: Account for infrastructure limitations (latency budgets, GPU availability, inference SLA). Models meeting all constraints form the feasible set; optimize within that set.
- Staged Rollout: Deploy to a small percentage of traffic first. Monitor real-world performance before full rollout. Use A/B testing to compare models directly on production data.
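The selection logic above (filter to the feasible set, then keep the Pareto frontier on accuracy vs. cost) can be sketched as follows. The candidate models and the 5% fairness-gap cutoff are illustrative.

```python
def feasible(models, max_fairness_gap=0.05):
    """Apply the fairness constraint before any accuracy-cost tradeoff."""
    return [m for m in models if m["fairness_gap"] <= max_fairness_gap]

def pareto_frontier(models):
    """Keep models not dominated on (higher accuracy, lower cost)."""
    frontier = []
    for m in models:
        dominated = any(
            o["accuracy"] >= m["accuracy"] and o["cost"] <= m["cost"]
            and (o["accuracy"] > m["accuracy"] or o["cost"] < m["cost"])
            for o in models)
        if not dominated:
            frontier.append(m)
    return frontier
```

Both the 91%-accuracy/$0.001 model and the 94%-accuracy/$0.005 model from the example above survive this filter, because neither dominates the other; the choice between them belongs to stakeholders, not to the algorithm.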
Reporting & Governance
Evaluation results drive critical business decisions and must be documented, communicated, and governed properly. Different stakeholders need different information from evaluation reports:
Stakeholder Reports
Technical teams need deep methodological details (metric formulas, test case distribution, confidence intervals). Product managers need trend analysis and business impact (accuracy improvement -> fewer support tickets). Executives need concise summaries and decision recommendations. Compliance teams need audit trails and governance documentation.
Compliance & Audit Trails
Maintain comprehensive evaluation logs documenting who evaluated what model with which test data at what time. Track methodology changes (metric definitions, threshold adjustments). Document approval decisions for production deployments. Support regulatory audits with reproducible evaluation evidence.
Version Control for Evaluations
Treat evaluation methodology as code. Version control test datasets, metric definitions, and evaluation scripts. Document changes to evaluation criteria over time. Enable reproduction of historical evaluation results for audits and postmortems.
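One lightweight way to make an evaluation configuration auditable, as a sketch, is to fingerprint a canonical serialization of it; the config keys below are illustrative.

```python
import hashlib
import json

def eval_fingerprint(config):
    """Stable hash of an evaluation configuration: any change to the dataset
    version, metric list, or thresholds yields a new fingerprint, which can be
    stored alongside results to tie them to the exact methodology used."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```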
Practical Tips
- Start Simple, Iterate: Don't build a perfect EvalOps system immediately. Start with basic continuous evaluation (sampled data + weekly re-evaluation), then add drift detection, alerting, and dashboards incrementally.
- Sample Strategically: Sample production data proportionally by subgroup to ensure drift detection covers all populations. Use stratified sampling for balanced representation of rare classes or edge cases.
- Set Actionable Thresholds: Define alert thresholds that trigger concrete actions. "Accuracy drops 2%" should trigger investigation. "Accuracy drops 0.1%" creates alert fatigue. Calibrate thresholds to business impact.
- Monitor Latency and Cost: Evaluation isn't just accuracy. Track inference latency and cost drift over time. A model with constant accuracy but 50% higher latency is a regression for production systems.
- Automate Labeling Where Possible: Manual labeling is expensive. Use weak labels (click-through rates, user ratings), LLM-generated labels, or active learning to scale. Combine weak labels with periodic human review.
- Plan for Distribution Shifts: Expect data to shift. Design evaluation systems to detect and quantify drift. Build retraining pipelines to handle routine concept drift. Document fallback procedures when drift exceeds acceptable limits.
- Version Control Everything: Track model versions, test data versions, and evaluation methodology versions. Reproduce historical results for postmortems. Audit who approved each production deployment and why.
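The "Sample Strategically" tip above can be sketched as a stratified sampler: group production records by a subgroup key and draw up to a fixed quota from each group, so rare populations are represented. The record shape and group key are illustrative.

```python
import random

def stratified_sample(records, key, per_group, seed=0):
    """Draw up to per_group records from each subgroup (deterministic via seed)."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record)
    sample = []
    for members in groups.values():
        k = min(per_group, len(members))  # small groups contribute everything
        sample.extend(rng.sample(members, k))
    return sample
```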
Related Resources
- LLM Evaluation Framework: return to the main framework overview
- Core Tooling & Infrastructure: platforms and tools for implementing production evaluation systems
- Labs & Experiments: interactive tools and notebooks for production evaluation workflows
- Other Pillars: related evaluation dimensions: Accuracy, Efficiency, Robustness, Fairness
- Framework Fundamentals: core concepts and principles underlying production evaluation