Production Evaluation
Continuous evaluation, monitoring, governance, and EvalOps at scale
From Development to Production
Pre-deployment evaluations provide confidence that a model meets requirements before launch. However, production behavior diverges from lab testing due to real-world complexity, shifting data distributions, and unexpected edge cases. Evaluation therefore doesn't stop at deployment, for several reasons:
- Distribution Shift: Real-world data differs from curated test sets. Seasonal patterns, domain shifts, and data quality variations emerge in production that weren't visible during development.
- Time-Varying Performance: Model behavior drifts over time as user queries evolve, knowledge becomes outdated, and feedback loops introduce new patterns. Performance that meets requirements today may degrade tomorrow.
- Latency & Scale Effects: Lab evaluations often run on small batches with artificial conditions. Production reveals bottlenecks, timeout issues, and performance degradation under real load that offline testing misses.
- User Satisfaction vs. Metrics: Automated metrics are proxies for user satisfaction. Production feedback (ratings, complaints, support tickets) reveals gaps between metric scores and actual user experience.
- Edge Cases & Adversarial Inputs: Real users find corner cases and adversarial inputs that curated test sets never include. Continuous evaluation catches these failure modes early.
Continuous Evaluation
Production monitoring requires three complementary strategies: drift detection to identify distribution shifts, regression monitoring to catch quality degradation, and scheduled evaluation runs to continuously validate model performance:
Drift Detection
Detect when production data distribution diverges from training data. Use statistical tests (Kolmogorov-Smirnov, Jensen-Shannon divergence, or Population Stability Index) to identify feature drift, label shift, and concept drift. Set thresholds that trigger investigation and re-evaluation.
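As a minimal sketch, drift between a reference (training-time) sample and a production sample can be scored with a two-sample Kolmogorov-Smirnov statistic and the Population Stability Index. Function names and thresholds here are illustrative; a library such as SciPy also provides the KS test with a p-value.

```python
import numpy as np

def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    reference, current = np.sort(reference), np.sort(current)
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / len(reference)
    cdf_cur = np.searchsorted(current, grid, side="right") / len(current)
    return float(np.max(np.abs(cdf_ref - cdf_cur)))

def psi(reference, current, bins=10):
    """Population Stability Index over histogram bins of the reference sample."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # floor tiny proportions to avoid log(0)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

def check_drift(reference, current, psi_threshold=0.2, ks_threshold=0.1):
    """Flag drift when either score crosses its (illustrative) threshold."""
    scores = {"psi": psi(reference, current),
              "ks": ks_statistic(reference, current)}
    scores["drifted"] = scores["psi"] > psi_threshold or scores["ks"] > ks_threshold
    return scores
```

A PSI above ~0.2 is a common rule of thumb for "significant shift", but thresholds should be calibrated per feature rather than taken as universal.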
Regression Monitoring
Track quality metrics continuously. Define baseline performance and alert when metrics drop below thresholds. Implement confidence intervals around metrics to account for sample size variation. Monitor subgroup performance separately to catch disparate impact.
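A regression check with a confidence interval might look like the following sketch: alert only when even the upper bound of the interval sits below the baseline, so small-sample noise doesn't page anyone. The normal approximation and the alert rule are illustrative choices.

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Normal-approximation 95% confidence interval for an accuracy estimate."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

def regression_alert(correct, total, baseline):
    """Alert only when the entire interval falls below baseline, so
    sample-size variation alone doesn't trigger notifications."""
    _, upper = accuracy_ci(correct, total)
    return upper < baseline
```

Running the same check per subgroup (with each subgroup's own sample size) is a direct way to catch disparate impact that the aggregate metric hides.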
Scheduled Evaluation Runs
Run comprehensive evaluations on a regular schedule (weekly, bi-weekly, monthly). Sample production data to evaluate on realistic inputs. Compare results against baseline to quantify performance changes. Use evaluation results to trigger model retraining, hyperparameter tuning, or rollback decisions.
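The compare-and-decide step of a scheduled run can be as simple as thresholding the score delta against the baseline. The tolerances below are illustrative, not prescriptive.

```python
def decide(current_score, baseline_score,
           investigate_tol=0.02, rollback_tol=0.05):
    """Map a scheduled run's score delta to an action."""
    delta = current_score - baseline_score
    if delta <= -rollback_tol:
        return "rollback"        # severe regression: revert or retrain now
    if delta <= -investigate_tol:
        return "investigate"     # meaningful drop: human review, maybe retrain
    return "ok"                  # within normal variation
```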
EvalOps Architecture
EvalOps is the operational infrastructure for running evaluations at scale. A production-grade evaluation system consists of integrated components handling data collection, evaluation execution, result storage, and decision support:
Data Pipeline
Samples production requests and ground truth labels. Manages data quality checks, deduplication, and privacy-preserving data collection. Routes clean, labeled data to the evaluation runner.
Eval Runner
Orchestrates evaluation job execution. Manages parallelization across distributed workers. Handles retries, timeouts, and resource allocation. Produces standardized evaluation result formats.
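A minimal eval-runner sketch using a thread pool, with per-case retries and a standardized result dict. The task function and result format are placeholders; per-task timeouts and resource allocation are omitted for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

def run_with_retries(fn, case, retries=2):
    """Run one evaluation case, retrying on failure; always return a result dict."""
    last_err = None
    for _ in range(retries + 1):
        try:
            return {"case": case, "ok": True, "result": fn(case)}
        except Exception as err:
            last_err = err
    return {"case": case, "ok": False, "error": str(last_err)}

def run_eval(fn, cases, workers=8):
    """Fan evaluation cases out across workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_with_retries, fn, c) for c in cases]
        return [f.result() for f in futures]
```

Returning a uniform record even on failure (rather than raising) keeps downstream aggregation simple and makes failure rates themselves a queryable metric.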
Metrics Store
Centralized database for evaluation results. Stores raw scores, aggregated metrics, and historical trends. Enables time-series analysis and anomaly detection. Supports efficient querying by metric, time range, and model version.
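As a sketch of the storage layer, a single SQLite table keyed by metric, model version, and timestamp already supports the time-series queries described above. The schema and column names are illustrative.

```python
import sqlite3

def init_store(conn):
    """Create the (illustrative) metrics table if it doesn't exist."""
    conn.execute("""CREATE TABLE IF NOT EXISTS eval_metrics (
        metric TEXT, model_version TEXT, run_at TEXT, value REAL)""")

def record(conn, metric, version, run_at, value):
    """Append one evaluation result."""
    conn.execute("INSERT INTO eval_metrics VALUES (?, ?, ?, ?)",
                 (metric, version, run_at, value))

def trend(conn, metric, version):
    """Time-ordered history of one metric for one model version."""
    return conn.execute(
        "SELECT run_at, value FROM eval_metrics "
        "WHERE metric = ? AND model_version = ? ORDER BY run_at",
        (metric, version)).fetchall()
```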
Alerting
Detects quality regressions and anomalies. Compares current metrics against baselines and thresholds. Triggers notifications to engineering teams. Provides context (severity, affected subgroups, recommended actions).
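A sketch of alert construction with context: compare each subgroup against the baseline and attach a severity band. The tolerance and severity thresholds are illustrative.

```python
def subgroup_alerts(metric_by_group, baseline, tolerance=0.02):
    """Build alerts carrying severity and the affected subgroup as context."""
    alerts = []
    for group, value in metric_by_group.items():
        gap = baseline - value
        if gap > tolerance:
            alerts.append({
                "group": group,
                "value": value,
                "severity": "critical" if gap > 2 * tolerance else "warning",
                "action": "investigate subgroup regression",
            })
    return alerts
```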
Dashboard
Real-time visualization of evaluation metrics and trends. Allows stakeholders to explore performance by metric, model version, time period, and subgroup. Supports drill-down into individual evaluation results.
Version Control
Tracks evaluation methodology, test data versions, and metric definitions. Enables reproducibility of results. Documents changes to evaluation criteria over time. Supports rollback to previous configurations.
Model Selection
Production model selection requires balancing multiple dimensions: accuracy, latency, cost, and fairness. Rather than optimizing a single metric, use Pareto-optimal analysis to identify models that maximize value:
- Pareto Optimality: A model is Pareto-optimal if no other model dominates it across all important dimensions. Build the frontier of tradeoff models and let stakeholders choose based on their priorities.
- Cost-Performance Tradeoffs: Plot accuracy vs. inference cost to identify efficient models. A smaller model with 91% accuracy at $0.001/request may be preferable to a larger model with 94% accuracy at $0.005/request.
- Fairness Constraints: Define minimum fairness thresholds (e.g., performance gap < 5% across demographic groups). Filter to models meeting constraints before evaluating accuracy-cost tradeoffs.
- Production Constraints: Account for infrastructure limitations (latency budgets, GPU availability, inference SLA). Models meeting all constraints form the feasible set; optimize within that set.
- Staged Rollout: Deploy to a small percentage of traffic first. Monitor real-world performance before full rollout. Use A/B testing to compare models directly on production data.
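The selection logic above (filter to the feasible set, then keep the Pareto frontier on accuracy vs. cost) can be sketched as follows. The candidate models and the 5% fairness-gap cutoff are illustrative.

```python
def feasible(models, max_fairness_gap=0.05):
    """Apply the fairness constraint before any accuracy-cost tradeoff."""
    return [m for m in models if m["fairness_gap"] <= max_fairness_gap]

def pareto_frontier(models):
    """Keep models not dominated on (higher accuracy, lower cost)."""
    frontier = []
    for m in models:
        dominated = any(
            o["accuracy"] >= m["accuracy"] and o["cost"] <= m["cost"]
            and (o["accuracy"] > m["accuracy"] or o["cost"] < m["cost"])
            for o in models)
        if not dominated:
            frontier.append(m)
    return frontier
```

Both the 91%-accuracy/$0.001 model and the 94%-accuracy/$0.005 model from the example above survive this filter, because neither dominates the other; the choice between them belongs to stakeholders, not to the algorithm.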
Reporting & Governance
Evaluation results drive critical business decisions and must be documented, communicated, and governed properly. Different stakeholders need different information from evaluation reports:
Stakeholder Reports
Technical teams need deep methodological details (metric formulas, test case distribution, confidence intervals). Product managers need trend analysis and business impact (accuracy improvement -> fewer support tickets). Executives need concise summaries and decision recommendations. Compliance teams need audit trails and governance documentation.
Compliance & Audit Trails
Maintain comprehensive evaluation logs documenting who evaluated what model with which test data at what time. Track methodology changes (metric definitions, threshold adjustments). Document approval decisions for production deployments. Support regulatory audits with reproducible evaluation evidence.
Version Control for Evaluations
Treat evaluation methodology as code. Version control test datasets, metric definitions, and evaluation scripts. Document changes to evaluation criteria over time. Enable reproduction of historical evaluation results for audits and postmortems.
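One lightweight way to make an evaluation configuration auditable, as a sketch, is to fingerprint a canonical serialization of it; the config keys below are illustrative.

```python
import hashlib
import json

def eval_fingerprint(config):
    """Stable hash of an evaluation configuration: any change to the dataset
    version, metric list, or thresholds yields a new fingerprint, which can be
    stored alongside results to tie them to the exact methodology used."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```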
Practical Tips
- Start Simple, Iterate: Don't build a perfect EvalOps system immediately. Start with basic continuous evaluation (sampled data + weekly re-evaluation), then add drift detection, alerting, and dashboards incrementally.
- Sample Strategically: Sample production data proportionally by subgroup to ensure drift detection covers all populations. Use stratified sampling for balanced representation of rare classes or edge cases.
- Set Actionable Thresholds: Define alert thresholds that trigger concrete actions. "Accuracy drops 2%" should trigger investigation. "Accuracy drops 0.1%" creates alert fatigue. Calibrate thresholds to business impact.
- Monitor Latency and Cost: Evaluation isn't just accuracy. Track inference latency and cost drift over time. A model with constant accuracy but 50% higher latency is a regression for production systems.
- Automate Labeling Where Possible: Manual labeling is expensive. Use weak labels (click-through rates, user ratings), LLM-generated labels, or active learning to scale. Combine weak labels with periodic human review.
- Plan for Distribution Shifts: Expect data to shift. Design evaluation systems to detect and quantify drift. Build retraining pipelines to handle routine concept drift. Document fallback procedures when drift exceeds acceptable limits.
- Version Control Everything: Track model versions, test data versions, and evaluation methodology versions. Reproduce historical results for postmortems. Audit who approved each production deployment and why.
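The "Sample Strategically" tip above can be sketched as a stratified sampler: group production records by a subgroup key and draw up to a fixed quota from each group, so rare populations are represented. The record shape and group key are illustrative.

```python
import random

def stratified_sample(records, key, per_group, seed=0):
    """Draw up to per_group records from each subgroup (deterministic via seed)."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record)
    sample = []
    for members in groups.values():
        k = min(per_group, len(members))  # small groups contribute everything
        sample.extend(rng.sample(members, k))
    return sample
```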
Related Resources
- LLM Evaluation Framework: return to the main framework overview
- Core Tooling & Infrastructure: platforms and tools for implementing production evaluation systems
- Labs & Experiments: interactive tools and notebooks for production evaluation workflows
- Other Pillars: related evaluation dimensions: Accuracy, Efficiency, Robustness, Fairness
- Framework Fundamentals: core concepts and principles underlying production evaluation