Hands-On Labs
9 progressive labs, from basic benchmarking to advanced evaluation systems
Lab Overview
The Hands-On Labs pillar provides a progressive learning path from beginner to advanced evaluation techniques. Each lab builds on previous knowledge and introduces new tools, concepts, and frameworks for evaluating language models.
Progressive Learning Path
The labs are organized by difficulty level:
- Green (Beginner): Labs 1-2. Foundational skills with standard benchmarking and LLM-as-judge evaluation.
- Yellow (Intermediate): Labs 3-6. Domain-specific evaluation, RAG evaluation, and statistical analysis.
- Red (Advanced): Labs 7-9. Agentic evaluation, CI/CD integration, and Pareto-optimal model selection.
Prerequisites
- Python 3.11 or higher
- Basic familiarity with command line and Git
- Understanding of machine learning evaluation concepts (though not required for Lab 1)
- API keys for at least one LLM provider (OpenAI, Anthropic, etc.)
- Jupyter notebook environment (recommended for hands-on work)
What You'll Build
By completing all 9 labs, you'll have hands-on experience with:
- Running comprehensive benchmark suites on open-weight and closed models
- Building custom evaluation rubrics and LLM-as-judge systems
- Evaluating RAG systems and domain-specific applications
- Performing statistical analysis and significance testing
- Red-teaming models for safety and robustness
- Evaluating agentic and complex reasoning systems
- Integrating evaluation into CI/CD pipelines
- Making data-driven model selection decisions using Pareto analysis
Lab Directory
Each lab includes step-by-step instructions, code examples, and practical exercises. All labs use Jupyter notebooks and have estimated completion times.
- Lab 1: Install lm-evaluation-harness, run MMLU-Pro and IFEval on open-weight models, and interpret results. Create your first evaluation baseline.
- Lab 2: Design evaluation rubrics, implement pointwise LLM judges using LiteLLM, and measure inter-rater agreement. Calibrate judges for better accuracy.
- Lab 3: Build and evaluate retrieval-augmented generation systems. Measure context relevance, faithfulness, and answer quality using the RAGAS framework.
- Lab 4: Create evaluation suites for specialized domains (legal, medical, technical). Build domain-specific metrics and benchmarks from scratch.
- Lab 5: Perform significance testing, confidence intervals, and power analysis. Compare model performance rigorously with statistical backing.
- Lab 6: Design adversarial prompts to identify model vulnerabilities. Evaluate robustness against jailbreaks, toxic inputs, and edge cases.
- Lab 7: Evaluate complex systems with multi-step reasoning and tool use. Assess trajectory quality, step-level accuracy, and long-horizon planning.
- Lab 8: Automate evaluation in continuous integration pipelines. Set up guardrails, performance monitoring, and regression detection in production.
- Lab 9: Analyze trade-offs between accuracy, cost, and latency. Use Pareto frontier analysis to select optimal models for different use cases.
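The Pareto analysis in Lab 9 can be sketched in a few lines: a model is kept only if no other model beats it on every axis at once. The candidate names and numbers below are illustrative, not real benchmark results.

```python
# Sketch of Pareto-optimal model selection (Lab 9 concept).
# Higher accuracy is better; lower cost and latency are better.

def pareto_frontier(models):
    """Return the models not dominated on (accuracy, cost, latency)."""
    def dominates(a, b):
        # a dominates b if it is at least as good on every axis
        # and strictly better on at least one.
        better_or_equal = (a["accuracy"] >= b["accuracy"]
                           and a["cost"] <= b["cost"]
                           and a["latency"] <= b["latency"])
        strictly_better = (a["accuracy"] > b["accuracy"]
                           or a["cost"] < b["cost"]
                           or a["latency"] < b["latency"])
        return better_or_equal and strictly_better

    return [m for m in models
            if not any(dominates(other, m) for other in models if other is not m)]

# Hypothetical candidates: accuracy (fraction), cost ($/1M tokens), latency (s)
candidates = [
    {"name": "model-a", "accuracy": 0.82, "cost": 15.0, "latency": 2.1},
    {"name": "model-b", "accuracy": 0.78, "cost": 3.0,  "latency": 0.9},
    {"name": "model-c", "accuracy": 0.75, "cost": 4.0,  "latency": 1.2},
]

frontier = pareto_frontier(candidates)
print([m["name"] for m in frontier])  # ['model-a', 'model-b']
```

Here model-c is dominated by model-b (worse on all three axes), while model-a and model-b represent a genuine accuracy-versus-cost trade-off and both stay on the frontier.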
Getting Started
Setup Instructions
All labs assume you have a working Python environment with Jupyter notebooks.
- Create a dedicated workspace:
mkdir -p ~/llm-eval-labs && cd ~/llm-eval-labs
- Clone the evaluation framework repository:
git clone https://github.com/your-org/llm-evaluation-framework.git
cd llm-evaluation-framework
- Create a Python virtual environment:
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install --upgrade pip
pip install -r requirements-labs.txt
- Configure your API credentials (if using closed-source models):
- Create a .env file in the labs directory
- Add your API keys: OPENAI_API_KEY=sk-..., ANTHROPIC_API_KEY=sk-ant-..., etc.
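Inside a notebook, the keys from the .env file need to end up in the process environment. A minimal stdlib-only loader is sketched below, assuming simple KEY=value lines; in practice the python-dotenv package (its load_dotenv function) does the same job.

```python
# Minimal .env loader (stdlib only); python-dotenv's load_dotenv is the
# usual production choice. Assumes plain KEY=value lines.
import os

def load_env(path=".env"):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()

api_key = os.getenv("OPENAI_API_KEY")  # None if the key was never set
```

setdefault is used deliberately so that keys already exported in your shell take precedence over the file.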
Jupyter Environment
Each lab includes a Jupyter notebook for interactive learning and experimentation:
- Start Jupyter from the labs directory:
jupyter notebook
- Navigate to the labs/ directory
- Open the notebook for your lab (e.g., lab-01-basic-benchmark.ipynb)
- Follow the guided steps and run cells sequentially
Prerequisites by Lab
- Labs 1-2: Python 3.11+, pip, virtualenv. API key for at least one LLM provider.
- Lab 3: Completion of Labs 1-2. Vector database (Chroma, Pinecone, or similar).
- Lab 4: Domain expertise or a domain-specific dataset. Access to subject matter experts for annotation.
- Lab 5: Completion of Labs 1-3. Basic statistics knowledge (distributions, p-values).
- Lab 6: Completion of Labs 1-2. Optional: security/ML background for designing effective adversarial prompts.
- Lab 7: Completion of Labs 1-3. Access to a tool-use or agentic framework (LangChain, AutoGen, etc.).
- Lab 8: Completion of Labs 1-5. Familiarity with CI/CD tools (GitHub Actions, GitLab CI, etc.).
- Lab 9: Completion of Labs 1-5. Understanding of multi-objective optimization concepts.
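The significance testing that Lab 5 builds toward can be sketched with a paired bootstrap: resample test items (with replacement) and look at the distribution of the accuracy difference between two models scored on the same items. The per-example scores below are illustrative, not real benchmark data.

```python
# Paired bootstrap sketch for Lab 5: is model B's accuracy gain over
# model A significant? scores_a and scores_b are 0/1 correctness on the
# SAME test items, in the same order.
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000, seed=0, alpha=0.05):
    """Percentile CI for mean(scores_b) - mean(scores_a) via paired bootstrap."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items, keeping pairs
        diffs.append(sum(scores_b[i] - scores_a[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Illustrative scores: model A ~60% accurate, model B ~80%, 200 items
a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 20
b = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0] * 20
lo, hi = bootstrap_diff_ci(a, b)
print(f"95% CI for accuracy difference: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes 0, the difference is significant at the chosen level. Pairing by item matters: it removes per-item difficulty as a source of noise, which an unpaired comparison would not.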
Practical Tips
- Start with Lab 1: Even if you're experienced, Lab 1 establishes the evaluation framework and concepts used in all subsequent labs. Skipping it may cause confusion.
- Run on Representative Data: Don't just use the provided examples. Find or create test cases that reflect your actual use case. Evaluation results on toy data often don't generalize.
- Track Results Systematically: Save all evaluation outputs, model configurations, and benchmark results to a version-controlled directory. Use consistent naming conventions (e.g., eval_2025-04-08_model-v1.json).
- Use Appropriate Compute: Some labs benefit from GPU acceleration. Lab 1 can run on CPU but will be much faster with a GPU. Labs 6-9 may require significant compute, so plan accordingly or use cloud services.
- Iterate on Rubrics: In Lab 2, your initial rubric won't be perfect. Run evaluation, analyze disagreements between human and LLM judges, and refine. Iteration is key.
- Document Decisions: For Labs 4 and 9, document why you chose specific metrics, models, or thresholds. Future-you (and your team) will appreciate the reasoning.
- Validate on Out-of-Distribution Data: Always hold out a test set that's different from your training/evaluation data. Evaluation metrics that look good on in-distribution data may not transfer.
- Control for Confounding Variables: When comparing models, ensure all variables except the one you're testing are controlled. For example, when testing prompt engineering, use identical models and only vary prompts.
- Engage Domain Experts Early: Especially in Labs 4 and 6, involve domain experts in rubric design and adversarial prompt creation. Their insights are invaluable.
- Monitor for Bias: In Labs 5 and 9, disaggregate results by demographic groups, topics, or other relevant dimensions. A model may perform well overall but poorly on specific subgroups.
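The result-tracking tip above can be sketched as a small helper that writes each run to a timestamped, model-tagged JSON file; the directory and field names here are illustrative.

```python
# Sketch of systematic result tracking: one JSON file per evaluation run,
# named eval_<date>_<model-tag>.json in a version-controlled directory.
import json
from datetime import date
from pathlib import Path

def save_eval_result(results: dict, model_tag: str, out_dir: str = "eval-results") -> Path:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"eval_{date.today().isoformat()}_{model_tag}.json"
    # sort_keys keeps diffs stable when the file is committed to git
    path.write_text(json.dumps(results, indent=2, sort_keys=True))
    return path

path = save_eval_result({"benchmark": "MMLU-Pro", "accuracy": 0.71}, "model-v1")
print(path)  # e.g. eval-results/eval_2025-04-08_model-v1.json
```

Committing these files alongside the configuration that produced them makes regressions in later labs (especially Lab 8) easy to detect with a plain diff.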
Related Resources
Return to the main LLM Evaluation Framework
- Core Evaluation Methods: Deep-dive into evaluation methodologies and best practices
- Tooling & Infrastructure: Tools, frameworks, and infrastructure for running evaluations at scale
- Accuracy Pillar: Learn about measuring correctness, factuality, and hallucinations
- Robustness Pillar: Evaluate model resilience against adversarial inputs and edge cases
- Other Pillars: Explore Efficiency, Fairness, Interpretability, and more