Hands-On Labs

9 progressive labs, from basic benchmarking to advanced evaluation systems

Lab Overview

The Hands-On Labs pillar provides a progressive learning path from beginner to advanced evaluation techniques. Each lab builds on previous knowledge and introduces new tools, concepts, and frameworks for evaluating language models.

Progressive Learning Path

The labs are organized by difficulty level, progressing from beginner to advanced.

Prerequisites

What You'll Build

By completing all 9 labs, you'll have hands-on experience with benchmark harnesses, LLM-as-judge pipelines, RAG evaluation, custom domain metrics, statistical analysis, safety red-teaming, agentic evaluation, CI/CD automation, and Pareto-based model selection.

Lab Directory

Each lab includes step-by-step instructions, code examples, and practical exercises. All labs use Jupyter notebooks and have estimated completion times.

Lab 1: Basic Benchmark Suite

Install lm-evaluation-harness, run MMLU-Pro and IFEval on open-weight models, and interpret results. Create your first evaluation baseline.

Skills: Framework setup, benchmark configuration, result interpretation
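To make "result interpretation" concrete, here is a minimal sketch of turning per-benchmark scores into a baseline summary. The `results` structure and `acc` key are illustrative assumptions, not lm-evaluation-harness's exact output schema:

```python
# A minimal sketch of summarizing benchmark results into a baseline report.
# The input structure below is illustrative, not the harness's exact schema.

def summarize_results(results: dict) -> dict:
    """Compute per-task accuracy and a macro average for a baseline report."""
    accuracies = {task: scores["acc"] for task, scores in results.items()}
    macro_avg = sum(accuracies.values()) / len(accuracies)
    return {"per_task": accuracies, "macro_avg": macro_avg}

example = {
    "mmlu_pro": {"acc": 0.41},  # hypothetical scores for illustration
    "ifeval": {"acc": 0.63},
}
baseline = summarize_results(example)
print(f"Macro average accuracy: {baseline['macro_avg']:.2f}")  # 0.52
```

Saving a summary like this for your first run gives you the baseline that later labs compare against.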

Lab 2: LLM-as-Judge

Design evaluation rubrics, implement pointwise LLM judges using LiteLLM, and measure inter-rater agreement. Calibrate judges for better accuracy.

Skills: Rubric design, LLM API integration, Cohen's kappa, calibration
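As a taste of the agreement measurement in this lab, here is a self-contained implementation of Cohen's kappa between a human rater and an LLM judge (the labels are illustrative):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labeled at random with their own base rates
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical human vs. LLM-judge verdicts on six responses
human = ["pass", "pass", "fail", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail"]
print(round(cohens_kappa(human, judge), 3))  # 0.667
```

Kappa above roughly 0.6 is commonly read as substantial agreement; lower values are a signal to refine the rubric.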

Lab 3: RAG Evaluation

Build and evaluate retrieval-augmented generation systems. Measure context relevance, faithfulness, and answer quality using RAGAS framework.

Skills: RAG pipelines, retrieval metrics, faithfulness evaluation, semantic search
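To illustrate the faithfulness idea, here is a deliberately crude lexical proxy: the fraction of answer tokens grounded in the retrieved context. This is not RAGAS's actual metric (RAGAS decomposes the answer into claims and verifies each with an LLM), but it shows the shape of the measurement:

```python
def lexical_faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens that appear in the retrieved context.
    A crude lexical proxy; RAGAS instead uses LLM-based claim verification."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

context = "the eiffel tower is in paris and opened in 1889"
answer = "the eiffel tower opened in 1889"
print(lexical_faithfulness(answer, context))  # 1.0
```

An answer introducing facts absent from the context scores below 1.0, flagging a potential hallucination.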

Lab 4: Custom Domain Evaluation

Create evaluation suites for specialized domains (legal, medical, technical). Build domain-specific metrics and benchmarks from scratch.

Skills: Domain analysis, custom metric design, benchmark creation, data annotation
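A custom domain metric can be as simple as checking that required domain elements appear in the response. The function and data below are hypothetical, just to show the pattern of a from-scratch metric:

```python
import re

def citation_coverage(response: str, required_terms: list[str]) -> float:
    """Hypothetical domain metric: fraction of required domain terms
    (e.g., statute names in a legal task) present in the model response."""
    hits = sum(
        1 for term in required_terms
        if re.search(re.escape(term), response, re.IGNORECASE)
    )
    return hits / len(required_terms)

response = "Under GDPR Article 17, the data subject may request erasure."
print(citation_coverage(response, ["GDPR", "Article 17", "erasure"]))  # 1.0
```

In the lab you would pair metrics like this with expert-annotated reference data rather than a hand-picked term list.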

Lab 5: Statistical Analysis

Perform significance testing, confidence intervals, and power analysis. Compare model performance rigorously with statistical backing.

Skills: Hypothesis testing, t-tests, bootstrap methods, effect sizes, sample sizing
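As a preview of the bootstrap methods, here is a percentile-bootstrap confidence interval for the accuracy difference between two models scored on the same items (the per-item scores are toy data):

```python
import random

def bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (A - B)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        # Resample item indices with replacement and recompute the mean difference
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy per-item correctness (1 = correct) for two models on the same items
a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
b = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
low, high = bootstrap_ci(a, b)
print(f"95% CI for accuracy difference: [{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the difference is unlikely to be resampling noise; with only 10 items it usually will not, which is exactly the sample-sizing lesson of this lab.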

Lab 6: Safety Red-Teaming

Design adversarial prompts to identify model vulnerabilities. Evaluate robustness against jailbreaks, toxic inputs, and edge cases.

Skills: Adversarial testing, attack design, vulnerability identification, safety metrics
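One basic safety metric is attack success rate. The keyword-based refusal detector below is a deliberately naive sketch (real red-teaming pairs automated scoring with human or LLM review); the marker list and responses are illustrative:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def attack_success_rate(responses: list[str]) -> float:
    """Crude proxy: an adversarial prompt 'succeeds' when the model
    does not refuse. Keyword matching misses subtle compliance/refusals."""
    refused = sum(
        any(marker in r.lower() for marker in REFUSAL_MARKERS)
        for r in responses
    )
    return 1 - refused / len(responses)

responses = [
    "I can't help with that request.",
    "Sure, here is the information you asked for...",
    "I cannot assist with this.",
]
print(round(attack_success_rate(responses), 2))  # 0.33
```

A rising attack success rate across model versions is the kind of regression signal this lab teaches you to track.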

Lab 7: Agentic Evaluation

Evaluate complex systems with multi-step reasoning and tool use. Assess trajectory quality, step-level accuracy, and long-horizon planning.

Skills: Agent evaluation frameworks, trajectory analysis, tool integration, planning metrics
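Step-level accuracy can be sketched as a position-wise comparison of the agent's tool calls against a reference trajectory. The tool names below are hypothetical, and real frameworks also score call arguments and outcomes:

```python
def step_accuracy(expected_steps: list[str], actual_steps: list[str]) -> float:
    """Fraction of expected tool calls the agent executed in order.
    Position-wise match only; argument and outcome scoring are omitted."""
    matches = sum(e == a for e, a in zip(expected_steps, actual_steps))
    return matches / len(expected_steps)

# Hypothetical reference trajectory vs. the agent's actual trajectory
expected = ["search_flights", "check_availability", "book_ticket"]
actual = ["search_flights", "book_ticket", "check_availability"]
print(round(step_accuracy(expected, actual), 2))  # 0.33
```

Note how an agent can reach the right end state while scoring poorly step-wise; the lab covers when each view matters.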

Lab 8: CI/CD Integration

Automate evaluation in continuous integration pipelines. Set up guardrails, performance monitoring, and regression detection in production.

Skills: CI/CD setup, automated testing, monitoring, alerting, regression detection
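The core of regression detection is a gate that compares fresh evaluation scores against stored baselines and fails the pipeline on a drop beyond tolerance. The baseline values and tolerance below are illustrative:

```python
import sys

BASELINE = {"mmlu_pro": 0.41, "ifeval": 0.63}  # hypothetical stored baselines
TOLERANCE = 0.02  # allowed score drop before the pipeline fails

def regression_gate(current: dict, baseline: dict = BASELINE,
                    tol: float = TOLERANCE) -> list[str]:
    """Return the tasks whose score dropped more than `tol` below baseline."""
    return [t for t, score in current.items()
            if baseline.get(t, 0) - score > tol]

failures = regression_gate({"mmlu_pro": 0.42, "ifeval": 0.58})
if failures:
    print(f"Regression detected on: {failures}")
    # In CI, exit nonzero so the pipeline blocks the change:
    # sys.exit(1)
```

Wiring this into a CI step turns evaluation from a manual report into an automated guardrail.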

Lab 9: Pareto Model Selection

Analyze trade-offs between accuracy, cost, and latency. Use Pareto frontier analysis to select optimal models for different use cases.

Skills: Multi-objective optimization, Pareto analysis, cost-benefit modeling, interactive visualization
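The Pareto frontier over accuracy and cost can be computed directly: a model is on the frontier if no other model is at least as good on both objectives and strictly better on one. The model entries below are made up for illustration:

```python
def pareto_frontier(models: list[dict]) -> list[str]:
    """Names of models not dominated on (higher accuracy, lower cost)."""
    frontier = []
    for m in models:
        dominated = any(
            o["accuracy"] >= m["accuracy"] and o["cost"] <= m["cost"]
            and (o["accuracy"] > m["accuracy"] or o["cost"] < m["cost"])
            for o in models
        )
        if not dominated:
            frontier.append(m["name"])
    return frontier

# Hypothetical models with accuracy and cost-per-1k-requests
models = [
    {"name": "small", "accuracy": 0.70, "cost": 0.1},
    {"name": "medium", "accuracy": 0.78, "cost": 0.5},
    {"name": "large", "accuracy": 0.85, "cost": 2.0},
    {"name": "legacy", "accuracy": 0.72, "cost": 0.8},
]
print(pareto_frontier(models))  # ['small', 'medium', 'large']
```

Here "legacy" is dominated by "medium" (higher accuracy at lower cost), so it can be eliminated for any use case; the lab extends this to latency as a third objective.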

Getting Started

Setup Instructions

All labs assume you have a working Python environment with Jupyter notebooks.

  1. Create a dedicated workspace:
    • mkdir -p ~/llm-eval-labs && cd ~/llm-eval-labs
  2. Clone the evaluation framework repository:
    • git clone https://github.com/your-org/llm-evaluation-framework.git
    • cd llm-evaluation-framework
  3. Create a Python virtual environment:
    • python3.11 -m venv venv
    • source venv/bin/activate # On Windows: venv\Scripts\activate
  4. Install dependencies:
    • pip install --upgrade pip
    • pip install -r requirements-labs.txt
  5. Configure your API credentials (if using closed-source models):
    • Create a .env file in the labs directory
    • Add your API keys: OPENAI_API_KEY=sk-..., ANTHROPIC_API_KEY=sk-ant-..., etc.
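If you prefer not to add a dependency, the .env convention above can be read with a few lines of standard-library Python. This is a simple sketch that handles only plain KEY=VALUE lines; the python-dotenv package handles quoting and edge cases:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE per line, '#' comments ignored.
    Existing environment variables are not overwritten."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Call `load_env()` at the top of a notebook before constructing any API clients so the keys are in `os.environ` when needed.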

Jupyter Environment

Each lab includes a Jupyter notebook for interactive learning and experimentation:

  1. Start Jupyter from the labs directory:
    • jupyter notebook
  2. Navigate to the labs/ directory
  3. Open the notebook for your lab (e.g., lab-01-basic-benchmark.ipynb)
  4. Follow the guided steps and run cells sequentially

Prerequisites by Lab

Practical Tips

  • Start with Lab 1: Even if you're experienced, Lab 1 establishes the evaluation framework and concepts used in all subsequent labs. Skipping it may cause confusion.
  • Run on Representative Data: Don't just use the provided examples. Find or create test cases that reflect your actual use case. Evaluation results on toy data often don't generalize.
  • Track Results Systematically: Save all evaluation outputs, model configurations, and benchmark results to a version-controlled directory. Use consistent naming conventions (e.g., eval_2025-04-08_model-v1.json).
  • Use Appropriate Compute: Some labs benefit from GPU acceleration. Lab 1 can run on CPU but will be much faster with a GPU. Labs 6-9 may require significant compute; plan accordingly or use cloud services.
  • Iterate on Rubrics: In Lab 2, your initial rubric won't be perfect. Run evaluation, analyze disagreements between human and LLM judges, and refine. Iteration is key.
  • Document Decisions: For Labs 4 and 9, document why you chose specific metrics, models, or thresholds. Future-you (and your team) will appreciate the reasoning.
  • Validate on Out-of-Distribution Data: Always hold out a test set that's different from your training/evaluation data. Evaluation metrics that look good on in-distribution data may not transfer.
  • Control for Confounding Variables: When comparing models, ensure all variables except the one you're testing are controlled. For example, when testing prompt engineering, use identical models and only vary prompts.
  • Engage Domain Experts Early: Especially in Labs 4 and 6, involve domain experts in rubric design and adversarial prompt creation. Their insights are invaluable.
  • Monitor for Bias: In Labs 5 and 9, disaggregate results by demographic groups, topics, or other relevant dimensions. A model may perform well overall but poorly on specific subgroups.

Related Resources