Hands-On Labs
9 progressive labs, from basic benchmarking to advanced evaluation systems
Lab Overview
The Hands-On Labs pillar provides a progressive learning path from beginner to advanced evaluation techniques. Each lab builds on previous knowledge and introduces new tools, concepts, and frameworks for evaluating language models.
Progressive Learning Path
The labs are organized by difficulty level:
- Green (Beginner): Labs 1-2. Foundational skills with standard benchmarking and LLM-as-judge evaluation.
- Yellow (Intermediate): Labs 3-6. Domain-specific evaluation, RAG evaluation, and statistical analysis.
- Red (Advanced): Labs 7-9. Agentic evaluation, CI/CD integration, and Pareto-optimal model selection.
Prerequisites
- Python 3.11 or higher
- Basic familiarity with command line and Git
- Understanding of machine learning evaluation concepts (though not required for Lab 1)
- API keys for at least one LLM provider (OpenAI, Anthropic, etc.)
- Jupyter notebook environment (recommended for hands-on work)
What You'll Build
By completing all 9 labs, you'll have hands-on experience with:
- Running comprehensive benchmark suites on open-weight and closed models
- Building custom evaluation rubrics and LLM-as-judge systems
- Evaluating RAG systems and domain-specific applications
- Performing statistical analysis and significance testing
- Red-teaming models for safety and robustness
- Evaluating agentic and complex reasoning systems
- Integrating evaluation into CI/CD pipelines
- Making data-driven model selection decisions using Pareto analysis
Lab Directory
Each lab includes step-by-step instructions, code examples, and practical exercises. All labs use Jupyter notebooks and have estimated completion times.
- Lab 1: Install lm-evaluation-harness, run MMLU-Pro and IFEval on open-weight models, and interpret results. Create your first evaluation baseline.
- Lab 2: Design evaluation rubrics, implement pointwise LLM judges using LiteLLM, and measure inter-rater agreement. Calibrate judges for better accuracy.
- Lab 3: Build and evaluate retrieval-augmented generation systems. Measure context relevance, faithfulness, and answer quality using the RAGAS framework.
- Lab 4: Create evaluation suites for specialized domains (legal, medical, technical). Build domain-specific metrics and benchmarks from scratch.
- Lab 5: Perform significance testing, confidence intervals, and power analysis. Compare model performance rigorously with statistical backing.
- Lab 6: Design adversarial prompts to identify model vulnerabilities. Evaluate robustness against jailbreaks, toxic inputs, and edge cases.
- Lab 7: Evaluate complex systems with multi-step reasoning and tool use. Assess trajectory quality, step-level accuracy, and long-horizon planning.
- Lab 8: Automate evaluation in continuous integration pipelines. Set up guardrails, performance monitoring, and regression detection in production.
- Lab 9: Analyze trade-offs between accuracy, cost, and latency. Use Pareto frontier analysis to select optimal models for different use cases.
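The Pareto analysis in Lab 9 can be sketched in a few lines: a model is kept only if no other model beats it on every axis at once. The candidate names and numbers below are illustrative, not real benchmark results.

```python
# Sketch of Pareto-optimal model selection (Lab 9 concept).
# Higher accuracy is better; lower cost and latency are better.

def pareto_frontier(models):
    """Return the models not dominated on (accuracy, cost, latency)."""
    def dominates(a, b):
        # a dominates b if it is at least as good on every axis
        # and strictly better on at least one.
        better_or_equal = (a["accuracy"] >= b["accuracy"]
                           and a["cost"] <= b["cost"]
                           and a["latency"] <= b["latency"])
        strictly_better = (a["accuracy"] > b["accuracy"]
                           or a["cost"] < b["cost"]
                           or a["latency"] < b["latency"])
        return better_or_equal and strictly_better

    return [m for m in models
            if not any(dominates(other, m) for other in models if other is not m)]

# Hypothetical candidates: accuracy (fraction), cost ($/1M tokens), latency (s)
candidates = [
    {"name": "model-a", "accuracy": 0.82, "cost": 15.0, "latency": 2.1},
    {"name": "model-b", "accuracy": 0.78, "cost": 3.0,  "latency": 0.9},
    {"name": "model-c", "accuracy": 0.75, "cost": 4.0,  "latency": 1.2},
]

frontier = pareto_frontier(candidates)
print([m["name"] for m in frontier])  # ['model-a', 'model-b']
```

Here model-c is dominated by model-b (worse on all three axes), while model-a and model-b represent a genuine accuracy-versus-cost trade-off and both stay on the frontier.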
Getting Started
Setup Instructions
All labs assume you have a working Python environment with Jupyter notebooks.
- Create a dedicated workspace:
mkdir -p ~/llm-eval-labs && cd ~/llm-eval-labs
- Clone the evaluation framework repository:
git clone https://github.com/your-org/llm-evaluation-framework.git
cd llm-evaluation-framework
- Create a Python virtual environment:
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install --upgrade pip
pip install -r requirements-labs.txt
- Configure your API credentials (if using closed-source models):
- Create a .env file in the labs directory
- Add your API keys: OPENAI_API_KEY=sk-..., ANTHROPIC_API_KEY=sk-ant-..., etc.
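Inside a notebook, the keys from the .env file need to end up in the process environment. A minimal stdlib-only loader is sketched below, assuming simple KEY=value lines; in practice the python-dotenv package (its load_dotenv function) does the same job.

```python
# Minimal .env loader (stdlib only); python-dotenv's load_dotenv is the
# usual production choice. Assumes plain KEY=value lines.
import os

def load_env(path=".env"):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()

api_key = os.getenv("OPENAI_API_KEY")  # None if the key was never set
```

setdefault is used deliberately so that keys already exported in your shell take precedence over the file.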
Jupyter Environment
Each lab includes a Jupyter notebook for interactive learning and experimentation:
- Start Jupyter from the labs directory:
jupyter notebook
- Navigate to the labs/ directory
- Open the notebook for your lab (e.g., lab-01-basic-benchmark.ipynb)
- Follow the guided steps and run cells sequentially
Prerequisites by Lab
- Labs 1-2: Python 3.11+, pip, virtualenv. API key for at least one LLM provider.
- Lab 3: Completion of Labs 1-2. Vector database (Chroma, Pinecone, or similar).
- Lab 4: Domain expertise or a domain-specific dataset. Access to subject matter experts for annotation.
- Lab 5: Completion of Labs 1-3. Basic statistics knowledge (distributions, p-values).
- Lab 6: Completion of Labs 1-2. Optional: security/ML background for designing effective adversarial prompts.
- Lab 7: Completion of Labs 1-3. Access to a tool-use or agentic framework (LangChain, AutoGen, etc.).
- Lab 8: Completion of Labs 1-5. Familiarity with CI/CD tools (GitHub Actions, GitLab CI, etc.).
- Lab 9: Completion of Labs 1-5. Understanding of multi-objective optimization concepts.
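The significance testing that Lab 5 builds toward can be sketched with a paired bootstrap: resample test items (with replacement) and look at the distribution of the accuracy difference between two models scored on the same items. The per-example scores below are illustrative, not real benchmark data.

```python
# Paired bootstrap sketch for Lab 5: is model B's accuracy gain over
# model A significant? scores_a and scores_b are 0/1 correctness on the
# SAME test items, in the same order.
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000, seed=0, alpha=0.05):
    """Percentile CI for mean(scores_b) - mean(scores_a) via paired bootstrap."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items, keeping pairs
        diffs.append(sum(scores_b[i] - scores_a[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Illustrative scores: model A ~60% accurate, model B ~80%, 200 items
a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 20
b = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0] * 20
lo, hi = bootstrap_diff_ci(a, b)
print(f"95% CI for accuracy difference: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes 0, the difference is significant at the chosen level. Pairing by item matters: it removes per-item difficulty as a source of noise, which an unpaired comparison would not.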
Practical Tips
- Start with Lab 1: Even if you're experienced, Lab 1 establishes the evaluation framework and concepts used in all subsequent labs. Skipping it may cause confusion.
- Run on Representative Data: Don't just use the provided examples. Find or create test cases that reflect your actual use case. Evaluation results on toy data often don't generalize.
- Track Results Systematically: Save all evaluation outputs, model configurations, and benchmark results to a version-controlled directory. Use consistent naming conventions (e.g., eval_2025-04-08_model-v1.json).
- Use Appropriate Compute: Some labs benefit from GPU acceleration. Lab 1 can run on CPU but will be much faster with a GPU. Labs 6-9 may require significant compute, so plan accordingly or use cloud services.
- Iterate on Rubrics: In Lab 2, your initial rubric won't be perfect. Run evaluation, analyze disagreements between human and LLM judges, and refine. Iteration is key.
- Document Decisions: For Labs 4 and 9, document why you chose specific metrics, models, or thresholds. Future-you (and your team) will appreciate the reasoning.
- Validate on Out-of-Distribution Data: Always hold out a test set that's different from your training/evaluation data. Evaluation metrics that look good on in-distribution data may not transfer.
- Control for Confounding Variables: When comparing models, ensure all variables except the one you're testing are controlled. For example, when testing prompt engineering, use identical models and only vary prompts.
- Engage Domain Experts Early: Especially in Labs 4 and 6, involve domain experts in rubric design and adversarial prompt creation. Their insights are invaluable.
- Monitor for Bias: In Labs 5 and 9, disaggregate results by demographic groups, topics, or other relevant dimensions. A model may perform well overall but poorly on specific subgroups.
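The result-tracking tip above can be sketched as a small helper that writes each run to a timestamped, model-tagged JSON file; the directory and field names here are illustrative.

```python
# Sketch of systematic result tracking: one JSON file per evaluation run,
# named eval_<date>_<model-tag>.json in a version-controlled directory.
import json
from datetime import date
from pathlib import Path

def save_eval_result(results: dict, model_tag: str, out_dir: str = "eval-results") -> Path:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"eval_{date.today().isoformat()}_{model_tag}.json"
    # sort_keys keeps diffs stable when the file is committed to git
    path.write_text(json.dumps(results, indent=2, sort_keys=True))
    return path

path = save_eval_result({"benchmark": "MMLU-Pro", "accuracy": 0.71}, "model-v1")
print(path)  # e.g. eval-results/eval_2025-04-08_model-v1.json
```

Committing these files alongside the configuration that produced them makes regressions in later labs (especially Lab 8) easy to detect with a plain diff.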
Related Resources
Return to the main LLM Evaluation Framework
- Core Evaluation Methods: Deep-dive into evaluation methodologies and best practices
- Tooling & Infrastructure: Tools, frameworks, and infrastructure for running evaluations at scale
- Accuracy Pillar: Learn about measuring correctness, factuality, and hallucinations
- Robustness Pillar: Evaluate model resilience against adversarial inputs and edge cases
- Other Pillars: Explore Efficiency, Fairness, Interpretability, and more