Back to Skills

evaluating-llms-harness

zechenzhangAGI
Updated Today
122 views
62
2
62
View on GitHub
Testingaitestingapi

About

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.

Documentation

lm-evaluation-harness - LLM Benchmarking

Quick start

lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.

Installation:

pip install lm-eval

Evaluate any HuggingFace model:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag \
  --device cuda:0 \
  --batch_size 8

View available tasks:

lm_eval --tasks list

Common workflows

Workflow 1: Standard benchmark evaluation

Evaluate model on core benchmarks (MMLU, GSM8K, HumanEval).

Copy this checklist:

Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results

Step 1: Choose benchmark suite

Core reasoning benchmarks:

  • MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice
  • GSM8K - Grade school math word problems
  • HellaSwag - Common sense reasoning
  • TruthfulQA - Truthfulness and factuality
  • ARC (AI2 Reasoning Challenge) - Science questions

Code benchmarks:

  • HumanEval - Python code generation (164 problems)
  • MBPP (Mostly Basic Python Problems) - Python coding

Standard suite (recommended for model releases):

--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge

Step 2: Configure model

HuggingFace model:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size auto  # Auto-detect optimal batch size

Quantized model (4-bit/8-bit):

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
  --tasks mmlu \
  --device cuda:0

Custom checkpoint:

lm_eval --model hf \
  --model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
  --tasks mmlu \
  --device cuda:0

Step 3: Run evaluation

# Full MMLU evaluation (57 subjects)
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --num_fewshot 5 \  # 5-shot evaluation (standard)
  --batch_size 8 \
  --output_path results/ \
  --log_samples  # Save individual predictions

# Multiple benchmarks at once
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
  --num_fewshot 5 \
  --batch_size 8 \
  --output_path results/llama2-7b-eval.json

Step 4: Analyze results

Results saved to results/llama2-7b-eval.json:

{
  "results": {
    "mmlu": {
      "acc": 0.459,
      "acc_stderr": 0.004
    },
    "gsm8k": {
      "exact_match": 0.142,
      "exact_match_stderr": 0.006
    },
    "hellaswag": {
      "acc_norm": 0.765,
      "acc_norm_stderr": 0.004
    }
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Llama-2-7b-hf",
    "num_fewshot": 5
  }
}

Workflow 2: Track training progress

Evaluate checkpoints during training.

Training Progress Tracking:
- [ ] Step 1: Set up periodic evaluation
- [ ] Step 2: Choose quick benchmarks
- [ ] Step 3: Automate evaluation
- [ ] Step 4: Plot learning curves

Step 1: Set up periodic evaluation

Evaluate every N training steps:

#!/bin/bash
# eval_checkpoint.sh

CHECKPOINT_DIR=$1
STEP=$2

lm_eval --model hf \
  --model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP \
  --tasks gsm8k,hellaswag \
  --num_fewshot 0 \  # 0-shot for speed
  --batch_size 16 \
  --output_path results/step-$STEP.json

Step 2: Choose quick benchmarks

Fast benchmarks for frequent evaluation:

  • HellaSwag: ~10 minutes on 1 GPU
  • GSM8K: ~5 minutes
  • PIQA: ~2 minutes

Avoid for frequent eval (too slow):

  • MMLU: ~2 hours (57 subjects)
  • HumanEval: Requires code execution

Step 3: Automate evaluation

Integrate with training script:

# In training loop
if step % eval_interval == 0:
    model.save_pretrained(f"checkpoints/step-{step}")

    # Run evaluation
    os.system(f"./eval_checkpoint.sh checkpoints step-{step}")

Or use PyTorch Lightning callbacks:

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        step = trainer.global_step
        checkpoint_path = f"checkpoints/step-{step}"

        # Save checkpoint
        trainer.save_checkpoint(checkpoint_path)

        # Run lm-eval
        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")

Step 4: Plot learning curves

import json
import matplotlib.pyplot as plt

# Load all results
steps = []
mmlu_scores = []

for file in sorted(glob.glob("results/step-*.json")):
    with open(file) as f:
        data = json.load(f)
        step = int(file.split("-")[1].split(".")[0])
        steps.append(step)
        mmlu_scores.append(data["results"]["mmlu"]["acc"])

# Plot
plt.plot(steps, mmlu_scores)
plt.xlabel("Training Step")
plt.ylabel("MMLU Accuracy")
plt.title("Training Progress")
plt.savefig("training_curve.png")

Workflow 3: Compare multiple models

Benchmark suite for model comparison.

Model Comparison:
- [ ] Step 1: Define model list
- [ ] Step 2: Run evaluations
- [ ] Step 3: Generate comparison table

Step 1: Define model list

# models.txt
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2

Step 2: Run evaluations

#!/bin/bash
# eval_all_models.sh

TASKS="mmlu,gsm8k,hellaswag,truthfulqa"

while read model; do
    echo "Evaluating $model"

    # Extract model name for output file
    model_name=$(echo $model | sed 's/\//-/g')

    lm_eval --model hf \
      --model_args pretrained=$model,dtype=bfloat16 \
      --tasks $TASKS \
      --num_fewshot 5 \
      --batch_size auto \
      --output_path results/$model_name.json

done < models.txt

Step 3: Generate comparison table

import json
import pandas as pd

models = [
    "meta-llama-Llama-2-7b-hf",
    "meta-llama-Llama-2-13b-hf",
    "mistralai-Mistral-7B-v0.1",
    "microsoft-phi-2"
]

tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]

results = []
for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
        row = {"Model": model.replace("-", "/")}
        for task in tasks:
            # Get primary metric for each task
            metrics = data["results"][task]
            if "acc" in metrics:
                row[task.upper()] = f"{metrics['acc']:.3f}"
            elif "exact_match" in metrics:
                row[task.upper()] = f"{metrics['exact_match']:.3f}"
        results.append(row)

df = pd.DataFrame(results)
print(df.to_markdown(index=False))

Output:

| Model                  | MMLU  | GSM8K | HELLASWAG | TRUTHFULQA |
|------------------------|-------|-------|-----------|------------|
| meta-llama/Llama-2-7b  | 0.459 | 0.142 | 0.765     | 0.391      |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801     | 0.430      |
| mistralai/Mistral-7B   | 0.626 | 0.395 | 0.812     | 0.428      |
| microsoft/phi-2        | 0.560 | 0.613 | 0.682     | 0.447      |

Workflow 4: Evaluate with vLLM (faster inference)

Use vLLM backend for 5-10x faster evaluation.

vLLM Evaluation:
- [ ] Step 1: Install vLLM
- [ ] Step 2: Configure vLLM backend
- [ ] Step 3: Run evaluation

Step 1: Install vLLM

pip install vllm

Step 2: Configure vLLM backend

lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
  --tasks mmlu \
  --batch_size auto

Step 3: Run evaluation

vLLM is 5-10× faster than standard HuggingFace:

# Standard HF: ~2 hours for MMLU on 7B model
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --batch_size 8

# vLLM: ~15-20 minutes for MMLU on 7B model
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
  --tasks mmlu \
  --batch_size auto

When to use vs alternatives

Use lm-evaluation-harness when:

  • Benchmarking models for academic papers
  • Comparing model quality across standard tasks
  • Tracking training progress
  • Reporting standardized metrics (everyone uses same prompts)
  • Need reproducible evaluation

Use alternatives instead:

  • HELM (Stanford): Broader evaluation (fairness, efficiency, calibration)
  • AlpacaEval: Instruction-following evaluation with LLM judges
  • MT-Bench: Conversational multi-turn evaluation
  • Custom scripts: Domain-specific evaluation

Common issues

Issue: Evaluation too slow

Use vLLM backend:

lm_eval --model vllm \
  --model_args pretrained=model-name,tensor_parallel_size=2

Or reduce fewshot examples:

--num_fewshot 0  # Instead of 5

Or evaluate subset of MMLU:

--tasks mmlu_stem  # Only STEM subjects

Issue: Out of memory

Reduce batch size:

--batch_size 1  # Or --batch_size auto

Use quantization:

--model_args pretrained=model-name,load_in_8bit=True

Enable CPU offloading:

--model_args pretrained=model-name,device_map=auto,offload_folder=offload

Issue: Different results than reported

Check fewshot count:

--num_fewshot 5  # Most papers use 5-shot

Check exact task name:

--tasks mmlu  # Not mmlu_direct or mmlu_fewshot

Verify model and tokenizer match:

--model_args pretrained=model-name,tokenizer=same-model-name

Issue: HumanEval not executing code

Install execution dependencies:

pip install human-eval

Enable code execution:

lm_eval --model hf \
  --model_args pretrained=model-name \
  --tasks humaneval \
  --allow_code_execution  # Required for HumanEval

Advanced topics

Benchmark descriptions: See references/benchmark-guide.md for detailed description of all 60+ tasks, what they measure, and interpretation.

Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.

API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models.

Multi-GPU strategies: See references/distributed-eval.md for data parallel and tensor parallel evaluation.

Hardware requirements

  • GPU: NVIDIA (CUDA 11.8+), works on CPU (very slow)
  • VRAM:
    • 7B model: 16GB (bf16) or 8GB (8-bit)
    • 13B model: 28GB (bf16) or 14GB (8-bit)
    • 70B model: Requires multi-GPU or quantization
  • Time (7B model, single A100):
    • HellaSwag: 10 minutes
    • GSM8K: 5 minutes
    • MMLU (full): 2 hours
    • HumanEval: 20 minutes

Resources

Quick Install

/plugin add https://github.com/zechenzhangAGI/AI-research-SKILLs/tree/main/lm-evaluation-harness

Copy and paste this command in Claude Code to install this skill

GitHub 仓库

zechenzhangAGI/AI-research-SKILLs
Path: 11-evaluation/lm-evaluation-harness
aiai-researchclaudeclaude-codeclaude-skillscodex

Related Skills

llamaguard

Other

LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.

View skill

sglang

Meta

SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.

View skill

langchain

Meta

LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.

View skill

go-test

Meta

The go-test skill provides expertise in Go's standard testing package and best practices. It helps developers implement table-driven tests, subtests, benchmarks, and coverage strategies while following Go conventions. Use it when writing test files, creating mocks, detecting race conditions, or organizing integration tests in Go projects.

View skill