evaluating-code-models

davila7
Meta, Evaluation, Code Generation, HumanEval, MBPP, MultiPL-E, Pass@k, BigCode, Benchmarking, Code Models

About

This skill provides standardized benchmarking for code generation models using the BigCode Evaluation Harness. It evaluates models across major benchmarks like HumanEval and MBPP with pass@k metrics, supporting multi-language testing. Use it to objectively compare coding abilities and measure generation quality during model assessment.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/davila7/claude-code-templates
Git Clone (Alternative)
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/evaluating-code-models

Copy and paste one of these commands into Claude Code to install this skill.

Documentation

BigCode Evaluation Harness - Code Model Benchmarking

Quick Start

BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).

Installation:

git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config

Evaluate on HumanEval:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations

View available tasks:

python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
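
The full task list is long; a small Python sketch like the one below filters it, assuming ALL_TASKS is an iterable of task-name strings as the one-liner above suggests (the script name and filter are illustrative, not part of the harness):

# list_multiple_tasks.py - filter the task registry (illustrative helper, not harness code)
# Assumes ALL_TASKS yields task-name strings such as "humaneval" or "multiple-py".
from bigcode_eval.tasks import ALL_TASKS

multipl_e = sorted(t for t in ALL_TASKS if t.startswith("multiple-"))
print(f"{len(multipl_e)} MultiPL-E tasks:")
for name in multipl_e:
    print("  " + name)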

Common Workflows

Workflow 1: Standard Code Benchmark Evaluation

Evaluate a model on the core code benchmarks (HumanEval, MBPP, HumanEval+).

Checklist:

Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results

Step 1: Choose benchmark suite

Python code generation (most common):

  • HumanEval: 164 handwritten problems, function completion
  • HumanEval+: Same 164 problems with 80× more tests (stricter)
  • MBPP: 500 crowd-sourced problems, entry-level difficulty
  • MBPP+: 399 curated problems with 35× more tests

Multi-language (18 languages):

  • MultiPL-E: HumanEval/MBPP translated to C++, Java, JavaScript, Go, Rust, etc.

Advanced:

  • APPS: 10,000 problems (introductory/interview/competition)
  • DS-1000: 1,000 data science problems across 7 libraries

Step 2: Configure model and generation

# Standard HuggingFace model
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution

# Quantized model (4-bit)
accelerate launch main.py \
  --model codellama/CodeLlama-34b-hf \
  --tasks humaneval \
  --load_in_4bit \
  --max_length_generation 512 \
  --allow_code_execution

# Custom/private model
accelerate launch main.py \
  --model /path/to/my-code-model \
  --tasks humaneval \
  --trust_remote_code \
  --use_auth_token \
  --allow_code_execution

Step 3: Run evaluation

# Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.8 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path results/starcoder2-humaneval.json

Step 4: Analyze results

Results in results/starcoder2-humaneval.json:

{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}
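
The reported pass@k values use the unbiased estimator popularized by the HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes. A minimal sketch of that formula for a single problem (shown for intuition; not the harness's own implementation):

# pass_at_k.py - unbiased pass@k estimator for a single problem (illustrative sketch)
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated, c = samples that passed, k = evaluation budget."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: every k-subset contains a pass
    # 1 - C(n - c, k) / C(n, k): probability a random k-subset has at least one pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: one problem where 10 of 200 generated samples passed its unit tests
for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(200, 10, k):.3f}")

The benchmark-level numbers in the JSON are this per-problem estimate averaged over all 164 HumanEval problems.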

Workflow 2: Multi-Language Evaluation (MultiPL-E)

Evaluate code generation across 18 programming languages.

Checklist:

Multi-Language Evaluation:
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages

Step 1: Generate solutions on host

# Generate without execution (safe)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --max_length_generation 650 \
  --temperature 0.8 \
  --n_samples 50 \
  --batch_size 50 \
  --generation_only \
  --save_generations \
  --save_generations_path generations_multi.json

Step 2: Evaluate in Docker container

# Pull the MultiPL-E Docker image
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

# Run evaluation inside container
docker run -v $(pwd)/generations_multi.json:/app/generations.json:ro \
  -it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --load_generations_path /app/generations.json \
  --allow_code_execution \
  --n_samples 50

Supported languages: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
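
If you also pass --metric_output_path (as in Workflow 1) so the per-task scores land in a single JSON with the structure shown earlier, a short sketch like this compares languages side by side (the file path is an assumption, not a harness default):

# compare_languages.py - summarize MultiPL-E pass@1 per language (illustrative sketch)
# Assumes a metrics JSON shaped like the Workflow 1 example:
# {"multiple-py": {"pass@1": ...}, "multiple-js": {...}, ..., "config": {...}}
import json

with open("results/starcoder2-multiple.json") as f:  # assumed output path
    data = json.load(f)

for task, scores in sorted(data.items()):
    if task.startswith("multiple-"):
        lang = task.split("-", 1)[1]
        print(f"{lang:>6}: pass@1 = {scores['pass@1']:.3f}")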

Workflow 3: Instruction-Tuned Model Evaluation

Evaluate chat/instruction models with proper formatting.

Checklist:

Instruction Model Evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation

Step 1: Choose instruction tasks

  • instruct-humaneval: HumanEval with instruction prompts
  • humanevalsynthesize-{lang}: HumanEvalPack synthesis tasks

Step 2: Configure instruction tokens

# For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks instruct-humaneval \
  --instruction_tokens "<s>[INST],</s>,[/INST]" \
  --max_length_generation 512 \
  --allow_code_execution

Step 3: HumanEvalPack for instruction models

# Test code synthesis (HumanEvalPack covers 6 languages; two shown here)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks humanevalsynthesize-python,humanevalsynthesize-js \
  --prompt instruct \
  --max_length_generation 512 \
  --allow_code_execution

Workflow 4: Compare Multiple Models

Benchmark suite for model comparison.

Step 1: Create evaluation script

#!/bin/bash
# eval_models.sh

MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

mkdir -p results

for model in "${MODELS[@]}"; do
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"

  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done

Step 2: Generate comparison table

import json
import pandas as pd

models = ["bigcode-starcoder2-7b", "codellama-CodeLlama-7b-hf", "deepseek-ai-deepseek-coder-6.7b-base"]
results = []

for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
        results.append({
            "Model": model,
            "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
            "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}"
        })

df = pd.DataFrame(results)
print(df.to_markdown(index=False))

When to Use vs Alternatives

Use BigCode Evaluation Harness when:

  • Evaluating code generation models specifically
  • Need multi-language evaluation (18 languages via MultiPL-E)
  • Testing functional correctness with unit tests (pass@k)
  • Benchmarking for BigCode/HuggingFace leaderboards
  • Evaluating fill-in-the-middle (FIM) capabilities

Use alternatives instead:

  • lm-evaluation-harness: General LLM benchmarks (MMLU, GSM8K, HellaSwag)
  • EvalPlus: Stricter HumanEval+/MBPP+ with more test cases
  • SWE-bench: Real-world GitHub issue resolution
  • LiveCodeBench: Contamination-free, continuously updated problems
  • CodeXGLUE: Code understanding tasks (clone detection, defect prediction)

Supported Benchmarks

| Benchmark | Problems | Languages | Metric | Use Case |
|---|---|---|---|---|
| HumanEval | 164 | Python | pass@k | Standard code completion |
| HumanEval+ | 164 | Python | pass@k | Stricter evaluation (80× tests) |
| MBPP | 500 | Python | pass@k | Entry-level problems |
| MBPP+ | 399 | Python | pass@k | Stricter evaluation (35× tests) |
| MultiPL-E | 164×18 | 18 languages | pass@k | Multi-language evaluation |
| APPS | 10,000 | Python | pass@k | Competition-level |
| DS-1000 | 1,000 | Python | pass@k | Data science (pandas, numpy, etc.) |
| HumanEvalPack | 164×3×6 | 6 languages | pass@k | Synthesis/fix/explain |
| Mercury | 1,889 | Python | Efficiency | Computational efficiency |

Common Issues

Issue: Different results than reported in papers

Check these factors:

# 1. Verify n_samples (need 200 for accurate pass@k)
--n_samples 200

# 2. Check temperature (0.2 for greedy-ish, 0.8 for sampling)
--temperature 0.8

# 3. Verify task name matches exactly
--tasks humaneval  # Not "human_eval" or "HumanEval"

# 4. Check max_length_generation
--max_length_generation 512  # Increase for longer problems
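
Low n_samples also makes scores noisy from run to run. A rough back-of-the-envelope sketch (assuming, unrealistically, the same per-sample success rate on every problem; not harness code) shows how the sampling error of pass@1 shrinks with more samples:

# sampling_noise.py - rough estimate of pass@1 sampling error vs. n_samples (illustrative)
import math

p = 0.35  # assumed per-sample success rate on a single problem
for n in (20, 200):
    se_problem = math.sqrt(p * (1 - p) / n)      # binomial standard error per problem
    se_benchmark = se_problem / math.sqrt(164)   # averaged over 164 HumanEval problems
    print(f"n_samples={n:3d}: per-problem SE ~ {se_problem:.3f}, benchmark SE ~ {se_benchmark:.4f}")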

Issue: CUDA out of memory

# Use quantization
--load_in_8bit
# OR
--load_in_4bit

# Reduce batch size
--batch_size 1

# Set memory limit
--max_memory_per_gpu "20GiB"

Issue: Code execution hangs or times out

Use Docker for safe execution:

# Generate on host (no execution)
--generation_only --save_generations

# Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...

Issue: Low scores on instruction models

Ensure proper instruction formatting:

# Use instruction-specific tasks
--tasks instruct-humaneval

# Set instruction tokens for your model
--instruction_tokens "<s>[INST],</s>,[/INST]"

Issue: MultiPL-E language failures

Use the dedicated Docker image:

docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

Command Reference

| Argument | Default | Description |
|---|---|---|
| --model | - | HuggingFace model ID or local path |
| --tasks | - | Comma-separated task names |
| --n_samples | 1 | Samples per problem (200 for pass@k) |
| --temperature | 0.2 | Sampling temperature |
| --max_length_generation | 512 | Max tokens (prompt + generation) |
| --batch_size | 1 | Batch size per GPU |
| --allow_code_execution | False | Enable code execution (required) |
| --generation_only | False | Generate without evaluation |
| --load_generations_path | - | Load pre-generated solutions |
| --save_generations | False | Save generated code |
| --metric_output_path | results.json | Output file for metrics |
| --load_in_8bit | False | 8-bit quantization |
| --load_in_4bit | False | 4-bit quantization |
| --trust_remote_code | False | Allow custom model code |
| --precision | fp32 | Model precision (fp32/fp16/bf16) |

Hardware Requirements

| Model Size | VRAM (fp16) | VRAM (4-bit) | Time (HumanEval, n=200) |
|---|---|---|---|
| 7B | 14GB | 6GB | ~30 min (A100) |
| 13B | 26GB | 10GB | ~1 hour (A100) |
| 34B | 68GB | 20GB | ~2 hours (A100) |

Resources

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/evaluation-bigcode-evaluation-harness
Topics: anthropic, anthropic-claude, claude, claude-code

Related Skills

phoenix-observability

Testing

phoenix-observability is an open-source platform for tracing, evaluating, and monitoring LLM applications. Use it for debugging with detailed traces, running dataset evaluations, and monitoring production systems in real-time. It offers self-hosted observability with features like experiment pipelines and OpenTelemetry integration.

View skill

langsmith-observability

Meta

This skill integrates LangSmith for LLM observability, enabling tracing, evaluation, and monitoring of AI applications. Developers should use it for debugging prompts and chains, systematically testing model outputs, and monitoring production systems. Key capabilities include performance analysis and building regression testing pipelines.

View skill

evaluating-llms-harness

Testing

This Claude Skill runs the industry-standard lm-evaluation-harness to benchmark LLMs across 60+ academic tasks like MMLU and GSM8K. Use it to rigorously compare model quality, report academic results, or track training progress. It supports evaluating models from HuggingFace, vLLM, and various APIs.

View skill

sglang

Meta

SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.

View skill