
evaluating-llms-harness

davila7
Updated 10 days ago
Tags: Testing, Evaluation, LM Evaluation Harness, Benchmarking, MMLU, HumanEval, GSM8K, EleutherAI, Model Quality, Academic Benchmarks, Industry Standard

About

This skill runs standardized LLM evaluations across 60+ academic benchmarks such as MMLU and GSM8K using the industry-standard lm-evaluation-harness from EleutherAI. Use it to benchmark model quality, compare models, or track training progress; it supports HuggingFace, vLLM, and API-based models and provides a consistent, widely adopted way to report academic results.
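For reference, a minimal sketch of kicking off an evaluation from Python, assuming the lm-evaluation-harness v0.4+ API (the model name, task choice, and example limit below are illustrative, not part of this skill):

import lm_eval

# Evaluate a small HuggingFace model on GSM8K with 5-shot prompting.
# limit caps the number of examples so a smoke test finishes quickly;
# drop it to get full benchmark numbers.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=8,
    limit=50,
)
print(results["results"]["gsm8k"])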

Quick Install

Claude Code

Recommended (primary):
npx skills add davila7/claude-code-templates -a claude-code

Plugin command (alternative):
/plugin add https://github.com/davila7/claude-code-templates

Git clone (alternative):
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/evaluating-llms-harness

Copy and paste one of these commands into Claude Code to install this skill.

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/evaluation-lm-evaluation-harness
Topics: anthropic, anthropic-claude, claude, claude-code

Related Skills

evaluating-code-models

Meta

This skill benchmarks code generation models using industry-standard evaluations like HumanEval and MBPP across multiple programming languages. It calculates pass@k metrics for comparing model performance, testing multi-language support, and measuring code quality. Developers should use it when rigorously evaluating or comparing coding models, as it's the same tool powering HuggingFace's code leaderboards.
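For context, the pass@k these benchmarks report is usually the unbiased estimator from the HumanEval paper; a short Python sketch (n is the number of samples generated per problem, c how many of them passed the tests):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of the probability that at least one of k
    # samples passes, given c of n generated samples were correct.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=37, k=10))  # e.g. 200 samples per problem, 37 correct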


langsmith-observability

Meta

LangSmith provides LLM observability for tracing, evaluating, and monitoring AI applications. Developers should use it for debugging prompts and chains, systematic output evaluation, and monitoring production systems. Its key capabilities include performance tracing, dataset testing, and analysis of latency and token usage.
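A minimal tracing sketch, assuming the langsmith Python SDK's traceable decorator and that your LangSmith API key and tracing flag are set in the environment (the summarize function below is a stand-in, not part of the skill):

from langsmith import traceable

@traceable(name="summarize")  # each call is recorded as a run in LangSmith
def summarize(text: str) -> str:
    # Call your LLM here; inputs, outputs, and latency are captured per run.
    return text[:100]

summarize("LangSmith traces this call, including its inputs and outputs.")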


phoenix-observability

Testing

Phoenix is an open-source AI observability platform for tracing, evaluating, and monitoring LLM applications. It provides detailed traces for debugging, runs evaluations on datasets, and offers real-time monitoring for production systems. Key capabilities include experiment pipelines and self-hosted observability without vendor lock-in.
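A minimal self-hosted sketch, assuming the arize-phoenix and openinference-instrumentation-openai packages (exact API details may differ by version):

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

session = px.launch_app()         # local Phoenix UI, no external service needed
tracer_provider = register()      # route OpenTelemetry spans to Phoenix
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# Subsequent OpenAI client calls now show up as traces in the Phoenix UI.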
