
evaluating-llms-harness

davila7
Updated 10 days ago
Tags: Testing, Evaluation, LM Evaluation Harness, Benchmarking, MMLU, HumanEval, GSM8K, EleutherAI, Model Quality, Academic Benchmarks, Industry Standard

About

This skill runs standardized LLM evaluations across 60+ academic benchmarks such as MMLU and GSM8K using the industry-standard lm-evaluation-harness from EleutherAI. Use it to benchmark model quality, compare models, or track training progress; it supports HuggingFace, vLLM, and API-based models and provides a consistent, widely adopted way to report academic results.
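For reference, a minimal sketch of kicking off an evaluation from Python, assuming the lm-evaluation-harness v0.4+ API (the model name, task choice, and example limit below are illustrative, not part of this skill):

import lm_eval

# Evaluate a small HuggingFace model on GSM8K with 5-shot prompting.
# limit caps the number of examples so a smoke test finishes quickly;
# drop it to get full benchmark numbers.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=8,
    limit=50,
)
print(results["results"]["gsm8k"])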

Quick Install

Claude Code

Recommended (primary):
npx skills add davila7/claude-code-templates -a claude-code

Plugin command (alternative):
/plugin add https://github.com/davila7/claude-code-templates

Git clone (alternative):
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/evaluating-llms-harness

Copy and paste one of these commands into Claude Code to install this skill.

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/evaluation-lm-evaluation-harness
Topics: anthropic, anthropic-claude, claude, claude-code

Related Skills

evaluating-code-models

Meta

This skill benchmarks code generation models using industry-standard evaluations like HumanEval and MBPP across multiple programming languages. It calculates pass@k metrics for comparing model performance, testing multi-language support, and measuring code quality. Developers should use it when rigorously evaluating or comparing coding models, as it's the same tool powering HuggingFace's code leaderboards.
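For context, the pass@k these benchmarks report is usually the unbiased estimator from the HumanEval paper; a short Python sketch (n is the number of samples generated per problem, c how many of them passed the tests):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of the probability that at least one of k
    # samples passes, given c of n generated samples were correct.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=37, k=10))  # e.g. 200 samples per problem, 37 correct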


langsmith-observability

Meta

LangSmith provides LLM observability for tracing, evaluating, and monitoring AI applications. Developers should use it for debugging prompts and chains, systematic output evaluation, and monitoring production systems. Its key capabilities include performance tracing, dataset testing, and analysis of latency and token usage.
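A minimal tracing sketch, assuming the langsmith Python SDK's traceable decorator and that your LangSmith API key and tracing flag are set in the environment (the summarize function below is a stand-in, not part of the skill):

from langsmith import traceable

@traceable(name="summarize")  # each call is recorded as a run in LangSmith
def summarize(text: str) -> str:
    # Call your LLM here; inputs, outputs, and latency are captured per run.
    return text[:100]

summarize("LangSmith traces this call, including its inputs and outputs.")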


phoenix-observability

Testing

Phoenix is an open-source AI observability platform for tracing, evaluating, and monitoring LLM applications. It provides detailed traces for debugging, runs evaluations on datasets, and offers real-time monitoring for production systems. Key capabilities include experiment pipelines and self-hosted observability without vendor lock-in.
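A minimal self-hosted sketch, assuming the arize-phoenix and openinference-instrumentation-openai packages (exact API details may differ by version):

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

session = px.launch_app()         # local Phoenix UI, no external service needed
tracer_provider = register()      # route OpenTelemetry spans to Phoenix
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# Subsequent OpenAI client calls now show up as traces in the Phoenix UI.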
