evaluating-llms-harness
About
This skill runs standardized LLM evaluations across 60+ academic benchmarks such as MMLU and GSM8K, using the industry-standard lm-evaluation-harness. Use it to benchmark model quality, compare models against each other, or track training progress, with support for HuggingFace, vLLM, and API-based models. It provides a consistent, widely adopted way to report academic results.
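As a rough illustration of what a run with the harness looks like, here is a minimal sketch using its Python API; the checkpoint name, task list, and few-shot setting are illustrative choices, not defaults of this skill.

```python
# Minimal sketch, assuming `pip install lm-eval` and a HuggingFace-hosted model;
# the checkpoint, tasks, and settings below are illustrative only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace backend; vLLM and API backends also exist
    model_args="pretrained=EleutherAI/pythia-1.4b",  # illustrative checkpoint
    tasks=["gsm8k", "mmlu"],                         # benchmarks to run
    num_fewshot=5,                                   # few-shot examples per prompt
    batch_size=8,
)

# Per-task metrics (e.g. accuracy / exact match) live under results["results"].
print(results["results"])
```

The same run can be expressed on the command line with the `lm_eval` CLI, which is what most reported results use.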
Quick Install
Claude Code
Recommended:
npx skills add davila7/claude-code-templates -a claude-code

Or add as a plugin:
/plugin add https://github.com/davila7/claude-code-templates

Or clone manually:
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/evaluating-llms-harness

Copy one of these commands and paste it into Claude Code to install this skill.
GitHub Repository
Related Skills
evaluating-code-models
This skill benchmarks code generation models using industry-standard evaluations like HumanEval and MBPP across multiple programming languages. It calculates pass@k metrics for comparing model performance, testing multi-language support, and measuring code quality. Developers should use it when rigorously evaluating or comparing coding models, as it's the same tool powering HuggingFace's code leaderboards.
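For context, pass@k in these benchmarks is normally computed with the unbiased estimator from the HumanEval paper; the small sketch below shows that calculation (the helper name is ours, not part of the skill).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n = samples generated per problem, c = samples passing the tests, k = budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 200 samples with 30 passing gives pass@1 = 0.15; pass@10 is noticeably higher
print(pass_at_k(200, 30, 1), pass_at_k(200, 30, 10))
```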
langsmith-observability
LangSmith provides LLM observability for tracing, evaluating, and monitoring AI applications. Developers should use it for debugging prompts and chains, systematic output evaluation, and monitoring production systems. Its key capabilities include performance tracing, dataset testing, and analysis of latency and token usage.
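As a rough sketch of what instrumenting a function for LangSmith tracing can look like (assumes the `langsmith` package and an API key in the environment; the function itself is hypothetical):

```python
# Rough sketch, assuming `pip install langsmith` and LANGSMITH_API_KEY plus
# LANGSMITH_TRACING=true (or the older LANGCHAIN_* variables) in the environment.
from langsmith import traceable

@traceable(name="summarize")  # records inputs, outputs, latency, and errors for this call
def summarize(text: str) -> str:
    # ... call your LLM of choice here; this stub just truncates ...
    return text[:100]

summarize("LangSmith traces this call and its metadata to the configured project.")
```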
phoenix-observability
Phoenix is an open-source AI observability platform for tracing, evaluating, and monitoring LLM applications. It provides detailed traces for debugging, runs evaluations on datasets, and offers real-time monitoring for production systems. Key capabilities include experiment pipelines and self-hosted observability without vendor lock-in.
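A minimal sketch of spinning up Phoenix locally, assuming `pip install arize-phoenix`; whatever application you instrument with OpenTelemetry/OpenInference then sends its traces to this local UI.

```python
# Minimal sketch, assuming `pip install arize-phoenix`; starts the self-hosted
# Phoenix UI so traces exported to it show up for inspection.
import phoenix as px

session = px.launch_app()  # local, self-hosted UI; no external service required
print(session.url)         # open this URL in a browser to inspect traces
```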
