evaluating-code-models
About
This skill benchmarks code generation models using industry-standard evaluations such as HumanEval and MBPP across multiple programming languages. It computes pass@k metrics to compare model performance, test multi-language support, and measure code quality. Developers should use it when they want to rigorously evaluate or compare coding models, since it is the same tool that powers HuggingFace's code leaderboards.
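For context, pass@k is the standard metric these benchmarks report: the probability that at least one of k sampled completions for a problem passes the unit tests. The sketch below shows the widely used unbiased estimator (1 - C(n-c, k)/C(n, k)) from the HumanEval paper, computed from n total samples of which c are correct; it illustrates the metric only and is not the skill's own code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions drawn from n generations passes, given c of the n pass."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of them pass the unit tests
print(round(pass_at_k(n=200, c=37, k=1), 4))   # 0.185, i.e. 37/200
print(round(pass_at_k(n=200, c=37, k=10), 4))  # larger, since any of 10 tries may succeed
```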
Quick Installation
Claude Code
Recommended: npx skills add davila7/claude-code-templates -a claude-code
Plugin: /plugin add https://github.com/davila7/claude-code-templates
Manual: git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/evaluating-code-models
Copy one of these commands and paste it into Claude Code to install this skill.
GitHub Repository
Related Skills
langsmith-observability
LangSmith provides LLM observability for tracing, evaluating, and monitoring AI applications. Developers should use it for debugging prompts and chains, systematic output evaluation, and monitoring production systems. Its key capabilities include performance tracing, dataset testing, and analysis of latency and token usage.
phoenix-observability
Phoenix is an open-source AI observability platform for tracing, evaluating, and monitoring LLM applications. It provides detailed traces for debugging, runs evaluations on datasets, and offers real-time monitoring for production systems. Key capabilities include experiment pipelines and self-hosted observability without vendor lock-in.
evaluating-llms-harness
This skill runs standardized LLM evaluations across 60+ academic benchmarks like MMLU and GSM8K using the industry-standard lm-evaluation-harness. Use it for benchmarking model quality, comparing different models, or tracking training progress with support for HuggingFace, vLLM, and API-based models. It provides a consistent, widely-adopted method for reporting academic results.
evaluating-code-models
This skill runs standardized code generation benchmarks like HumanEval and MBPP to evaluate model performance using pass@k metrics. It's the industry-standard tool from the BigCode Project for comparing coding abilities, testing multi-language support, and measuring code quality. Use it when benchmarking models, comparing their capabilities, or replicating HuggingFace leaderboard evaluations.
