evaluating-code-models
About
This skill benchmarks code generation models using industry-standard evaluations such as HumanEval and MBPP across multiple programming languages. It computes pass@k metrics to compare model performance, test multi-language support, and measure code quality. Developers should use it when they want to rigorously evaluate or compare coding models, since it is the same tool that powers HuggingFace's code leaderboards.
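For context, pass@k is the standard metric these benchmarks report: the probability that at least one of k sampled completions for a problem passes the unit tests. The sketch below shows the widely used unbiased estimator (1 - C(n-c, k)/C(n, k)) from the HumanEval paper, computed from n total samples of which c are correct; it illustrates the metric only and is not the skill's own code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions drawn from n generations passes, given c of the n pass."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of them pass the unit tests
print(round(pass_at_k(n=200, c=37, k=1), 4))   # 0.185, i.e. 37/200
print(round(pass_at_k(n=200, c=37, k=10), 4))  # larger, since any of 10 tries may succeed
```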
Quick Installation
Claude Code
Recommended: npx skills add davila7/claude-code-templates -a claude-code
Plugin: /plugin add https://github.com/davila7/claude-code-templates
Manual: git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/evaluating-code-models
Copy one of these commands and paste it into Claude Code to install this skill.
GitHub Repository
Related Skills
langsmith-observability
LangSmith provides LLM observability for tracing, evaluating, and monitoring AI applications. Developers should use it for debugging prompts and chains, systematic output evaluation, and monitoring production systems. Its key capabilities include performance tracing, dataset testing, and analysis of latency and token usage.
phoenix-observability
Phoenix is an open-source AI observability platform for tracing, evaluating, and monitoring LLM applications. It provides detailed traces for debugging, runs evaluations on datasets, and offers real-time monitoring for production systems. Key capabilities include experiment pipelines and self-hosted observability without vendor lock-in.
evaluating-llms-harness
This skill runs standardized LLM evaluations across 60+ academic benchmarks like MMLU and GSM8K using the industry-standard lm-evaluation-harness. Use it for benchmarking model quality, comparing different models, or tracking training progress with support for HuggingFace, vLLM, and API-based models. It provides a consistent, widely-adopted method for reporting academic results.
evaluating-code-models
This skill runs standardized code generation benchmarks like HumanEval and MBPP to evaluate model performance using pass@k metrics. It's the industry-standard tool from the BigCode Project for comparing coding abilities, testing multi-language support, and measuring code quality. Use it when benchmarking models, comparing their capabilities, or replicating HuggingFace leaderboard evaluations.
