evaluation-metrics
About
This Claude Skill automatically activates during LLM performance evaluation to ensure proper metrics and testing. It handles evaluation datasets, computes metrics, facilitates A/B testing, and implements LLM-as-judge patterns. Use it when you need structured experiment tracking and rigorous performance assessment for your LLM applications.
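As a rough illustration of the kind of metric computation and A/B comparison described above, here is a minimal sketch. It is not the skill's actual implementation: the `Example`, `exact_match`, `evaluate`, and `compare` names are hypothetical, exact match is an assumed choice of metric, and the two "models" are plain functions standing in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Example:
    """One evaluation item: a prompt and its expected answer (hypothetical schema)."""
    prompt: str
    expected: str


def exact_match(prediction: str, expected: str) -> float:
    """Return 1.0 when the whitespace- and case-normalized strings are equal, else 0.0."""
    return float(prediction.strip().lower() == expected.strip().lower())


def evaluate(model: Callable[[str], str], dataset: Sequence[Example]) -> float:
    """Mean exact-match score of `model` over `dataset`."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    scores = [exact_match(model(ex.prompt), ex.expected) for ex in dataset]
    return sum(scores) / len(scores)


def compare(
    model_a: Callable[[str], str],
    model_b: Callable[[str], str],
    dataset: Sequence[Example],
) -> dict:
    """A/B comparison: score both variants on the same dataset."""
    score_a = evaluate(model_a, dataset)
    score_b = evaluate(model_b, dataset)
    return {"a": score_a, "b": score_b, "delta": score_b - score_a}


# Stand-in "models": plain functions instead of real LLM calls (assumption).
dataset = [Example("2+2?", "4"), Example("Capital of France?", "Paris")]
variant_a = lambda prompt: "4"
variant_b = lambda prompt: "4" if "2+2" in prompt else "paris"

result = compare(variant_a, variant_b, dataset)
print(result)
```

Scoring both variants on the same dataset keeps the comparison paired. In practice you would likely want a larger dataset and a significance test before acting on the delta.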
Quick Install
Claude Code
Recommended
npx skills add mattnigh/skills_collection -a claude-code
/plugin add https://github.com/mattnigh/skills_collection
git clone https://github.com/mattnigh/skills_collection.git ~/.claude/skills/evaluation-metrics
Copy and paste this command in Claude Code to install this skill
GitHub Repository
Related Skills
model-selection
Other
This Claude Skill automatically guides model and provider selection for LLM applications. It provides patterns for cost optimization, fallback strategies, and multi-model routing across providers like OpenAI and Anthropic. Use it when implementing model comparison, provider failover, or performance/cost trade-offs in your LLM system.
agent-orchestration-patterns
Other
This Claude Skill automatically guides multi-agent system design by enforcing proper tool schema creation with Pydantic, managing agent states, and implementing robust error handling. It provides orchestration patterns for reliable tool-calling workflows and agent routing. Use it when building complex agent systems to ensure maintainable and structured interactions.
ai-security
Other
The ai-security skill automatically applies security protections for AI/LLM applications. It provides prompt injection detection, PII redaction, output filtering, and content moderation. Use this skill when building LLM applications that need built-in security guardrails.
rag-design-patterns
Other
This Claude Skill automatically provides RAG system design patterns when building retrieval-augmented generation applications. It offers guidance on document chunking strategies, vector database selection, embedding management, and retrieval optimization techniques. Developers should use it when implementing RAG systems to ensure proper architecture and performance.
