
Evals

majiayu000
Updated: Today
Testing

About

Evals is an agent evaluation framework for testing and benchmarking Claude Code agents using Anthropic's best practices. It provides three grader types (code-based, model-based, and human), transcript collection, and pass@k metrics for regression and capability testing. Use this skill when you need to evaluate, validate, or benchmark agent behavior.

Quick Install

Claude Code

Plugin command (recommended)
/plugin add https://github.com/majiayu000/claude-skill-registry
Git clone (alternative)
git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/Evals

Copy and paste this command into Claude Code to install the skill.

Documentation

Customization

Before executing, check for user customizations at: ~/.claude/skills/CORE/USER/SKILLCUSTOMIZATIONS/Evals/

If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.
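A minimal sketch of that customization check, assuming Bun/Node file APIs; the directory layout beyond PREFERENCES.md and how preferences are applied are not specified here, so treat the details as illustrative:

// Hypothetical sketch: look for user customizations before running a workflow.
import { existsSync, readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

const customDir = join(homedir(), ".claude/skills/CORE/USER/SKILLCUSTOMIZATIONS/Evals");

let preferences: string | null = null;
if (existsSync(customDir)) {
  const prefsPath = join(customDir, "PREFERENCES.md");
  if (existsSync(prefsPath)) {
    // User preferences override skill defaults.
    preferences = readFileSync(prefsPath, "utf8");
  }
}
// If preferences is still null, proceed with the skill's defaults.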

🚨 MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)

You MUST send this notification BEFORE doing anything else when this skill is invoked.

  1. Send voice notification:

    curl -s -X POST http://localhost:8888/notify \
      -H "Content-Type: application/json" \
      -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
      > /dev/null 2>&1 &
    
  2. Output text notification:

    Running the **WorkflowName** workflow in the **Evals** skill to ACTION...
    

This is not optional. Execute this curl command immediately upon skill invocation.

Evals - AI Agent Evaluation Framework

Comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).

Key differentiator: Evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.


When to Activate

  • "run evals", "test this agent", "evaluate", "check quality", "benchmark"
  • "regression test", "capability test"
  • Compare agent behaviors across changes
  • Validate agent workflows before deployment
  • Verify ALGORITHM ISC rows
  • Create new evaluation tasks from failures

Core Concepts

Three Grader Types

| Type | Strengths | Weaknesses | Use For |
|------|-----------|------------|---------|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |
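To illustrate why code-based graders are fast and deterministic, here is a hypothetical string_match-style grader in TypeScript. The skill's real grader interfaces live in Types/index.ts and Graders/CodeBased/, so the names and shapes below are assumptions, not the actual API:

// Hypothetical grader result shape (the real types are in Types/index.ts).
interface GraderResult {
  score: number;   // 0.0 to 1.0
  passed: boolean;
  details?: string;
}

// A string_match-style code-based grader: cheap and reproducible,
// but brittle if the expected wording changes.
function stringMatchGrader(output: string, expected: string): GraderResult {
  const passed = output.includes(expected);
  return {
    score: passed ? 1 : 0,
    passed,
    details: passed ? "expected substring found" : `missing substring: ${expected}`,
  };
}

// stringMatchGrader("All 12 tests passed", "tests passed")
// -> { score: 1, passed: true, details: "expected substring found" }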

Evaluation Types

| Type | Pass Target | Purpose |
|------|-------------|---------|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |

Key Metrics

  • pass@k: Probability of at least 1 success in k trials (measures capability)
  • pass^k: Probability all k trials succeed (measures consistency/reliability)
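A small sketch of how the two metrics relate, assuming independent trials with the same per-trial success rate p; the skill's own computation presumably lives in Tools/TrialRunner.ts, so this is only illustrative:

// Estimate the per-trial success rate from observed outcomes, then derive
// pass@k (at least one of k trials succeeds) and pass^k (all k succeed).
function passMetrics(trialResults: boolean[], k: number) {
  const successes = trialResults.filter(Boolean).length;
  const p = successes / trialResults.length; // empirical per-trial success rate
  return {
    passAtK: 1 - Math.pow(1 - p, k), // capability: probability of >= 1 success
    passHatK: Math.pow(p, k),        // consistency: probability all k succeed
  };
}

// Example: 2 successes out of 3 trials (p ≈ 0.67), k = 3
// passAtK ≈ 0.96, passHatK ≈ 0.30 — capable, but not yet reliable.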

Workflow Routing

| Trigger | Workflow |
|---------|----------|
| "run evals", "evaluate suite" | Run suite via Tools/AlgorithmBridge.ts |
| "log failure" | Log failure via Tools/FailureToTask.ts log |
| "convert failures" | Convert to tasks via Tools/FailureToTask.ts convert-all |
| "create suite" | Create suite via Tools/SuiteManager.ts create |
| "check saturation" | Check via Tools/SuiteManager.ts check-saturation |

Quick Reference

CLI Commands

# Run an eval suite
bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s <suite>

# Log a failure for later conversion
bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts log "description" -c category -s severity

# Convert failures to test tasks
bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts convert-all

# Manage suites
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts list
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts check-saturation <name>
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts graduate <name>

ALGORITHM Integration

Evals is a verification method for THE ALGORITHM ISC rows:

# Run eval and update ISC row
bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u

ISC rows can specify eval verification:

| # | What Ideal Looks Like | Verify |
|---|----------------------|--------|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |

Available Graders

Code-Based (Fast, Deterministic)

| Grader | Use Case |
|--------|----------|
| string_match | Exact substring matching |
| regex_match | Pattern matching |
| binary_tests | Run test files |
| static_analysis | Lint, type-check, security scan |
| state_check | Verify system state after execution |
| tool_calls | Verify specific tools were called |

Model-Based (Nuanced)

| Grader | Use Case |
|--------|----------|
| llm_rubric | Score against detailed rubric |
| natural_language_assert | Check assertions are true |
| pairwise_comparison | Compare to reference with position swap |
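For example, pairwise_comparison typically asks the judge twice, once with the candidate first and once with the reference first, so that position bias cancels out. A hedged sketch, with the judge function standing in for whatever model call the real grader in Graders/ModelBased/ makes:

// Hypothetical position-swapped pairwise comparison.
type Preference = "A" | "B";
type Judge = (answerA: string, answerB: string) => Promise<Preference>;

async function pairwiseCompare(candidate: string, reference: string, judge: Judge) {
  // Ask once with the candidate in position A...
  const first = await judge(candidate, reference);
  // ...and once with positions swapped, to control for position bias.
  const second = await judge(reference, candidate);

  const candidateWinsFirst = first === "A";
  const candidateWinsSecond = second === "B";

  if (candidateWinsFirst && candidateWinsSecond) return "candidate_wins";
  if (!candidateWinsFirst && !candidateWinsSecond) return "reference_wins";
  return "tie"; // disagreement across orderings suggests position bias or a close call
}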

Domain Patterns

Pre-configured grader stacks for common agent types:

| Domain | Primary Graders |
|--------|-----------------|
| coding | binary_tests + static_analysis + tool_calls + llm_rubric |
| conversational | llm_rubric + natural_language_assert + state_check |
| research | llm_rubric + natural_language_assert + tool_calls |
| computer_use | state_check + tool_calls + llm_rubric |

See Data/DomainPatterns.yaml for full configurations.


Task Schema (YAML)

task:
  id: "fix-auth-bypass_1"
  description: "Fix authentication bypass when password is empty"
  type: regression  # or capability
  domain: coding

  graders:
    - type: binary_tests
      required: [test_empty_pw.py]
      weight: 0.30

    - type: tool_calls
      weight: 0.20
      params:
        sequence: [read_file, edit_file, run_tests]

    - type: llm_rubric
      weight: 0.50
      params:
        rubric: prompts/security_review.md

  trials: 3
  pass_threshold: 0.75
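The weight and pass_threshold fields suggest a weighted-sum aggregation of grader scores per trial. Assuming that is how the framework combines them (the authoritative logic would be in Tools/TrialRunner.ts), the example task above would be scored roughly like this:

// Hypothetical weighted aggregation for a single trial of the task above.
interface GraderScore {
  type: string;
  weight: number;
  score: number; // 0.0 to 1.0 from the grader
}

function trialPassed(scores: GraderScore[], passThreshold: number): boolean {
  const weighted = scores.reduce((sum, g) => sum + g.weight * g.score, 0);
  return weighted >= passThreshold;
}

// With weights 0.30 / 0.20 / 0.50 and pass_threshold 0.75:
// binary_tests = 1.0, tool_calls = 1.0, llm_rubric = 0.6
// -> 0.30*1.0 + 0.20*1.0 + 0.50*0.6 = 0.80 >= 0.75, so the trial passes.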

Resource Index

| Resource | Purpose |
|----------|---------|
| Types/index.ts | Core type definitions |
| Graders/CodeBased/ | Deterministic graders |
| Graders/ModelBased/ | LLM-powered graders |
| Tools/TranscriptCapture.ts | Capture agent trajectories |
| Tools/TrialRunner.ts | Multi-trial execution with pass@k |
| Tools/SuiteManager.ts | Suite management and saturation |
| Tools/FailureToTask.ts | Convert failures to test tasks |
| Tools/AlgorithmBridge.ts | ALGORITHM integration |
| Data/DomainPatterns.yaml | Domain-specific grader configs |

Key Principles (from Anthropic)

  1. Start with 20-50 real failures - Don't overthink, capture what actually broke
  2. Unambiguous tasks - Two experts should reach identical verdicts
  3. Balanced problem sets - Test both "should do" AND "should NOT do"
  4. Grade outputs, not paths - Don't penalize valid creative solutions
  5. Calibrate LLM judges - Against human expert judgment
  6. Check transcripts regularly - Verify graders work correctly
  7. Monitor saturation - Graduate to regression when hitting 95%+
  8. Build infrastructure early - Evals shape how quickly you can adopt new models

Related

  • ALGORITHM: Evals is a verification method
  • Science: Evals implements scientific method
  • Browser: For visual verification graders

GitHub Repository

majiayu000/claude-skill-registry
Path: skills/data/Evals

Related Skills

content-collections

Meta

This skill provides a production-tested setup for Content Collections, a TypeScript-first tool that transforms Markdown/MDX files into type-safe data collections with Zod validation. Use it when building blogs, documentation sites, or content-heavy Vite + React applications to ensure type safety and automatic content validation. It covers everything from Vite plugin configuration and MDX compilation to deployment optimization and schema validation.


evaluating-llms-harness

Testing

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.


cloudflare-turnstile

Meta

This skill provides comprehensive guidance for implementing Cloudflare Turnstile as a CAPTCHA-alternative bot protection system. It covers integration for forms, login pages, API endpoints, and frameworks like React/Next.js/Hono, while handling invisible challenges that maintain user experience. Use it when migrating from reCAPTCHA, debugging error codes, or implementing token validation and E2E tests.


webapp-testing

Testing

This Claude Skill provides a Playwright-based toolkit for testing local web applications through Python scripts. It enables frontend verification, UI debugging, screenshot capture, and log viewing while managing server lifecycles. Use it for browser automation tasks but run scripts directly rather than reading their source code to avoid context pollution.
