verify-agent-output

Name: verify-agent-output
Author: pjt222

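Updated 1 month ago

Category: Meta; tags: ai, automation, design

About

This skill verifies work products passed between agents in multi-agent workflows and builds an evidence trail. It provides structured verification before, during, and after execution, including fidelity checks on summaries and validation against external criteria. Use it to audit handoffs between agents, to produce external-facing outputs, and to ensure summaries accurately reflect their source material.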

Documentation

Verify Agent Output

Establish verifiable delivery between agents. Agent output → another agent or human → handoff needs more than "looks good." Define checkable expectations before work, generate evidence as side effect, validate vs external anchors not self-assessment. Core: fidelity can't be measured internally. Agent can't reliably verify own compressed out → verification needs external ref.

Use When

Multi-agent workflow → hands deliverables A → B
Agent produces external-facing out (reports, code, deployments) → human relies
Agent summarizes|compresses|transforms data → summary must faithfully represent source
Team coord pattern → structured handoff validation between members
Need to establish trust boundaries → what verify vs trust
Audit trail required for compliance|reproducibility

In

Required: Deliverable to verify (file, artifact, report, structured out)
Required: Expected outcome spec (what "done" looks like)
Optional: Source material (fidelity checks on summaries|transforms)
Optional: Trust boundary class (cross-agent, external-facing, internal)
Optional: Verification depth (spot-check, full, sample-based)

Do

Step 1: Define Expected Outcome Spec

Before exec, write what "done" looks like → concrete checkable conditions. Avoid subjective ("good quality") → verifiable assertions.

Categories:

Existence: File at path, endpoint responds, record present in DB
Shape: Out has N cols, JSON matches schema, fn has expected sig
Content: Val in range, str matches pattern, list contains required
Behavior: Test suite passes, cmd exits 0, API returns expected status
Consistency: Out hash matches in hash, row count preserved after transform, totals reconcile

Example spec:

expected_outcome:
  existence:
    - path: "output/report.html"
    - path: "output/data.csv"
  shape:
    - file: "output/data.csv"
      columns: ["id", "name", "score", "grade"]
      min_rows: 100
  content:
    - file: "output/data.csv"
      column: "score"
      range: [0, 100]
    - file: "output/report.html"
      contains: ["Summary", "Methodology", "Results"]
  behavior:
    - command: "Rscript -e 'testthat::test_dir(\"tests\", stop_on_failure = TRUE)'"
      exit_code: 0
  consistency:
    - check: "row_count"
      source: "input/raw.csv"
      target: "output/data.csv"
      tolerance: 0

Got: Written spec w/ 1+ checkable condition per deliverable. Every condition machine-verifiable (script|cmd, not reading + judging).

If err: Can't state concretely → task underspecified. Push back on definition before proceed → vague expectations → unverifiable work.
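
The consistency entry in the example spec carries a tolerance field. A minimal sketch of how such a condition might be evaluated, assuming both files are CSVs with exactly one header line. The function name rows_within_tolerance is hypothetical, not part of the skill.

```shell
# Hypothetical helper: compare data-row counts of two CSV files.
# Assumes one header line per file and no embedded newlines inside fields.
rows_within_tolerance() {
  local source_file="$1" target_file="$2" tolerance="$3"
  local source_rows target_rows delta
  # awk counts the last record even when the file lacks a trailing newline
  source_rows=$(awk 'NR > 1 { n++ } END { print n + 0 }' "$source_file")
  target_rows=$(awk 'NR > 1 { n++ } END { print n + 0 }' "$target_file")
  delta=$(( source_rows - target_rows ))
  if [ "$delta" -lt 0 ]; then
    delta=$(( -delta ))
  fi
  if [ "$delta" -le "$tolerance" ]; then
    echo "PASS: row_count $source_rows -> $target_rows (tolerance $tolerance)"
  else
    echo "FAIL: row_count $source_rows -> $target_rows (tolerance $tolerance)"
    return 1
  fi
}

# Demo on throwaway files: 3 source rows, 2 output rows
demo_dir=$(mktemp -d)
printf 'id,score\n1,90\n2,80\n3,70\n' > "$demo_dir/raw.csv"
printf 'id,score\n1,90\n2,80\n' > "$demo_dir/out.csv"
rows_within_tolerance "$demo_dir/raw.csv" "$demo_dir/out.csv" 1
rows_within_tolerance "$demo_dir/raw.csv" "$demo_dir/out.csv" 0 || true
```

The non-zero return on FAIL lets a caller chain the check into a larger script without parsing the message.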

Step 2: Evidence Trail During Exec

Work proceeds → emit structured evidence as side effect. Evidence trail not separate verification step → produced by exec itself.

Evidence types:

evidence:
  timing:
    started_at: "2026-03-12T10:00:00Z"
    completed_at: "2026-03-12T10:04:32Z"
    duration_seconds: 272
  checksums:
    - file: "output/data.csv"
      sha256: "a1b2c3..."
    - file: "output/report.html"
      sha256: "d4e5f6..."
  test_results:
    total: 24
    passed: 24
    failed: 0
    skipped: 0
  diff_summary:
    files_changed: 3
    insertions: 47
    deletions: 12
  tool_versions:
    r: "4.5.2"
    testthat: "3.2.1"

Practical cmds:

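# Evidence directory must exist before redirecting into it
mkdir -p evidence

# Checksums
sha256sum output/data.csv output/report.html > evidence/checksums.txt

# Row counts
wc -l < input/raw.csv > evidence/input_rows.txt
wc -l < output/data.csv > evidence/output_rows.txt

# Test results (R); stop_on_failure = FALSE so results are captured even when tests fail
Rscript -e "results <- testthat::test_dir('tests', stop_on_failure = FALSE); print(as.data.frame(results))" > evidence/test_results.txt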

# Git diff summary
git diff --stat HEAD~1 > evidence/diff_summary.txt

# Timing (wrap the actual command)
start_time=$(date +%s)
# ... do the work ...
end_time=$(date +%s)
echo "duration_seconds: $((end_time - start_time))" > evidence/timing.txt

Got: evidence/ dir (or structured log) w/ checksums + timing per produced artifact. Evidence generated as part of work, not reconstructed.

If err: Evidence gen interferes w/ exec → capture what you can w/o blocking. Min: record file checksums after completion → enables later verify even if real-time not captured.
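
Evidence as a side effect can be sketched as a wrapper around the work command, so timing, exit code, and checksum are captured at the moment of production. The name run_with_evidence and the file layout are assumptions for illustration. sha256sum is assumed available (GNU coreutils).

```shell
# Hypothetical wrapper: run a command and record evidence as a side effect.
# Usage: run_with_evidence <evidence_dir> <artifact_path> <command> [args...]
run_with_evidence() {
  local evidence_dir="$1" artifact="$2"
  shift 2
  mkdir -p "$evidence_dir"
  local started_at start_s end_s status=0
  started_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  start_s=$(date +%s)
  "$@" || status=$?          # keep going so a failed run still leaves evidence
  end_s=$(date +%s)
  {
    echo "started_at: \"$started_at\""
    echo "completed_at: \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\""
    echo "duration_seconds: $(( end_s - start_s ))"
    echo "exit_code: $status"
  } > "$evidence_dir/timing.txt"
  # Checksum only if the artifact was actually produced
  if [ -f "$artifact" ]; then
    sha256sum "$artifact" >> "$evidence_dir/checksums.txt"
  fi
  return "$status"
}

# Demo: the "work" is a one-line CSV write in a throwaway directory
work_dir=$(mktemp -d)
run_with_evidence "$work_dir/evidence" "$work_dir/data.csv" \
  sh -c "printf 'id,score\n1,90\n' > '$work_dir/data.csv'"
cat "$work_dir/evidence/timing.txt"
```

Later, sha256sum -c on the recorded checksums file shows whether the artifact still matches what was produced.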

Step 3: Validate Deliverables vs Expected

After exec, check vs spec from Step 1. External anchors — tests, schemas, checksums, row counts — not asking producer "is this correct?"

Validation checks by category:

# Existence
for file in output/report.html output/data.csv; do
  test -f "$file" && echo "PASS: $file exists" || echo "FAIL: $file missing"
done

# Shape (CSV column check); strip quotes and CR so quoted or CRLF headers compare cleanly
head -1 output/data.csv | tr -d '"\r' | tr ',' '\n' | sort > /tmp/actual_cols.txt
printf '%s\n' grade id name score > /tmp/expected_cols.txt
diff /tmp/expected_cols.txt /tmp/actual_cols.txt && echo "PASS: columns match" || echo "FAIL: column mismatch"

# Row count
actual_rows=$(wc -l < output/data.csv)
[ "$actual_rows" -ge 101 ] && echo "PASS: $actual_rows rows (>= 100 + header)" || echo "FAIL: only $actual_rows rows"

# Content range check (R)
Rscript -e '
  d <- read.csv("output/data.csv")
  stopifnot(all(d$score >= 0 & d$score <= 100))
  cat("PASS: all scores in [0, 100]\n")
'

# Behavior (stop_on_failure made explicit so failing tests give a non-zero exit)
Rscript -e "testthat::test_dir('tests', stop_on_failure = TRUE)" && echo "PASS: tests pass" || echo "FAIL: tests fail"

# Consistency (row count preserved)
input_rows=$(wc -l < input/raw.csv)
output_rows=$(wc -l < output/data.csv)
[ "$input_rows" -eq "$output_rows" ] && echo "PASS: row count preserved" || echo "FAIL: $input_rows -> $output_rows"

Got: All checks pass. Results recorded as structured out (PASS/FAIL per condition) alongside evidence trail Step 2.

If err: Don't silently accept partial passes. Any FAIL → triggers structured disagreement Step 6. Record passed + failed → partial results still valuable evidence.
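
Recording PASS/FAIL per condition without stopping at the first failure can be sketched as below. record_check, verdict, and the results-file format are hypothetical names chosen for illustration. Treating "no checks recorded" as a failure is an added assumption, in line with the idea that inability to verify is itself a finding.

```shell
# Hypothetical helpers: run each check, record the outcome, never stop early.
results_file=$(mktemp)

record_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name" >> "$results_file"
  else
    echo "FAIL $name" >> "$results_file"
  fi
}

# Overall verdict: any FAIL (or no checks at all) fails the deliverable.
verdict() {
  if [ ! -s "$results_file" ]; then
    echo "verification_result: FAIL (no checks recorded)"
    return 1
  fi
  if grep -q '^FAIL ' "$results_file"; then
    echo "verification_result: FAIL"
    return 1
  fi
  echo "verification_result: PASS"
}

# Demo: mktemp creates an empty file, so the second check fails
demo_file=$(mktemp)
record_check "file_exists" test -f "$demo_file"
record_check "non_empty" test -s "$demo_file"
cat "$results_file"
verdict || true
```

Passed and failed checks both stay in the results file, so partial results remain available as evidence.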

Step 4: Fidelity Checks on Compressed Outs

Agent summarizes|compresses|transforms → out smaller than input by design. Summary can't be verified by reading alone → must compare vs source. Sample-based spot checks → verify fidelity.

Procedure:

Random sample from source (3-5 items spot, 10% thorough)
Per sampled item → verify accurately represented in compressed out
Check fabricated content → items in out w/ no source

# Example: verify a summary report against source data

# 1. Select random data rows from source (skip the header line)
tail -n +2 input/raw.csv | shuf -n 5 > /tmp/sample.csv

# 2. For each sampled row, verify its ID appears in the output
#    (-F literal, -w whole word, so ID 1 does not match 10)
while IFS=, read -r id name score grade; do
  grep -qwF -- "$id" output/report.html && echo "PASS: $id found in report" || echo "FAIL: $id missing from report"
done < /tmp/sample.csv

# 3. Check for fabricated IDs in the output
# Extract ID values from output (\K drops the id=" prefix), verify each exists in the source ID column
grep -oP 'id="\K[^"]*' output/report.html | while read -r output_id; do
  cut -d, -f1 input/raw.csv | grep -qxF -- "$output_id" && echo "PASS: $output_id has source" || echo "FAIL: $output_id fabricated"
done

Text summaries → exact match impossible → verify key claims:

Quoted stats match source data
Named entities mentioned exist in source
Causal claims|rankings supported by underlying data
No items in summary absent from source

Got: All sampled items accurately represented. No fabricated content. Key stats in summary match computed vals from source.

If err: Fidelity fails → summary can't be trusted. Report specific discrepancies via structured disagreement Step 6. Producer must re-derive from source, not patch existing.
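
For text summaries, "quoted stats match source data" can be checked by recomputing the stat from source. A minimal sketch, assuming the claimed mean has already been extracted from the summary and the source is a simple CSV with a header and no quoted commas. stat_matches_source is a hypothetical name.

```shell
# Hypothetical helper: recompute the mean of one CSV column and compare it
# with the value the summary claims, within an absolute tolerance.
stat_matches_source() {
  local csv="$1" column_index="$2" claimed="$3" tolerance="$4"
  awk -F, -v col="$column_index" -v claimed="$claimed" -v tol="$tolerance" '
    NR > 1 { sum += $col; n++ }
    END {
      if (n == 0) { print "FAIL: no data rows in source"; exit 1 }
      mean = sum / n
      gap = mean - claimed
      if (gap < 0) gap = -gap
      if (gap <= tol) {
        printf "PASS: mean %.2f matches claimed %s\n", mean, claimed
      } else {
        printf "FAIL: source mean %.2f, summary claims %s\n", mean, claimed
        exit 1
      }
    }
  ' "$csv"
}

# Demo: scores 90, 80, 70 give a mean of 80
demo_dir=$(mktemp -d)
printf 'id,name,score,grade\n1,ann,90,A\n2,bob,80,B\n3,cy,70,C\n' > "$demo_dir/raw.csv"
stat_matches_source "$demo_dir/raw.csv" 3 80 0.01          # PASS: mean 80.00 matches claimed 80
stat_matches_source "$demo_dir/raw.csv" 3 85 0.01 || true  # FAIL: source mean 80.00, summary claims 85
```

The FAIL line carries both expected and actual, so it can feed a structured disagreement directly.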

Step 5: Classify Trust Boundaries

Not everything needs verification. Over-verification its own cost → slows exec, complexity, false confidence. Classify outs by trust → focus where matters.

Boundary	Verification Required	Examples
Cross-agent handoff	Yes — always	Agent A produces data that Agent B consumes; team member passes deliverable to lead
External-facing output	Yes — always	Reports delivered to humans, deployed code, published packages, API responses
Compressed/summarized	Yes — sample-based	Any output that is smaller than its input by design (summaries, aggregations, extracts)
Internal intermediate	No — trust with checksums	Temporary files, intermediate computation results, internal state between steps
Idempotent operations	No — verify once	Config file writes, deterministic transforms, pure functions with known inputs

Apply proportionally:

Cross-agent: Full validation vs expected outcome (Step 3)
External-facing: Full validation + fidelity checks if summarized (Steps 3-4)
Internal intermediates: Checksums only (Step 2) → verify on demand if downstream fails
Idempotent ops: Verify on first exec, trust on repeat

Got: Each deliverable classified into trust boundary. Verification effort concentrated on cross-agent + external-facing.

If err: When in doubt, verify. Cost of false trust (accepting bad out) almost always > cost of unnecessary verification. Default verify, relax only w/ evidence boundary safe.
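
The boundary classification can be sketched as a lookup from boundary class to the steps to run. The class labels and step tokens below are illustrative names for the table rows, not identifiers defined by the skill. Unknown classes fall through to full verification, matching "when in doubt, verify".

```shell
# Hypothetical mapping from trust boundary class to verification steps.
verification_plan() {
  case "$1" in
    cross-agent)     echo "step3" ;;
    external-facing) echo "step3 step4-if-summarized" ;;
    compressed)      echo "step4-sample" ;;
    internal)        echo "step2-checksums" ;;
    idempotent)      echo "step3-first-run-only" ;;
    *)               echo "step3 step4" ;;   # unknown boundary: default to verify
  esac
}

verification_plan cross-agent   # prints: step3
verification_plan unclassified  # prints: step3 step4
```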

Step 6: Report Structured Disagreements on Fail

Verification fails → structured disagreement, not silently accept|reject. Structured = actionable → tells producer (or human) exactly what expected, received, gap.

Format:

verification_result: FAIL
deliverable: "output/data.csv"
timestamp: "2026-03-12T10:04:32Z"
failures:
  - check: "row_count"
    expected: 500
    actual: 487
    severity: warning
    note: "13 rows dropped — investigate filter logic"
  - check: "score_range"
    expected: "[0, 100]"
    actual: "[-3, 100]"
    severity: error
    note: "3 negative scores found — data validation missing"
  - check: "column_presence"
    expected: "grade"
    actual: null
    severity: error
    note: "grade column missing from output"
passes:
  - check: "file_exists"
  - check: "checksum_stable"
  - check: "test_suite"
recommendation: >
  Re-run with input validation enabled. The score_range and column_presence
  failures suggest the transform step is not handling edge cases. Do not
  patch the output — fix the transform and re-execute from source.

Principles:

Specific: "3 negative scores in rows 42, 187, 301" not "some values wrong"
Both expected + actual: Gap is what matters
Classify severity: error (blocks accept), warning (accept w/ caveat), info (noted)
Recommend action: Fix-and-rerun vs accept-w/-caveat vs reject
Never silently accept: Social trust ("other agent said it's fine") = attack vector. Trust evidence, not assertion.

Got: Every verification fail → structured disagreement w/ min: failed check, expected, actual, severity.

If err: Verification process itself fails (validation script errors out) → meta-failure. Inability to verify = finding → deliverable unverifiable in current form, worse than known fail.
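
Emitting failure entries in the format above, and separating "check failed" from "validator itself broke", can be sketched as below. emit_failure and run_validator are hypothetical names. The exit-code convention (0 pass, 1 check failed, anything else validator error) is an assumption, not something the skill defines. Values containing double quotes would need escaping before use as YAML.

```shell
# Hypothetical helper: print one failure entry in the Step 6 format.
emit_failure() {
  local check="$1" expected="$2" actual="$3" severity="$4"
  printf '  - check: "%s"\n    expected: "%s"\n    actual: "%s"\n    severity: %s\n' \
    "$check" "$expected" "$actual" "$severity"
}

# Run a validation command; a crash of the validator is reported as its own finding.
# Assumed convention: exit 0 = pass, exit 1 = check failed, other = validator broke.
run_validator() {
  local name="$1"
  shift
  local status=0
  "$@" >/dev/null 2>&1 || status=$?
  case "$status" in
    0) ;;
    1) emit_failure "$name" "pass" "fail" "error" ;;
    *) emit_failure "$name" "validator runs" "validator exited $status" "error" ;;
  esac
  return "$status"
}

# Demo: one entry with real expected/actual values, one meta-failure
echo "failures:"
emit_failure "row_count" "500" "487" "warning"
run_validator "score_range" sh -c 'exit 127' || true
```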

Check

Traps

Verifying out by asking producer: Agent can't reliably verify own work. "I checked, looks correct" ≠ verification. External anchors (tests, checksums, schemas) = verification. Fidelity can't be measured internally.
Over-verify internal intermediates: Verifying every temp file + intermediate adds overhead w/o reliability. Classify boundaries (Step 5) → focus on cross-agent + external.
Subjective expected outcomes: "Report should be high quality" not checkable. "Report contains Summary, Methodology, Results, all cited stats match computed vals from source" checkable. Can't write check → can't verify.
Post-hoc evidence reconstruction: Generating evidence after fact ("let me checksum what I think I produced") unreliable. Evidence must be side effect of exec, captured real time. Reconstructed proves only what exists now, not what was produced.
Verification as infallible: Verification itself can have bugs. Passing test suite ≠ code correct; it only shows code satisfies the tests. Keep verification proportional + acknowledge limits; don't treat green checks as absolute truth.
Silently accept partial passes: 9 of 10 pass → deliverable still fails. Report 1 fail as structured disagreement. Partial credit for grading; delivery binary.
Social trust as substitute: "Agent A reliable, skip verification" = attack vector. Trust w/o verification exploitable. Verify based on boundary, not producer reputation.
Wrong R binary on hybrid systems: WSL|Docker → Rscript may resolve to cross-platform wrapper, not native R. which Rscript && Rscript --version. Prefer native R binary (/usr/local/bin/Rscript Linux/WSL) for reliability. See Setting Up Your Environment for R path config.

→ Related skills

fail-early-pattern — complementary: fail-early catches bad input at start; verify-agent-output catches bad out at end
security-audit-codebase — overlapping: security audits verify code meets security expectations, specific case of deliverable validation
honesty-humility — complementary: honest agents acknowledge uncertainty → verification gaps visible
review-skill-format — verify-agent-output can validate produced SKILL.md meets format reqs, concrete instance of deliverable validation
create-team — teams coordinating multi agents benefit from structured handoff valid at each coord step
test-team-coordination — tests whether team handoffs produce verifiable deliverables, exercising this skill end to end

GitHub repository

pjt222/agent-almanac

Path: i18n/caveman-ultra/skills/verify-agent-output

agentsagentskillsai-assisted-developmentclaude-codeskillsteams

FAQ

What is the verify-agent-output skill?

verify-agent-output is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform verify-agent-output-related tasks without extra prompting.

How do I install verify-agent-output?

Use the install commands on this page: add verify-agent-output to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does verify-agent-output belong to?

verify-agent-output is in the Meta category, tagged ai, automation and design.

Is verify-agent-output free to use?

Yes. verify-agent-output is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
