test-team-coordination
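About
This skill runs test scenarios to validate and observe a team's coordination patterns and evaluates them against acceptance criteria. It generates a structured `RESULT.md` report for establishing baselines and comparing performance on the same workload. Use it to confirm that a team's coordination produces the expected behaviors during real work.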
Quick Install
Claude Code
Recommended: npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/test-team-coordination
Copy and paste one of these commands into Claude Code to install the skill.
Documentation
Test Team Coordination
Exec test scenario from tests/scenarios/teams/ vs target team. Observe coordination pattern behaviors, eval acceptance criteria, score rubric, produce RESULT.md in tests/results/.
Use When
- Validate team's coordination produces expected behaviors
- Run structured test after modifying team def | agent
- Compare patterns by running same scenario w/ diff teams
- Establish baseline perf metrics for team composition
- Regression tests after adding agents | changing membership
In
- Required: Path to test scenario file (e.g. tests/scenarios/teams/test-opaque-team-cartographers-audit.md)
- Optional: Run ID override (default: YYYY-MM-DD-<target>-NNN auto)
- Optional: Team size override (default: from scenario frontmatter)
- Optional: Skip scope change (default: false — inject if defined)
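The auto run ID can be derived by scanning tests/results/ for earlier runs of the same target on the same date. A minimal sketch: `next_run_id` is a hypothetical helper, and the zero-padded three-digit sequence is an assumption read off the `NNN` placeholder.

```python
import re
from datetime import date
from pathlib import Path


def next_run_id(results_dir: Path, target: str, today: date) -> str:
    """Return YYYY-MM-DD-<target>-NNN using the next free sequence number."""
    prefix = f"{today.isoformat()}-{target}-"
    # Assumption: NNN is a zero-padded 3-digit counter per target per day.
    pattern = re.compile(re.escape(prefix) + r"(\d{3})$")
    used = []
    if results_dir.is_dir():
        for entry in results_dir.iterdir():
            match = pattern.match(entry.name)
            if entry.is_dir() and match:
                used.append(int(match.group(1)))
    return f"{prefix}{max(used, default=0) + 1:03d}"
```

An explicit Run ID override, when given, would simply bypass this helper.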
Do
Step 1: Load + Validate Scenario
1.1. Read scenario file specified in input.
1.2. Parse YAML frontmatter + extract:
- target: team to test
- coordination-pattern: expected pattern
- team-size: # members to spawn
- Acceptance criteria table
- Scoring rubric (if present)
- Ground truth data (if present)
1.3. Verify file has all req sections:
- Objective
- Pre-conditions
- Task (w/ Primary Task subsection)
- Expected Behaviors
- Acceptance Criteria
- Observation Protocol
Got: Scenario loads, parses, has all req sections.
If err: Missing | unparseable → abort w/ err msg ID'ing missing/malformed. Optional sections (Rubric, Ground Truth, Variants) absent → note + continue.
Step 2: Verify Pre-conditions
2.1. Walk through each pre-condition checkbox.
2.2. File-existence → use Glob.
2.3. Registry count → parse _registry.yml + cmp total_* vs actual file counts.
2.4. Branch/git → git status --porcelain + git branch --show-current.
Got: All pre-conditions satisfied.
If err: Pre-condition fails → record BLOCKED. Decide: proceed (soft) | abort (hard like missing target team file). Doc decision.
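The git and registry checks in 2.3 and 2.4 might look like the following. A sketch only: `git_state` and `count_mismatches` are hypothetical helpers, and the registry totals are assumed to arrive as an already-parsed mapping of `total_*` keys to integers.

```python
import subprocess
from pathlib import Path


def git_state(repo: Path) -> tuple[str, bool]:
    """Return (current branch, True if the working tree is clean)."""
    def run(*args: str) -> str:
        return subprocess.run(
            ["git", *args], cwd=repo, check=True, capture_output=True, text=True
        ).stdout

    branch = run("branch", "--show-current").strip()
    # Empty porcelain output means no staged, unstaged, or untracked changes.
    clean = run("status", "--porcelain").strip() == ""
    return branch, clean


def count_mismatches(
    registry_totals: dict[str, int], actual: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Map each total_* key to (registry value, on-disk count) where they differ."""
    return {
        key: (expected, actual.get(key, 0))
        for key, expected in registry_totals.items()
        if actual.get(key, 0) != expected
    }
```

Any mismatch or dirty tree would be recorded as BLOCKED for that pre-condition, with the proceed/abort decision documented as the step describes.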
Step 3: Load Coordination Pattern Criteria
3.1. Read tests/_registry.yml + locate coordination_patterns matching scenario's coordination-pattern.
3.2. Extract key_behaviors list.
3.3. Behaviors = observation checklist — each watched during exec + recorded observed/not.
Got: Pattern key behaviors loaded for observation.
If err: Pattern not in registry → use scenario's Expected Behaviors as sole source. Log warning.
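The lookup and its fallback can be sketched as below. The registry shape is an assumption (a `coordination_patterns` mapping from pattern name to an entry with a `key_behaviors` list), and `observation_checklist` is a hypothetical name.

```python
from typing import Optional


def observation_checklist(
    registry: dict, pattern: str, expected_behaviors: list[str]
) -> tuple[list[str], Optional[str]]:
    """Return (behaviors to watch, warning or None)."""
    # Assumption: coordination_patterns is keyed by pattern name.
    patterns = registry.get("coordination_patterns") or {}
    entry = patterns.get(pattern)
    if entry and entry.get("key_behaviors"):
        return list(entry["key_behaviors"]), None
    warning = (
        f"pattern '{pattern}' not in registry; "
        "using scenario Expected Behaviors as sole source"
    )
    return list(expected_behaviors), warning
```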
Step 4: Execute Task
4.1. Create result dir: tests/results/YYYY-MM-DD-<target>-NNN/.
4.2. Record T0 (task start).
4.3. Read target team def from teams/<target>.md, extract CONFIG block, activate: call TeamCreate w/ team name, spawn teammates per subagent_type, create tasks from CONFIG tasks list. Use team-size from scenario. Pass Primary Task verbatim from scenario's Task section.
4.4. Observe team's exec phases. Record:
- T1: Form assessment / decomposition complete
- T2: Role assignments visible
4.5. If scenario defines Scope Change Trigger + skip-scope-change is false:
- Wait until Phase 2 (role assignment) visible
- Record T3 (scope change injection)
- Send scope change prompt via SendMessage
- Record T4 (scope change absorbed: role adjustment visible)
4.6. Continue observing until output:
- T5 (integration begins)
- T6 (final report delivered)
4.7. Capture team's complete output.
Got: Team executes through coordination phases. Timestamps for all transitions. Scope change (if applicable) injected + absorbed.
If err: Team fails to produce output → record fail point + err msgs. Stalls → note last phase + timeout. Proceed to eval w/ partial.
Step 5: Evaluate Pattern Behaviors
5.1. Per key behavior from Step 3, determine observed during exec:
- Observed: Clear evidence in output | coordination
- Partial: Some evidence but incomplete | ambiguous
- Not observed: No evidence
5.2. Per task-specific behavior from scenario's Expected Behaviors, same eval.
5.3. Record findings in observation log.
Got: All/most pattern + task behaviors observed.
If err: Unobserved = findings, not test fails. Record accurately: pattern didn't fully manifest.
Step 6: Evaluate Acceptance Criteria
6.1. Walk each acceptance criterion.
6.2. Per criterion, determination:
- PASS: Clearly met w/ observable evidence
- PARTIAL: Partially met (counts toward threshold at 0.5 weight)
- FAIL: Not met despite opportunity
- BLOCKED: Couldn't eval (pre-condition fail, timeout)
6.3. If scenario has Ground Truth → verify findings vs ground truth:
- Calc accuracy % per category
- Flag false positives / false negatives
6.4. If scenario has Scoring Rubric → score each dim 1-5 w/ brief justification.
6.5. Calc summary metrics:
- Acceptance: X/N criteria passed (PARTIAL = 0.5)
- Threshold: PASS if ≥ scenario threshold
- Rubric total: X/Y points (if applicable)
Got: All criteria have determination. Summary metrics calc'd.
If err: < half criteria evaluable (too many BLOCKED) → inconclusive. Doc why + recommend re-run after fixing pre-conditions.
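The scoring arithmetic in Step 6 can be sketched as follows. Assumptions, not confirmed by the scenario format: the threshold is expressed as a criteria count, BLOCKED criteria stay in N but score zero, and "accuracy" means the share of ground-truth items the team found. `summarize` and `ground_truth_check` are hypothetical names.

```python
WEIGHTS = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0, "BLOCKED": 0.0}


def summarize(determinations: dict[str, str], threshold: float) -> dict:
    """Score criteria (PARTIAL = 0.5) and derive the verdict."""
    unknown = set(determinations.values()) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown determinations: {sorted(unknown)}")
    total = len(determinations)
    blocked = sum(1 for d in determinations.values() if d == "BLOCKED")
    score = sum(WEIGHTS[d] for d in determinations.values())
    if (total - blocked) * 2 < total:
        # Fewer than half the criteria could be evaluated.
        verdict = "INCONCLUSIVE"
    else:
        verdict = "PASS" if score >= threshold else "FAIL"
    return {"score": score, "total": total, "blocked": blocked, "verdict": verdict}


def ground_truth_check(found: set[str], truth: set[str]) -> dict:
    """Compare findings with ground truth for one category."""
    hits = found & truth
    return {
        "accuracy_pct": 100.0 * len(hits) / len(truth) if truth else 100.0,
        "false_positives": sorted(found - truth),
        "false_negatives": sorted(truth - found),
    }
```

For example, two PASS, one PARTIAL, and one FAIL against a threshold of 2.5 gives a score of 2.5 out of 4 and a PASS verdict.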
Step 7: Generate RESULT.md
7.1. Create tests/results/YYYY-MM-DD-<target>-NNN/RESULT.md using Recording Template from scenario's Observation Protocol.
7.2. Populate all sections:
- Run metadata (observer, timestamps, duration)
- Phase log w/ all timestamps
- Role emergence log (adaptive/team tests)
- Acceptance criteria results table
- Rubric scores table (if applicable)
- Ground truth verification table (if applicable)
- Key observations (narrative)
- Lessons learned
7.3. Include team's raw output as appendix | separate file (team-output.md) in same dir.
7.4. Add summary verdict at top:
**Verdict**: PASS | FAIL | INCONCLUSIVE
**Score**: X/N criteria (Y/Z rubric points)
**Duration**: Xm
Got: Complete RESULT.md w/ all sections + clear verdict.
If err: Result file can't be written → output to stdout fallback. Eval data never lost.
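The verdict header and the stdout fallback might be implemented roughly like this. A sketch with hypothetical helper names; the rest of RESULT.md would come from the scenario's Recording Template, which is not reproduced here.

```python
import sys
from pathlib import Path


def verdict_header(verdict, passed, total, duration_min, rubric=None) -> str:
    """Build the three summary lines placed at the top of RESULT.md."""
    score = f"{passed:g}/{total} criteria"
    if rubric is not None:
        score += f" ({rubric[0]}/{rubric[1]} rubric points)"
    return (
        f"**Verdict**: {verdict}\n"
        f"**Score**: {score}\n"
        f"**Duration**: {duration_min}m\n"
    )


def write_result(run_dir: Path, content: str) -> bool:
    """Write RESULT.md; on failure, dump to stdout so eval data is not lost."""
    try:
        run_dir.mkdir(parents=True, exist_ok=True)
        (run_dir / "RESULT.md").write_text(content, encoding="utf-8")
        return True
    except OSError:
        sys.stdout.write(content)
        return False
```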
Check
- Scenario loaded + all req sections present
- Pre-conditions verified (or BLOCKED)
- Pattern key behaviors loaded from registry
- Team spawned + task delivered
- Scope change injected at right time (if applicable)
- All pattern behaviors evaluated (observed/partial/not)
- All criteria have determination (PASS/PARTIAL/FAIL/BLOCKED)
- Ground truth verified (if applicable)
- RESULT.md generated w/ all sections
- Summary verdict calc'd + recorded
Traps
- Eval output quality vs coordination: Tests how team coordinates, not whether output perfect. Team coordinating well but finding 7/9 broken refs still demonstrates pattern.
- Inject scope change too early: Wait until role assignment clearly visible. Too early → team hasn't differentiated, nothing to adapt.
- Conflate member output w/ team output: Opaque team should present unified output. Individual member reports = finding about opacity, not test infra problem.
- Exact ground truth matching: Ground truth counts approximate. Eval right ballpark, not exact match.
- Forget timestamps: Essential for phase durations + adaptation speed. Set as events happen, not retroactively.
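The "right ballpark" rule for ground truth counts can be made explicit with a relative tolerance. This is a sketch: the 25% default is an arbitrary assumption, not a value from the skill, and `within_ballpark` is a hypothetical helper.

```python
def within_ballpark(found: int, expected: int, tolerance: float = 0.25) -> bool:
    """True if found is within a relative tolerance of the expected count."""
    if expected == 0:
        return found == 0
    # Assumption: tolerance is a fraction of the expected count.
    return abs(found - expected) / expected <= tolerance
```

Under this default, finding 7 of an expected 9 counts as in range, while 4 of 9 does not.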
→
- review-codebase: deep codebase review complementing team-level testing
- review-skill-format: validates individual skill format (this validates team coordination)
- create-team: creates defs this tests
- evolve-team: evolves defs based on test findings
- test-a2a-interop: similar testing pattern for A2A protocol conformance
- assess-form: morphic assessment opaque team lead uses internally
