evolve-skill-from-traces
À propos
Cette compétence améliore automatiquement la documentation SKILL.md en analysant les traces d'exécution de l'agent. Elle utilise un pipeline multi-agent pour proposer des modifications à partir des succès et des erreurs, puis les fusionne en une mise à jour sans conflits. Les développeurs doivent l'utiliser pour affiner itérativement les compétences en se basant sur les schémas d'utilisation réels.
Installation rapide
Claude Code
Recommandénpx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/evolve-skill-from-tracesCopiez et collez cette commande dans Claude Code pour installer cette compétence
Documentation
Evolve a Skill from Execution Traces
Turn raw agent execution traces into checked SKILL.md through three-stage pipeline: trajectory grab, parallel multi-agent patch propose, and conflict-free consolidate. This skill bridges gap between watched agent behavior and logged procedures, turning success runs into repeatable skills.
When Use
- Execution traces show repeat patterns not caught in existing skills
- Watched agent behavior beats logged procedure
- Building skills from scratch by recording expert demos
- Many agents propose clashing fixes to same skill
Inputs
- Required:
traces-- set of agent session logs or session transcripts (min 10 success runs advised) - Required:
target_skill-- path to existing SKILL.md to evolve, or"new"for skill pull from scratch - Optional:
analyst_count-- count of parallel analyst agents to spawn (default: 4) - Optional:
held_out_ratio-- share of traces held for check, not used in draft (default: 0.2)
Steps
Step 1: Collect Execution Traces
Grab agent session logs, tool-call sequences, or conversation transcripts that show target behavior. Filter for runs tagged success. Normalize into standard trace format: sequence of (state, action, outcome) triples with timestamps.
- Spot trace source: session logs, tool-call history, or conversation exports
- Filter traces by success criteria (exit code 0, task done flag, user confirm)
- Normalize each trace into list of structured triples:
trace_entry:
state: <context before the action>
action: <tool call, command, or decision made>
outcome: <result, output, or state change>
timestamp: <ISO 8601>
- Split traces: hold
held_out_ratio(default 20%) for check in Step 7, use rest for Steps 2-6
# Example: count available traces and compute partition
total_traces=$(ls traces/*.json | wc -l)
held_out=$(echo "$total_traces * 0.2 / 1" | bc)
drafting=$((total_traces - held_out))
echo "Drafting: $drafting traces, Held-out: $held_out traces"
Got: Normalized trace set split into drafting (80%) and held-out (20%) subsets. Each trace entry has state, action, outcome, timestamp fields.
If fail: Fewer than 10 success traces? Grab more before go on. Small trace sets make overfit skills that fail on new inputs. Traces lack timestamps? Give ordinal sequence numbers instead.
Step 2: Cluster Trajectories
Group normalized traces by outcome pattern. Spot invariant core (steps in all success trajectories) vs variant branches (steps that differ across runs). Invariant core becomes skeleton for skill proc.
- Align traces by action type -- map each trace to sequence of action labels
- Find longest common subsequence across all traces to spot invariant core
- Sort other actions as variant branches, note which traces have them and under what cond
- Record branch frequency: what percent of success traces have each variant step
invariant_core:
- action: "read_input_file"
frequency: 100%
- action: "validate_schema"
frequency: 100%
- action: "transform_data"
frequency: 100%
variant_branches:
- action: "retry_on_timeout"
frequency: 35%
condition: "network latency > 2s"
- action: "fallback_to_cache"
frequency: 15%
condition: "API returns 503"
Got: Clear split between invariant core actions (in all success traces) and variant branches (cond, in subset). Each variant branch has frequency count and trigger cond.
If fail: No invariant core shows up (traces too different)? Target behavior maybe many distinct skills. Split traces into coherent subgroups by outcome type and handle each group apart.
Step 3: Draft Skill Skeleton
From invariant core, make first SKILL.md with frontmatter, When to Use (from entry conds across traces), Inputs (params that varied across runs), and Procedure section with one step per invariant action.
- Pull entry conds from first state of each trace to fill When to Use
- Spot params that varied across runs (file paths, thresholds, options) to fill Inputs
- Make one proc step per invariant core action, using most common phrasing across traces
- Add placeholder Expected/On failure blocks based on watched outcomes
# Scaffold the skeleton if creating a new skill
mkdir -p skills/<skill-name>/
# Skeleton structure
## When to Use
- <derived from common entry conditions>
## Inputs
- **Required**: <parameters present in all traces>
- **Optional**: <parameters present in some traces>
## Procedure
### Step N: <invariant action label>
<most common implementation from traces>
**Expected:** <most common success outcome>
**On failure:** <placeholder -- refined in Steps 4-6>
Got: Syntactically valid SKILL.md skeleton with frontmatter, When to Use, Inputs, Procedure section having one step per invariant core action. Expected blocks show watched outcomes; On failure blocks are placeholders.
If fail: Skeleton over 500 lines before adding variant branches? Invariant core too fine. Merge adjacent actions that always occur together into one step. Target 5-10 proc steps.
Step 4: Parallel Multi-Agent Patch Proposal
Spawn N analyst agents (advise 4-6), each reviewing full trace set vs draft skeleton from different analytical lens. Each agent makes structured patch: section, old text, new text, reason.
Give one lens per analyst:
| Analyst | Lens | Focus |
|---|---|---|
| 1 | Correctness | Does the skeleton capture all success paths? Are any invariant steps missing? |
| 2 | Efficiency | Are there redundant steps? Can any steps be merged or parallelized? |
| 3 | Robustness | Which failure modes are unhandled? What should On failure blocks contain? |
| 4 | Edge Cases | Which variant branches should become conditional steps or pitfalls? |
| 5 (optional) | Clarity | Is each step unambiguous? Can an agent follow it mechanically? |
| 6 (optional) | Generalizability | Are there trace-specific artifacts that should be abstracted? |
Each analyst agent gets:
- Draft skeleton from Step 3
- Full drafting trace set (not held-out)
- Their lens and focus questions
Each analyst returns list of structured patches:
patch:
analyst: "robustness"
section: "Procedure > Step 3"
old_text: "**On failure:** <placeholder>"
new_text: "**On failure:** If the API returns 503, wait 5 seconds and retry up to 3 times. If retries are exhausted, fall back to the cached response from the previous successful run."
rationale: "Traces #4, #7, #12 show 503 errors resolved by retry. Trace #15 shows cache fallback when retries fail."
supporting_traces: [4, 7, 12, 15]
Got: Each analyst returns 3-10 structured patches with section refs, old/new text, reason, support trace IDs. All patches collected into single patch set.
If fail: Analyst returns no patches? Their lens maybe not apply to this skill. This is OK -- not every lens shows issues. Analyst returns vague patches with no trace refs? Reject and re-prompt with rule for concrete supporting_traces.
Step 5: Detect and Classify Conflicts
Compare all patches from Step 4 for overlap edits. Sort each pair of overlap patches into one of three categories.
- Index patches by target section
- For patches on same section, compare old_text and new_text
- Sort each overlap:
| Conflict Type | Definition | Resolution |
|---|---|---|
| Compatible | Different sections, no overlap | Merge directly |
| Complementary | Same section, additive (both add content, no contradiction) | Combine text |
| Contradictory | Same section, mutually exclusive (one adds X, other removes X or adds Y instead) | Needs resolution in Step 6 |
conflict_report:
total_patches: 24
compatible: 18
complementary: 4
contradictory: 2
contradictions:
- section: "Procedure > Step 5"
patch_a: {analyst: "efficiency", action: "remove step"}
patch_b: {analyst: "robustness", action: "add retry logic"}
supporting_traces_a: [2, 8, 11]
supporting_traces_b: [4, 7, 12, 15]
Got: Conflict report listing all patch pairs, their sort, and for contradictions, support trace counts for each side.
If fail: Sort vague (patch both adds and changes text in same section)? Split into two patches: one additive, one modifying. Re-sort smaller patches.
Step 6: Consolidate Patches
Merge all patches into one consolidated SKILL.md with three-tier resolve strategy.
- Compatible patches: Apply direct -- these touch different sections and cannot clash
- Complementary patches: Combine new_text from both patches into one coherent block, keeping both contributions
- Contradictory patches: Resolve with prevalence-weighting:
- Count how many traces back each variant
- Prefer patch matched with more traces
- Tied (or within 10% of each other)? Use
argumentationskill to check which patch better serves skill's stated purpose - Log rejected alternative as Common Pitfall or note in right On failure block
consolidation_log:
applied_directly: 18
combined: 4
resolved_by_prevalence: 1
resolved_by_argumentation: 1
rejected_alternatives_documented: 2
After consolidate, check the SKILL.md:
- All sections present (When to Use, Inputs, Procedure, Validation, Common Pitfalls, Related Skills)
- Every proc step has Expected and On failure
- No dup or clashing instructions left
- Line count within 500-line limit
Got: Single consolidated SKILL.md holding patches from all analysts. Clashes resolved with logged reason. Rejected alternative for each clash shows as pitfall or note.
If fail: Consolidate gives internally clashing doc (e.g., Step 3 assumes file exists but Step 2 was removed by efficiency patch)? Revert clashing edit and keep original skeleton text for that section. Flag clash for manual review.
Step 7: Validate and Register
Run consolidated skill mentally vs held-out traces (20% held in Step 1). Check Expected/On failure blocks match watched outcomes in traces skill never seen.
- For each held-out trace, walk through skill proc step by step
- At each step, compare skill Expected outcome vs trace real outcome
- Record matches and mismatches:
validation_results:
held_out_traces: 5
full_match: 4
partial_match: 1
no_match: 0
mismatches:
- trace_id: 23
step: 4
expected: "API returns 200"
actual: "API returns 429 (rate limited)"
action: "Add rate-limit handling to On failure block"
- Mismatch rate over 20%? Go back to Step 4 with mismatched traces added to drafting set
- If skill is new, follow
create-skillfor directory make, registry entry, and symlink setup - If evolving existing skill, follow
evolve-skillfor version bump and translation sync
# Final validation: line count
lines=$(wc -l < skills/<skill-name>/SKILL.md)
[ "$lines" -le 500 ] && echo "OK ($lines lines)" || echo "FAIL: $lines lines > 500"
Got: At least 80% of held-out traces match skill proc end-to-end. Skill registered in skills/_registry.yml with right meta.
If fail: Check fails (>20% mismatch)? Skill overfit to drafting traces. Add mismatched traces to drafting set and re-run from Step 2. Check keeps fail after two rounds? Behavior maybe too variable for single skill -- think split into many skills by outcome type.
Validation
- At least 10 success traces grabbed before draft
- Traces split into drafting (80%) and held-out (20%) subsets
- Invariant core and variant branches logged clear
- At least 4 analyst agents reviewed skeleton from distinct lenses
- All patch clashes sorted (compatible, complementary, contradictory)
- Contradictory patches resolved with logged reason
- Consolidated SKILL.md has all required sections with Expected/On failure pairs
- Held-out check hits at least 80% match rate
- Line count within 500-line limit
- Skill registered (new) or version-bumped (existing) per standard procs
Pitfalls
- Too few traces: With fewer than 10 success runs, pattern pull unreliable. Invariant core may hold slip steps, and variant branches will lack enough frequency data. Grab more traces before start.
- Overfit to trace artifacts: Tool-specific behaviors (e.g., particular API client retry pattern) may not generalize. During Step 3, abstract tool-specific actions into tool-agnostic desc. Skill should say what to do, not which tool to use.
- Ignore failure traces: Failure traces show what skill should warn about in On failure blocks. During Step 1, also grab failed runs and tag them. Use them in Step 4 when robustness analyst checks unhandled failure modes.
- Single-lens analysis: Using only 1-2 analysts misses key views. Efficiency analyst alone will strip away safety checks that robustness analyst would keep. Use at least 4 distinct lenses for balance.
- Merge clashing patches without resolve: Applying both sides of clash gives internally clashing skill (e.g., "do X" in one step and "skip X" in other). Always sort and resolve clashes clear in Step 6.
- Not checking vs held-out traces: With no held-out check, consolidated skill may fit drafting traces perfect but fail on new runs. Always hold 20% of traces and test final skill vs them.
See Also
evolve-skill-- simpler human-led evolution (complement: use when traces not open)create-skill-- for fresh-pulled skills not yet exist; used in Step 7 for registerreview-skill-format-- check after consolidate to ensure agentskills.io complianceargumentation-- used in Step 6 for resolving clashing patches when prevalence tiedverify-agent-output-- evidence trails for patch proposals; checks analyst outputs in Step 4
Dépôt GitHub
Compétences associées
content-collections
MétaCette compétence propose une configuration éprouvée en production pour Content Collections, un outil axé sur TypeScript qui transforme des fichiers Markdown/MDX en collections de données typées de manière sûre avec une validation Zod. Utilisez-la lors de la création de blogs, de sites de documentation ou d'applications Vite + React riches en contenu pour garantir la sécurité de typage et la validation automatique du contenu. Elle couvre tout, de la configuration du plugin Vite et de la compilation MDX à l'optimisation des déploiements et la validation des schémas.
polymarket
MétaCette compétence permet aux développeurs de créer des applications avec la plateforme de marchés prédictifs Polymarket, incluant l'intégration d'API pour le trading et les données de marché. Elle fournit également une diffusion de données en temps réel via WebSocket pour surveiller les transactions en direct et l'activité du marché. Utilisez-la pour mettre en œuvre des stratégies de trading ou pour créer des outils traitant les mises à jour de marché en direct.
creating-opencode-plugins
MétaCette compétence aide les développeurs à créer des plugins OpenCode qui s'interconnectent avec plus de 25 types d'événements tels que les commandes, les fichiers et les opérations LSP. Elle fournit la structure du plugin, les spécifications de l'API événementielle et les modèles d'implémentation pour les modules JavaScript/TypeScript. Utilisez-la lorsque vous avez besoin d'intercepter, de surveiller ou d'étendre le cycle de vie de l'assistant IA OpenCode avec une logique personnalisée pilotée par les événements.
sglang
MétaSGLang est un framework de service LLM haute performance spécialisé dans la génération rapide et structurée pour les workflows JSON, regex et agentiques grâce à son cache de préfixe RadixAttention. Il offre une inférence nettement plus rapide, particulièrement pour les tâches avec des préfixes répétés, ce qui le rend idéal pour les sorties complexes et structurées ainsi que les conversations multi-tours. Choisissez SGLang plutôt que des alternatives comme vLLM lorsque vous avez besoin d'un décodage contraint ou que vous construisez des applications avec un partage étendu de préfixes.
