evolve-skill-from-traces
About
This skill automatically improves SKILL.md documentation by analyzing agent execution traces. It uses a multi-agent pipeline to propose edits from successes and errors, then merges them into a conflict-free update. Developers should use it to iteratively refine skills based on real usage patterns.
Quick Install
Claude Code
Recommended: npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/evolve-skill-from-traces
Copy and paste this command in Claude Code to install the skill.
Skill Documentation
Evolve a Skill from Execution Traces
Turn raw agent execution traces into a validated SKILL.md through a three-stage pipeline: trajectory collection, parallel multi-agent patch proposal, and conflict-free consolidation. This skill bridges the gap between observed agent behavior and documented procedures, turning successful runs into repeatable skills.
When to Use
- Execution traces show recurring patterns not captured in existing skills
- Observed agent behavior outperforms the documented procedure
- Building skills from scratch by recording expert demonstrations
- Multiple agents propose conflicting fixes to the same skill
Inputs
- Required: traces -- set of agent session logs or session transcripts (at least 10 successful runs recommended)
- Required: target_skill -- path to the existing SKILL.md to evolve, or "new" to derive a skill from scratch
- Optional: analyst_count -- number of parallel analyst agents to spawn (default: 4)
- Optional: held_out_ratio -- share of traces reserved for validation and excluded from drafting (default: 0.2)
Steps
Step 1: Collect Execution Traces
Gather agent session logs, tool-call sequences, or conversation transcripts that demonstrate the target behavior. Filter for runs tagged as successful. Normalize them into a standard trace format: a sequence of (state, action, outcome) triples with timestamps.
- Identify the trace source: session logs, tool-call history, or conversation exports
- Filter traces by success criteria (exit code 0, task-done flag, user confirmation)
- Normalize each trace into a list of structured triples:
trace_entry:
state: <context before the action>
action: <tool call, command, or decision made>
outcome: <result, output, or state change>
timestamp: <ISO 8601>
- Split the traces: hold out held_out_ratio (default 20%) for validation in Step 7, and use the rest for Steps 2-6
# Example: count available traces and compute partition
total_traces=$(ls traces/*.json | wc -l)
held_out=$(echo "$total_traces * 0.2 / 1" | bc)
drafting=$((total_traces - held_out))
echo "Drafting: $drafting traces, Held-out: $held_out traces"
Expected: A normalized trace set split into drafting (80%) and held-out (20%) subsets. Each trace entry has state, action, outcome, and timestamp fields.
On failure: If there are fewer than 10 successful traces, collect more before continuing. Small trace sets produce overfit skills that fail on new inputs. If traces lack timestamps, assign ordinal sequence numbers instead.
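The normalization and split described in this step can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the raw record keys, the `TraceEntry` and `split_traces` names, and the fixed shuffle seed are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceEntry:
    state: str
    action: str
    outcome: str
    timestamp: Optional[str]  # ISO 8601 string, or None if the log has none
    seq: int                  # ordinal fallback used when timestamps are missing


def normalize(raw_steps: list[dict]) -> list[TraceEntry]:
    """Map raw log records (hypothetical keys) onto (state, action, outcome) triples."""
    return [
        TraceEntry(
            state=step.get("state", ""),
            action=step.get("action", ""),
            outcome=step.get("outcome", ""),
            timestamp=step.get("timestamp"),
            seq=i,
        )
        for i, step in enumerate(raw_steps)
    ]


def split_traces(trace_ids: list[str], held_out_ratio: float = 0.2, seed: int = 0):
    """Shuffle deterministically, then reserve held_out_ratio of the traces."""
    if len(trace_ids) < 10:
        raise ValueError("fewer than 10 successful traces; collect more first")
    ids = sorted(trace_ids)
    random.Random(seed).shuffle(ids)
    n_held = int(len(ids) * held_out_ratio)  # truncates toward zero
    return ids[n_held:], ids[:n_held]


drafting, held_out = split_traces([f"trace-{i:02d}" for i in range(25)])
print(f"Drafting: {len(drafting)} traces, Held-out: {len(held_out)} traces")
```

Shuffling before the split keeps the held-out subset from being biased toward the newest or oldest runs, which matters if the agent's behavior drifted over time.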
Step 2: Cluster Trajectories
Group the normalized traces by outcome pattern. Identify the invariant core (steps present in all successful trajectories) versus variant branches (steps that differ across runs). The invariant core becomes the skeleton of the skill procedure.
- Align traces by action type -- map each trace to a sequence of action labels
- Find the longest common subsequence across all traces to identify the invariant core
- Classify the remaining actions as variant branches, noting which traces contain them and under what conditions
- Record branch frequency: the percentage of successful traces that contain each variant step
invariant_core:
- action: "read_input_file"
frequency: 100%
- action: "validate_schema"
frequency: 100%
- action: "transform_data"
frequency: 100%
variant_branches:
- action: "retry_on_timeout"
frequency: 35%
condition: "network latency > 2s"
- action: "fallback_to_cache"
frequency: 15%
condition: "API returns 503"
Expected: A clear split between invariant core actions (present in all successful traces) and variant branches (conditional, present in a subset). Each variant branch has a frequency and a trigger condition.
On failure: If no invariant core emerges (the traces are too different), the target behavior may span several distinct skills. Split the traces into coherent subgroups by outcome type and handle each group separately.
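One way to approximate the invariant core is to fold a pairwise longest-common-subsequence over the action-label sequences. The sketch below assumes traces are already reduced to lists of action labels. Folding pairwise is a heuristic: it always yields a common subsequence of all traces, but not necessarily the longest one. The `cluster` helper and the sample traces are illustrative.

```python
from collections import Counter
from functools import reduce


def lcs(a: list[str], b: list[str]) -> list[str]:
    """Dynamic-programming longest common subsequence of two label lists."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def cluster(traces: list[list[str]]):
    """Return (invariant core, {variant action: share of traces containing it})."""
    core = reduce(lcs, traces)
    core_set = set(core)
    counts = Counter(act for t in traces for act in set(t) if act not in core_set)
    return core, {act: c / len(traces) for act, c in counts.items()}


traces = [
    ["read_input_file", "validate_schema", "transform_data"],
    ["read_input_file", "retry_on_timeout", "validate_schema", "transform_data"],
    ["read_input_file", "validate_schema", "fallback_to_cache", "transform_data"],
    ["read_input_file", "validate_schema", "transform_data"],
]
core, variants = cluster(traces)
print(core)
print(variants)
```

An empty `core` corresponds to the failure case above: the traces share no ordered steps and probably belong to different skills.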
Step 3: Draft Skill Skeleton
From the invariant core, generate an initial SKILL.md with frontmatter, When to Use (from entry conditions across traces), Inputs (parameters that varied across runs), and a Procedure section with one step per invariant action.
- Extract entry conditions from the first state of each trace to fill When to Use
- Identify parameters that varied across runs (file paths, thresholds, options) to fill Inputs
- Create one procedure step per invariant core action, using the most common phrasing across traces
- Add placeholder Expected/On failure blocks based on observed outcomes
# Scaffold the skeleton if creating a new skill
mkdir -p skills/<skill-name>/
# Skeleton structure
## When to Use
- <derived from common entry conditions>
## Inputs
- **Required**: <parameters present in all traces>
- **Optional**: <parameters present in some traces>
## Procedure
### Step N: <invariant action label>
<most common implementation from traces>
**Expected:** <most common success outcome>
**On failure:** <placeholder -- refined in Steps 4-6>
Expected: A syntactically valid SKILL.md skeleton with frontmatter, When to Use, Inputs, and a Procedure section with one step per invariant core action. Expected blocks reflect observed outcomes; On failure blocks are placeholders.
On failure: If the skeleton exceeds 500 lines before variant branches are added, the invariant core is too fine-grained. Merge adjacent actions that always occur together into one step. Target 5-10 procedure steps.
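Rendering the skeleton from the invariant core is mechanical enough to sketch. The example below is illustrative only: `render_skeleton` and `check_skeleton` are hypothetical helpers, and the single `name` frontmatter field is an assumption, since this document does not list the required frontmatter fields.

```python
def render_skeleton(name, entry_conditions, required, optional, core_actions):
    """Build SKILL.md text with one procedure step per invariant core action."""
    # Frontmatter content is an assumption; real skills may require more fields.
    lines = ["---", f"name: {name}", "---", "", "## When to Use"]
    lines += [f"- {c}" for c in entry_conditions]
    lines += ["", "## Inputs"]
    lines += [f"- **Required**: {p}" for p in required]
    lines += [f"- **Optional**: {p}" for p in optional]
    lines += ["", "## Procedure"]
    for n, action in enumerate(core_actions, start=1):
        lines += [
            "",
            f"### Step {n}: {action}",
            "**Expected:** <most common success outcome>",
            "**On failure:** <placeholder -- refined in Steps 4-6>",
        ]
    return "\n".join(lines) + "\n"


def check_skeleton(text, core_actions, max_lines=500):
    """Return warnings for the size problems this step calls out."""
    warnings = []
    if len(text.splitlines()) > max_lines:
        warnings.append("skeleton exceeds line limit; core is too fine-grained")
    if not 5 <= len(core_actions) <= 10:
        warnings.append("step count outside the 5-10 target")
    return warnings


core_actions = ["read_input_file", "validate_schema", "transform_data"]
skeleton = render_skeleton(
    "example-skill",
    ["An input file needs to be transformed"],
    ["input_path"],
    ["timeout_seconds"],
    core_actions,
)
print(skeleton)
print(check_skeleton(skeleton, core_actions))
```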
Step 4: Parallel Multi-Agent Patch Proposal
Spawn N analyst agents (4-6 recommended), each reviewing the drafting trace set against the draft skeleton through a different analytical lens. Each agent produces structured patches: section, old text, new text, rationale.
Assign one lens per analyst:
| Analyst | Lens | Focus |
|---|---|---|
| 1 | Correctness | Does the skeleton capture all success paths? Are any invariant steps missing? |
| 2 | Efficiency | Are there redundant steps? Can any steps be merged or parallelized? |
| 3 | Robustness | Which failure modes are unhandled? What should On failure blocks contain? |
| 4 | Edge Cases | Which variant branches should become conditional steps or pitfalls? |
| 5 (optional) | Clarity | Is each step unambiguous? Can an agent follow it mechanically? |
| 6 (optional) | Generalizability | Are there trace-specific artifacts that should be abstracted? |
Each analyst agent receives:
- The draft skeleton from Step 3
- The full drafting trace set (not the held-out set)
- Its lens and focus questions
Each analyst returns a list of structured patches:
patch:
analyst: "robustness"
section: "Procedure > Step 3"
old_text: "**On failure:** <placeholder>"
new_text: "**On failure:** If the API returns 503, wait 5 seconds and retry up to 3 times. If retries are exhausted, fall back to the cached response from the previous successful run."
rationale: "Traces #4, #7, #12 show 503 errors resolved by retry. Trace #15 shows cache fallback when retries fail."
supporting_traces: [4, 7, 12, 15]
Expected: Each analyst returns 3-10 structured patches with section references, old/new text, rationale, and supporting trace IDs. All patches are collected into a single patch set.
On failure: If an analyst returns no patches, its lens may not apply to this skill. This is acceptable; not every lens surfaces issues. If an analyst returns vague patches without trace references, reject them and re-prompt, requiring concrete supporting_traces.
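The patch record and the rejection rule for unsupported patches can be modeled directly. This sketch mirrors the patch fields shown above. The `triage` helper is hypothetical, and its two extra checks (trace IDs outside the drafting set, no-op patches) are assumptions added for illustration rather than rules stated by the skill.

```python
from dataclasses import dataclass, field


@dataclass
class Patch:
    analyst: str
    section: str
    old_text: str
    new_text: str
    rationale: str
    supporting_traces: list[int] = field(default_factory=list)


def triage(patches: list[Patch], drafting_ids: set[int]):
    """Split patches into accepted ones and (patch, reason) rejections."""
    accepted, rejected = [], []
    for p in patches:
        if not p.supporting_traces:
            rejected.append((p, "no supporting_traces; re-prompt the analyst"))
        elif not set(p.supporting_traces) <= drafting_ids:
            # Assumed extra check: evidence must come from the drafting set.
            rejected.append((p, "cites traces outside the drafting set"))
        elif p.old_text == p.new_text:
            # Assumed extra check: a patch must change something.
            rejected.append((p, "patch changes nothing"))
        else:
            accepted.append(p)
    return accepted, rejected


patches = [
    Patch("robustness", "Procedure > Step 3", "**On failure:** <placeholder>",
          "**On failure:** Retry up to 3 times.", "503 errors resolved by retry",
          [4, 7, 12, 15]),
    Patch("clarity", "Procedure > Step 1", "Read file.", "Read the input file.",
          "Wording is ambiguous"),
]
accepted, rejected = triage(patches, drafting_ids=set(range(1, 21)))
print(len(accepted), [reason for _, reason in rejected])
```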
Step 5: Detect and Classify Conflicts
Compare all patches from Step 4 for overlapping edits. Classify each pair of patches into one of three categories.
- Index patches by target section
- For patches on the same section, compare old_text and new_text
- Classify each overlap:
| Conflict Type | Definition | Resolution |
|---|---|---|
| Compatible | Different sections, no overlap | Merge directly |
| Complementary | Same section, additive (both add content, no contradiction) | Combine text |
| Contradictory | Same section, mutually exclusive (one adds X, other removes X or adds Y instead) | Needs resolution in Step 6 |
conflict_report:
total_patches: 24
compatible: 18
complementary: 4
contradictory: 2
contradictions:
- section: "Procedure > Step 5"
patch_a: {analyst: "efficiency", action: "remove step"}
patch_b: {analyst: "robustness", action: "add retry logic"}
supporting_traces_a: [2, 8, 11]
supporting_traces_b: [4, 7, 12, 15]
Expected: A conflict report listing all patch pairs, their classification, and, for contradictions, the supporting trace counts for each side.
On failure: If a classification is ambiguous (a patch both adds and changes text in the same section), split it into two patches: one additive, one modifying. Then reclassify the smaller patches.
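The three-way classification can be sketched over patch pairs. The test for "additive" below is an assumed heuristic: a patch counts as additive when its `new_text` still contains all of its `old_text`. A real analyst pipeline would likely need semantic judgment here. Note also that this sketch counts pairs, whereas the sample conflict_report above counts patches.

```python
from itertools import combinations


def is_additive(patch: dict) -> bool:
    """Assumed heuristic: the patch keeps old_text intact and only adds to it."""
    return patch["old_text"] in patch["new_text"]


def classify(a: dict, b: dict) -> str:
    if a["section"] != b["section"]:
        return "compatible"
    if is_additive(a) and is_additive(b):
        return "complementary"
    return "contradictory"


def conflict_report(patches: list[dict]) -> dict:
    report = {"compatible": 0, "complementary": 0, "contradictory": 0,
              "contradictions": []}
    for a, b in combinations(patches, 2):
        kind = classify(a, b)
        report[kind] += 1
        if kind == "contradictory":
            report["contradictions"].append(
                {"section": a["section"], "patch_a": a["analyst"],
                 "patch_b": b["analyst"]}
            )
    return report


patches = [
    {"analyst": "clarity", "section": "Procedure > Step 3",
     "old_text": "Validate input.",
     "new_text": "Validate input. Log the schema version."},
    {"analyst": "edge_cases", "section": "Procedure > Step 3",
     "old_text": "Validate input.",
     "new_text": "Validate input. Reject empty files."},
    {"analyst": "efficiency", "section": "Procedure > Step 5",
     "old_text": "Retry once.", "new_text": ""},
    {"analyst": "robustness", "section": "Procedure > Step 5",
     "old_text": "Retry once.",
     "new_text": "Retry once. Wait 5 seconds between attempts."},
]
report = conflict_report(patches)
print(report)
```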
Step 6: Consolidate Patches
Merge all patches into one consolidated SKILL.md using a three-tier resolution strategy.
- Compatible patches: Apply directly -- these touch different sections and cannot conflict
- Complementary patches: Combine new_text from both patches into one coherent block, preserving both contributions
- Contradictory patches: Resolve by prevalence weighting:
  - Count how many traces support each variant
  - Prefer the patch supported by more traces
  - If tied (or within 10% of each other), use the argumentation skill to judge which patch better serves the skill's stated purpose
  - Document the rejected alternative as a Common Pitfall or as a note in the relevant On failure block
consolidation_log:
applied_directly: 18
combined: 4
resolved_by_prevalence: 1
resolved_by_argumentation: 1
rejected_alternatives_documented: 2
After consolidation, verify the SKILL.md:
- All sections are present (When to Use, Inputs, Procedure, Validation, Common Pitfalls, Related Skills)
- Every procedure step has Expected and On failure blocks
- No duplicate or conflicting instructions remain
- Line count is within the 500-line limit
Expected: A single consolidated SKILL.md incorporating patches from all analysts. Conflicts are resolved with documented rationale. The rejected alternative for each conflict appears as a pitfall or note.
On failure: If consolidation produces an internally inconsistent document (e.g., Step 3 assumes a file exists but Step 2 was removed by an efficiency patch), revert the conflicting edit and keep the original skeleton text for that section. Flag the conflict for manual review.
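Prevalence weighting for a contradictory pair reduces to comparing supporting-trace counts. In this sketch, "within 10% of each other" is interpreted as a relative difference of at most 0.10 against the larger count. That interpretation, along with the `resolve` name, is an assumption.

```python
def resolve(patch_a: dict, patch_b: dict, tie_margin: float = 0.10) -> dict:
    """Pick the patch backed by more traces, or escalate near-ties."""
    a = len(set(patch_a["supporting_traces"]))
    b = len(set(patch_b["supporting_traces"]))
    # Near-tie (or no evidence at all): prevalence cannot decide.
    if max(a, b) == 0 or abs(a - b) / max(a, b) <= tie_margin:
        return {"winner": None, "rejected": None, "method": "argumentation"}
    winner, loser = (patch_a, patch_b) if a > b else (patch_b, patch_a)
    return {"winner": winner, "rejected": loser, "method": "prevalence"}


efficiency = {"analyst": "efficiency", "action": "remove step",
              "supporting_traces": [2, 8, 11]}
robustness = {"analyst": "robustness", "action": "add retry logic",
              "supporting_traces": [4, 7, 12, 15]}

decision = resolve(efficiency, robustness)
print(decision["method"], decision["winner"]["analyst"])

tie = resolve({"supporting_traces": [1, 2, 3]}, {"supporting_traces": [4, 5, 6]})
print(tie["method"])
```

The `rejected` patch is returned rather than dropped so that it can be recorded as a pitfall or failure note, as the strategy above requires.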
Step 7: Validate and Register
Mentally execute the consolidated skill against the held-out traces (the 20% reserved in Step 1). Check that the Expected/On failure blocks match the outcomes observed in traces the skill has never seen.
- For each held-out trace, walk through the skill procedure step by step
- At each step, compare the skill's Expected outcome with the trace's actual outcome
- Record matches and mismatches:
validation_results:
held_out_traces: 5
full_match: 4
partial_match: 1
no_match: 0
mismatches:
- trace_id: 23
step: 4
expected: "API returns 200"
actual: "API returns 429 (rate limited)"
action: "Add rate-limit handling to On failure block"
- If the mismatch rate exceeds 20%, return to Step 4 with the mismatched traces added to the drafting set
- If the skill is new, follow create-skill for directory creation, registry entry, and symlink setup
- If evolving an existing skill, follow evolve-skill for the version bump and translation sync
# Final validation: line count
lines=$(wc -l < skills/<skill-name>/SKILL.md)
[ "$lines" -le 500 ] && echo "OK ($lines lines)" || echo "FAIL: $lines lines > 500"
Expected: At least 80% of held-out traces match the skill procedure end-to-end. The skill is registered in skills/_registry.yml with correct metadata.
On failure: If validation fails (>20% mismatch), the skill is overfit to the drafting traces. Add the mismatched traces to the drafting set and re-run from Step 2. If validation still fails after two rounds, the behavior may be too variable for a single skill; consider splitting it into multiple skills by outcome type.
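The held-out tally and the 20% threshold can be sketched as follows. Comparing expected and actual outcomes by exact string equality is a simplifying assumption. The step above describes a manual walk-through, where equivalent outcomes may be worded differently. The `validate` helper and the sample data are illustrative.

```python
def classify_trace(steps: list[tuple[str, str]]) -> str:
    """steps: (expected, actual) outcome pairs for one held-out trace."""
    matches = sum(1 for expected, actual in steps if expected == actual)
    if matches == len(steps):
        return "full_match"
    return "partial_match" if matches else "no_match"


def validate(held_out: dict[int, list[tuple[str, str]]],
             max_mismatch: float = 0.20) -> dict:
    """Tally match categories; pass if non-full matches stay within the limit."""
    results = {"full_match": 0, "partial_match": 0, "no_match": 0}
    for steps in held_out.values():
        results[classify_trace(steps)] += 1
    total = len(held_out)
    mismatched = total - results["full_match"]
    results["held_out_traces"] = total
    results["passed"] = total > 0 and mismatched / total <= max_mismatch
    return results


ok = [("file read", "file read"), ("API returns 200", "API returns 200")]
held_out = {
    21: ok,
    22: ok,
    23: [("file read", "file read"),
         ("API returns 200", "API returns 429 (rate limited)")],
    24: ok,
    25: ok,
}
results = validate(held_out)
print(results)
```

With four of five traces matching end-to-end, the mismatch rate is exactly 20%, which does not exceed the threshold, so this run passes.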
Validation
- At least 10 successful traces collected before drafting
- Traces split into drafting (80%) and held-out (20%) subsets
- Invariant core and variant branches clearly documented
- At least 4 analyst agents reviewed the skeleton through distinct lenses
- All patch conflicts classified (compatible, complementary, contradictory)
- Contradictory patches resolved with documented rationale
- Consolidated SKILL.md has all required sections with Expected/On failure pairs
- Held-out validation reaches at least an 80% match rate
- Line count within the 500-line limit
- Skill registered (new) or version-bumped (existing) per standard procedures
Pitfalls
- Too few traces: With fewer than 10 successful runs, pattern extraction is unreliable. The invariant core may include incidental steps, and variant branches will lack enough frequency data. Collect more traces before starting.
- Overfitting to trace artifacts: Tool-specific behaviors (e.g., a particular API client's retry pattern) may not generalize. During Step 3, abstract tool-specific actions into tool-agnostic descriptions. The skill should say what to do, not which tool to use.
- Ignoring failure traces: Failure traces show what the skill should warn about in On failure blocks. During Step 1, also collect failed runs and tag them. Use them in Step 4 when the robustness analyst checks for unhandled failure modes.
- Single-lens analysis: Using only 1-2 analysts misses key perspectives. An efficiency analyst alone will strip away safety checks that a robustness analyst would keep. Use at least 4 distinct lenses for balance.
- Merging contradictory patches without resolution: Applying both sides of a conflict yields an internally inconsistent skill (e.g., "do X" in one step and "skip X" in another). Always classify and resolve conflicts explicitly in Step 6.
- Skipping held-out validation: Without a held-out check, the consolidated skill may fit the drafting traces perfectly but fail on new runs. Always hold out 20% of the traces and test the final skill against them.
See Also
- evolve-skill -- simpler human-led evolution (complement: use when traces are not available)
- create-skill -- for newly derived skills that do not yet exist; used in Step 7 for registration
- review-skill-format -- run after consolidation to ensure agentskills.io compliance
- argumentation -- used in Step 6 to resolve contradictory patches when prevalence is tied
- verify-agent-output -- evidence trails for patch proposals; validates analyst outputs in Step 4
