circuit-breaker-pattern
About
This skill implements the circuit breaker pattern for agentic workflows to prevent cascading tool failures. It tracks tool health, manages state transitions, and routes calls to alternatives using capability maps when failures occur. Use it to build fault-tolerant agents that gracefully handle unreliable tools and recover from outages mid-task.
Quick Install
Claude Code
Recommended: npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/circuit-breaker-pattern
Copy and paste this command in Claude Code to install the skill
Skill Documentation
Circuit Breaker Pattern
Graceful degradation when tools fail. Agent w/ 5 tools, 1 broken → don't fail whole → spot broken tool, stop calling, shrink scope to achievable, report honest what skipped. Codifies circuit breaker from distributed systems → agentic tool orchestration.
Core insight from kirapixelads' "Kitchen Fire Problem": expeditor (orch layer) must NOT cook. Separate what to attempt from how → orchestrator stays out of broken tool's retry loop.
Use When
- Building agents w/ many tools, varying reliability
- Fault-tolerant workflows → partial > total failure
- Agent stuck in retry loop on broken tool, not moving forward
- Mid-task tool outage → graceful recovery
- Hardening existing agents vs. cascading failures
- Stale/cached tool output being treated as fresh
In
- Required: Tool list (names + purposes)
- Required: Task to accomplish
- Optional: Known reliability issues / past fail patterns
- Optional: Fail threshold (default: 3 consecutive fails → open)
- Optional: Fail budget per cycle (default: 5 total fails → pause)
- Optional: Half-open probe interval (default: every 3rd attempt post-open)
Do
Step 1: Build Capability Map
Declare each tool's capability + alternatives. Map = foundation for scope reduction → w/o it, fail leaves agent guessing.
capability_map:
  - tool: Grep
    provides: content search across files
    alternatives:
      - tool: Bash
        method: "rg or grep command"
        degradation: "loses Grep's built-in output formatting"
      - tool: Read
        method: "read suspected files directly"
        degradation: "requires knowing which files to check; no broad search"
    fallback: "ask the user which files to examine"
  - tool: Bash
    provides: command execution, build tools, git operations
    alternatives: []
    fallback: "report commands that need to be run manually"
  - tool: Read
    provides: file content inspection
    alternatives:
      - tool: Bash
        method: "cat or head command"
        degradation: "loses line numbering and truncation safety"
    fallback: "ask the user to paste file contents"
  - tool: Write
    provides: file creation
    alternatives:
      - tool: Edit
        method: "create via full-file edit"
        degradation: "requires file to already exist for Edit"
      - tool: Bash
        method: "echo/cat heredoc"
        degradation: "loses Write's atomic file creation"
    fallback: "output file contents for the user to save manually"
  - tool: WebSearch
    provides: external information retrieval
    alternatives: []
    fallback: "state what information is needed; ask user to provide it"
Each tool, doc:
- Capability (one line)
- Alternative tools (w/ degradation notes)
- Manual fallback when no tool alternative
→ Full map covers every tool agent uses. Each entry has fallback even if no tool alt. Map makes explicit what's implicit: critical tools (no alts) vs. substitutable.
If err: Tool list unclear → start w/ allowed-tools from skill frontmatter. Alts uncertain → mark degradation: "unknown — test before relying on this route" vs. omit.
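A minimal sketch of the map as a data structure, assuming Python. The class and function names (`Alternative`, `Capability`, `critical_tools`, `missing_fallbacks`) are hypothetical, not part of the skill. Only two entries from the map above are copied in; the rest follow the same shape.

```python
from dataclasses import dataclass, field


@dataclass
class Alternative:
    tool: str
    method: str
    degradation: str


@dataclass
class Capability:
    provides: str
    fallback: str
    alternatives: list[Alternative] = field(default_factory=list)


# Two entries copied from the map above; the remaining tools follow the same shape.
CAPABILITY_MAP: dict[str, Capability] = {
    "Grep": Capability(
        provides="content search across files",
        fallback="ask the user which files to examine",
        alternatives=[
            Alternative("Bash", "rg or grep command",
                        "loses Grep's built-in output formatting"),
            Alternative("Read", "read suspected files directly",
                        "requires knowing which files to check; no broad search"),
        ],
    ),
    "Bash": Capability(
        provides="command execution, build tools, git operations",
        fallback="report commands that need to be run manually",
    ),
}


def critical_tools(cap_map: dict[str, Capability]) -> list[str]:
    """Tools with no tool-level alternative (only a manual fallback)."""
    return sorted(name for name, cap in cap_map.items() if not cap.alternatives)


def missing_fallbacks(cap_map: dict[str, Capability]) -> list[str]:
    """Entries that break the 'every tool has a fallback' rule."""
    return sorted(name for name, cap in cap_map.items() if not cap.fallback.strip())
```

Running `critical_tools` over the full map is one way to surface which tools (here Bash) have no substitute before the task starts.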
Step 2: Initialize Circuit Breaker State
State tracker per tool. All tools start CLOSED (healthy).
Circuit Breaker State Table:
+------------+--------+-------------------+------------------+-----------------+
| Tool | State | Consecutive Fails | Last Failure | Last Success |
+------------+--------+-------------------+------------------+-----------------+
| Grep | CLOSED | 0 | — | — |
| Bash | CLOSED | 0 | — | — |
| Read | CLOSED | 0 | — | — |
| Write | CLOSED | 0 | — | — |
| Edit | CLOSED | 0 | — | — |
| WebSearch | CLOSED | 0 | — | — |
+------------+--------+-------------------+------------------+-----------------+
Failure budget: 0 / 5 consumed
State defs:
- CLOSED — Tool healthy. Use normally. Track consecutive fails.
- OPEN — Tool known-broken. Don't call. Route to alts or shrink scope.
- HALF-OPEN — Tool was broken, maybe recovered. Single probe call. Success → CLOSED. Fail → OPEN.
Transitions:
- CLOSED → OPEN: Consecutive fails ≥ threshold (default: 3)
- OPEN → HALF-OPEN: After interval (e.g., every 3rd task step)
- HALF-OPEN → CLOSED: Probe success
- HALF-OPEN → OPEN: Probe fail
→ State table init'd all tools CLOSED, zero fails. Threshold + budget declared.
If err: Can't enumerate tools upfront (dynamic discovery) → init state on first use. Pattern still works → build table incrementally.
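The state table and transitions can be sketched as below. This is an illustration under assumptions, not the skill's implementation: `Breaker`, `on_result`, `on_step`, and the `steps_since_open` counter are hypothetical names, and the probe interval is counted in task steps.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF-OPEN"


@dataclass
class Breaker:
    state: State = State.CLOSED
    consecutive_fails: int = 0
    steps_since_open: int = 0  # hypothetical counter driving the probe interval


FAIL_THRESHOLD = 3  # default threshold from the inputs list
PROBE_INTERVAL = 3  # assumption: interval measured in task steps


def init_table(tools: list[str]) -> dict[str, Breaker]:
    """All tools start CLOSED with zero fails."""
    return {tool: Breaker() for tool in tools}


def on_result(b: Breaker, success: bool) -> None:
    """Apply the CLOSED/HALF-OPEN transitions after a call or probe."""
    if success:
        b.state = State.CLOSED
        b.consecutive_fails = 0
        return
    b.consecutive_fails += 1
    if b.state is State.HALF_OPEN or b.consecutive_fails >= FAIL_THRESHOLD:
        b.state = State.OPEN
        b.steps_since_open = 0


def on_step(b: Breaker) -> None:
    """Advance one task step; an OPEN breaker becomes HALF-OPEN after the interval."""
    if b.state is State.OPEN:
        b.steps_since_open += 1
        if b.steps_since_open >= PROBE_INTERVAL:
            b.state = State.HALF_OPEN
```

For dynamic tool discovery, the same `Breaker()` default can be created on first use instead of through `init_table`.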
Step 3: Implement Call-and-Track Loop
Agent needs tool call → follow decision seq. This = expeditor logic → decides whether to call, not how to execute.
BEFORE each tool call:
  1. Check tool state in the circuit breaker table
  2. If OPEN:
     a. Check if it is time for a half-open probe
        - Yes → transition to HALF-OPEN, proceed with probe call
        - No → skip this tool, route to alternative (Step 4)
  3. If HALF-OPEN:
     a. Make one probe call
     b. Success → transition to CLOSED, reset consecutive fails to 0
     c. Failure → transition to OPEN, increment failure budget
  4. If CLOSED:
     a. Make the call normally
AFTER each tool call:
  1. Success:
     - Reset consecutive fails to 0
     - Record last success timestamp
  2. Failure:
     - Increment consecutive fails
     - Record last failure timestamp and error message
     - Increment failure budget consumed
     - If consecutive fails >= threshold:
         transition to OPEN
         log: "Circuit OPENED for [tool]: [failure count] consecutive failures"
     - If failure budget exhausted:
         PAUSE — do not continue the task
         Report to user (Step 7)
Expeditor NEVER retries failed call immediately. Record fail, check thresholds, move on. Retries via HALF-OPEN probe at later step only.
→ Clear decision loop before + after every tool call. Tool health tracked continuous. Expeditor layer never blocks on failing tool.
If err: Tracking state across calls impractical (stateless exec) → degrade to simpler: count total fails, pause at budget. Three-state breaker = ideal; fail counter = min viable.
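One possible shape for the before/after loop, sketched in Python. `Expeditor`, `before`, `after`, and `attempt` are hypothetical names. The probe interval is read as "every 3rd attempt after the circuit opens", and each failure consumes one budget unit; both are assumptions about the defaults above.

```python
from typing import Callable

THRESHOLD, BUDGET, PROBE_EVERY = 3, 5, 3  # defaults from the inputs list


class Expeditor:
    def __init__(self) -> None:
        self.state: dict[str, str] = {}   # tool -> CLOSED / OPEN / HALF-OPEN
        self.fails: dict[str, int] = {}   # tool -> consecutive fails
        self.skips: dict[str, int] = {}   # tool -> attempts skipped while OPEN
        self.budget_used = 0
        self.log: list[str] = []

    def before(self, tool: str) -> str:
        """Return CALL, PROBE, SKIP, or PAUSE without touching the tool."""
        if self.budget_used >= BUDGET:
            return "PAUSE"
        state = self.state.setdefault(tool, "CLOSED")
        if state == "OPEN":
            self.skips[tool] = self.skips.get(tool, 0) + 1
            if self.skips[tool] < PROBE_EVERY:
                return "SKIP"
            self.state[tool] = "HALF-OPEN"
            return "PROBE"
        return "PROBE" if state == "HALF-OPEN" else "CALL"

    def after(self, tool: str, ok: bool) -> None:
        """Track the outcome and apply threshold checks."""
        if ok:
            self.state[tool], self.fails[tool] = "CLOSED", 0
            return
        self.budget_used += 1
        self.fails[tool] = self.fails.get(tool, 0) + 1
        if self.state.get(tool) == "HALF-OPEN" or self.fails[tool] >= THRESHOLD:
            self.state[tool], self.skips[tool] = "OPEN", 0
            self.log.append(
                f"Circuit OPENED for {tool}: {self.fails[tool]} consecutive failures"
            )

    def attempt(self, tool: str, call: Callable[[], object]) -> str:
        """Run one decision cycle; the caller handles SKIP and PAUSE."""
        decision = self.before(tool)
        if decision in ("CALL", "PROBE"):
            try:
                call()
                self.after(tool, ok=True)
            except Exception:
                self.after(tool, ok=False)  # record and move on; no immediate retry
        return decision
```

On `SKIP` the caller routes to an alternative (Step 4); on `PAUSE` it stops and reports. The failed call is never re-issued inside `attempt`.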
Step 4: Route to Alternatives on Open Circuit
Tool OPEN → consult capability map (Step 1), route to best alt.
Routing priority:
- Tool alt, low degradation — Similar capability tool. Note degradation in out.
- Tool alt, high degradation — Big capability loss. Label what's missing.
- Manual fallback — Report what agent can't do, what user needs to provide.
- Scope reduction — No alt + no fallback → drop dependent sub-task (Step 5).
Example routing decision:
Tool needed: Grep (circuit OPEN)
Task: find all files containing "API_KEY"
Route 1: Bash with rg command
  → Degradation: loses Grep's built-in formatting
  → Decision: ACCEPTABLE — use this route
If Bash also OPEN:
  Route 2: Read suspected config files directly
    → Degradation: requires guessing which files; no broad search
    → Decision: PARTIAL — try known config paths only
If Read also OPEN:
  Route 3: Ask user
    → "I need to find files containing 'API_KEY' but my search
       tools are unavailable. Can you run: grep -r 'API_KEY' ."
    → Decision: FALLBACK — user provides the information
If user unavailable:
  Route 4: Scope reduction
    → Remove "find API key references" from task scope
    → Document: "SKIPPED: API key search — no tools available"
→ Tool circuit opens → agent transparently routes to alt or degrades scope. Decision + degradation documented in out → user knows what's affected.
If err: Map incomplete (no alts listed) → default scope reduction + report. NEVER silently skip → always doc what + why.
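The routing priority can be sketched as a small lookup. This is illustrative only: `route`, the trimmed `CAPABILITY_MAP`, and the route labels are hypothetical, and alternatives are assumed to be listed in order of increasing degradation.

```python
from typing import Optional

# Hypothetical, trimmed capability map: tool -> ordered alternatives + manual fallback.
CAPABILITY_MAP = {
    "Grep": {"alternatives": ["Bash", "Read"],
             "fallback": "ask the user which files to examine"},
    "Bash": {"alternatives": [],
             "fallback": "report commands that need to be run manually"},
}


def route(tool: str, open_tools: set[str],
          user_available: bool = True) -> tuple[str, Optional[str]]:
    """Return (route_kind, target) following the routing priority above."""
    if tool not in open_tools:
        return ("DIRECT", tool)
    # Incomplete map: treat as no alternatives and no fallback.
    entry = CAPABILITY_MAP.get(tool, {"alternatives": [], "fallback": ""})
    for alt in entry["alternatives"]:  # assumed ordered by increasing degradation
        if alt not in open_tools:
            return ("ALTERNATIVE", alt)
    if entry["fallback"] and user_available:
        return ("FALLBACK", entry["fallback"])
    return ("SCOPE_REDUCTION", None)  # caller must document what was skipped and why
```

With Grep and Bash both OPEN, `route("Grep", {"Grep", "Bash"})` yields the Read alternative, matching Route 2 in the example.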
Step 5: Reduce Scope to Achievable Work
Tools OPEN + alts exhausted → shrink task to what working tools can do. Not failure → honest scope mgmt.
Scope reduction proc:
- List remaining sub-tasks
- Each sub-task → check tools required
- All required tools CLOSED or viable alts → keep
- Any required tool OPEN no alt → mark DEFERRED
- Continue w/ reduced scope
- Report deferred at end
Scope Reduction Report:
Original scope: 5 sub-tasks
[x] 1. Read configuration files (Read: CLOSED)
[x] 2. Search for deprecated patterns (Grep: CLOSED)
[ ] 3. Run test suite (Bash: OPEN — no alternative)
[x] 4. Update documentation (Edit: CLOSED)
[ ] 5. Deploy to staging (Bash: OPEN — no alternative)
Reduced scope: 3 sub-tasks achievable
Deferred: 2 sub-tasks require Bash (circuit OPEN)
Recommendation: Complete sub-tasks 1, 2, 4 now.
Sub-tasks 3 and 5 require Bash — will probe on next cycle
or user can run commands manually.
Do NOT attempt deferred. Do NOT retry open-circuited tools hoping they work. Breaker exists to prevent this → trust its state.
→ Clear partition of task → achievable + deferred. Agent finishes achievable, reports deferred w/ reason + unblock path.
If err: Scope reduction removes all sub-tasks (all tools broken) → skip to Step 7 pause-and-report. Agent w/ no working tools must not fake progress.
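A sketch of the partition step, assuming each sub-task declares the tools it needs. `partition_scope` and its argument shapes are hypothetical.

```python
def partition_scope(
    subtasks: dict[str, list[str]],
    open_tools: set[str],
    alternatives: dict[str, list[str]],
) -> tuple[list[str], dict[str, str]]:
    """Split sub-tasks into (achievable, deferred-with-reason)."""
    achievable: list[str] = []
    deferred: dict[str, str] = {}
    for name, tools in subtasks.items():
        # A required tool blocks the sub-task when it is OPEN and has no working alternative.
        blocked = [
            t for t in tools
            if t in open_tools
            and not any(alt not in open_tools for alt in alternatives.get(t, []))
        ]
        if blocked:
            deferred[name] = f"requires {', '.join(blocked)} (circuit OPEN, no alternative)"
        else:
            achievable.append(name)
    return achievable, deferred
```

If `achievable` comes back empty, the agent has nothing honest left to do and should go straight to pause-and-report.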
Step 6: Handle Staleness + Label Data Quality
Tool returns maybe-stale data (cached, old snapshot, prev fetched) → label explicit, not treat as fresh.
Staleness indicators:
- Tool out matches prev call exactly (cache hit?)
- Data timestamps older than current task
- Tool doc mentions caching
- Results contradict other recent observations
Labeling proc:
When presenting potentially stale data:
"[STALE DATA — retrieved at {timestamp}, may not reflect current state]
File contents as of last successful Read:
..."
"[CACHED RESULT — Grep returned identical results to previous call;
filesystem may have changed since]"
"[UNVERIFIED — WebSearch result from {date}; current status unknown]"
NEVER silently present stale as current. User / downstream agent must know data quality.
→ All maybe-stale outs labeled. Fresh not labeled (labels = uncertainty, not confirmation).
If err: Can't determine staleness (no timestamps, no baseline) → note: "[FRESHNESS UNKNOWN — no baseline for comparison]". Uncertainty = info.
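The labeling rule can be sketched as a function that returns a label tag, or nothing for fresh data. The names `staleness_label` and `present` are hypothetical, and only two of the indicators above (old timestamps, identical repeat output) are checked; the check order is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def staleness_label(
    output: str,
    retrieved_at: Optional[datetime],
    task_started_at: datetime,
    previous_output: Optional[str] = None,
) -> Optional[str]:
    """Return a data-quality tag, or None when nothing suggests staleness."""
    if retrieved_at is not None and retrieved_at < task_started_at:
        return "STALE DATA"
    if previous_output is not None and output == previous_output:
        return "CACHED RESULT"
    if retrieved_at is None:
        return "FRESHNESS UNKNOWN"
    return None  # fresh output stays unlabeled


def present(output: str, label: Optional[str]) -> str:
    """Prefix the output with its label; fresh output passes through unchanged."""
    return output if label is None else f"[{label}]\n{output}"


if __name__ == "__main__":
    started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    fetched = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)
    print(present("file contents", staleness_label("file contents", fetched, started)))
```

A fuller version would carry the timestamp and reason into the label text, as in the templates above.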
Step 7: Enforce Failure Budget
Track total fails across all tools. Budget exhausted → pause + report vs. keep accumulating errs.
Failure Budget Enforcement:
Budget: 5 failures per cycle
Current: 4 / 5 consumed
Failure 1: Bash — "permission denied" (step 3)
Failure 2: Bash — "command not found" (step 3)
Failure 3: Bash — "timeout after 120s" (step 4)
Failure 4: WebSearch — "connection refused" (step 5)
Status: 1 failure remaining before mandatory pause
→ Next tool call proceeds with heightened caution
→ If it fails: PAUSE and generate status report
On budget exhaustion:
FAILURE BUDGET EXHAUSTED — PAUSING
Completed work:
- Sub-task 1: Read configuration files (SUCCESS)
- Sub-task 2: Search for deprecated patterns (SUCCESS)
Incomplete work:
- Sub-task 3: Run test suite (FAILED — Bash circuit OPEN)
- Sub-task 4: Update documentation (NOT ATTEMPTED — paused)
- Sub-task 5: Deploy to staging (NOT ATTEMPTED — paused)
Tool health:
Grep: CLOSED (healthy)
Read: CLOSED (healthy)
Edit: CLOSED (healthy)
Bash: OPEN (3 consecutive failures — permission/command/timeout)
WebSearch: CLOSED (connection refused; below 3-failure threshold)
Failures: 5 / 5 budget consumed
Recommendation:
1. Investigate Bash failures — likely environment issue
2. Check network connectivity for WebSearch
3. Resume from sub-task 4 after resolution
Pause-and-report = breaker in electrical systems → prevents damage accumulating. Agent that keeps calling broken tools wastes ctx, confuses user w/ repeat errs, inconsistent partial results.
→ Agent stops clean on budget exhaust. Report covers done work, incomplete work, tool health, actionable next steps.
If err: Can't generate clean report (state tracking lost) → out whatever avail. Partial report > silent continuation.
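A minimal budget tracker, as a sketch. `FailureBudget` and its methods are hypothetical names; the limit of 5 is the default from the inputs list.

```python
from dataclasses import dataclass, field


@dataclass
class FailureBudget:
    limit: int = 5
    # Each entry: (tool, error message, task step)
    failures: list[tuple[str, str, int]] = field(default_factory=list)

    @property
    def exhausted(self) -> bool:
        return len(self.failures) >= self.limit

    def record(self, tool: str, error: str, step: int) -> bool:
        """Record one failure; return True when the task must pause."""
        self.failures.append((tool, error, step))
        return self.exhausted

    def status(self) -> str:
        used = len(self.failures)
        if self.exhausted:
            return f"FAILURE BUDGET EXHAUSTED: {used} / {self.limit} consumed, pausing"
        remaining = self.limit - used
        return f"{used} / {self.limit} consumed, {remaining} remaining before mandatory pause"
```

Keeping the raw failure list (not just a counter) is what lets the pause report name each tool, error, and step.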
Step 8: Separation of Concerns — Expeditor vs. Executor
Validate that orchestration logic (Steps 2-7) is cleanly separated from tool exec.
Expeditor (orch) does:
- Track tool health state
- Decide call, skip, probe
- Route to alts on open
- Enforce fail budget
- Generate status reports
Expeditor does NOT:
- Retry failed calls immediately
- Modify call params to work around errs
- Catch + suppress tool errs
- Assume why tool failed
- Exec fallback logic that itself needs tools
Expeditor "cooking" (calling tools to work around other fails) → separation broken. Expeditor routes to alt or shrinks scope, NOT fixes broken tool.
→ Clean boundary orch decisions vs. tool exec. Expeditor described w/o ref to specific tool APIs or err types.
If err: Orch + exec entangled → refactor → extract decision logic to separate step before each tool call. Decision step outs: CALL, SKIP, PROBE, PAUSE. Exec step acts on out.
Step 9: Detect Cascading Failures
Many tools share infra (network, fs, perms) → single root cause trips many breakers at once. Detect correlated pattern vs. treat each breaker indep.
Cascade indicators:
- 3+ tools OPEN in same task step / narrow window
- Fails share common err signature (e.g., "connection refused," "permission denied")
- Tools w/ indep fail history suddenly fail together
Response proc:
- Second breaker opens → check if fail category matches first
- Correlated → flag systemic failure → pause all tool calls, not just broken
- Report suspected root: "Multiple tools failing with [shared pattern] — likely [network/filesystem/permissions] issue"
- Don't probe half-open during systemic fail → probes also fail, waste budget
- Resume probing only after user confirms infra fixed
Backoff compounding: Cascade trips → exponential backoff for half-open probes: probe step 3, then 6, then 12. Cap max interval 20 steps → prevent permanent circuit lock. Stops rapid-fire probes overwhelming recovering system.
→ Correlated fails detected + treated as single systemic event, not N indep trips. Fail budget counts systemic event once, not N times.
If err: Correlation detection impractical (diff err sigs, shared cause) → fallback indep per-tool breakers. System still degrades → just consumes budget faster.
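Cascade detection and the capped backoff can be sketched as follows. Assumptions: the `SIGNATURES` list is hypothetical and matched by substring, `min_tools=3` follows the "3+ tools OPEN" indicator, and "3, then 6, then 12" is read as the interval between probes.

```python
from collections import Counter
from typing import Optional

# Hypothetical signature list; a real agent would extend this.
SIGNATURES = ("connection refused", "permission denied", "timeout")


def error_category(message: str) -> str:
    """Map an error message to a coarse category by substring match."""
    lowered = message.lower()
    return next((sig for sig in SIGNATURES if sig in lowered), "other")


def detect_cascade(open_errors: dict[str, str], min_tools: int = 3) -> Optional[str]:
    """Return the shared error category if enough OPEN tools fail the same way."""
    counts = Counter(error_category(msg) for msg in open_errors.values())
    counts.pop("other", None)  # uncategorized errors are not treated as correlated
    if not counts:
        return None
    category, n = counts.most_common(1)[0]
    return category if n >= min_tools else None


def probe_interval(trips: int, base: int = 3, cap: int = 20) -> int:
    """Steps to wait before the next half-open probe: 3, 6, 12, then capped at 20."""
    return min(base * 2 ** trips, cap)
```

Passing `min_tools=2` would follow the response procedure's earlier check, where the second breaker to open is compared against the first.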
Step 10: Pre-Call Tool Selection Layer
Before circuit breaker loop (Step 3) → optionally validate tool is available + likely to succeed. Cuts unnecessary trips from predictable fails.
Pre-call checks:
| Check | Method | Action on failure |
|---|---|---|
| Tool exists | Verify tool is in the allowed-tools list | Skip — do not even attempt |
| MCP server health | Check server process/connection status | Route to alternative immediately |
| Resource availability | Verify target file/URL/endpoint exists | Route or degrade scope |
Decision table:
Pre-call score:
AVAILABLE → proceed to circuit breaker loop (Step 3)
DEGRADED → proceed with caution, lower the failure threshold by 1
UNAVAILABLE → skip tool, route to alternative (Step 4) without consuming budget
Pre-call = advisory, not authoritative. Tool passing pre-call can still fail during exec. Breaker = primary reliability mechanism.
→ Predictable fails (missing tools, unreachable servers) caught before budget consumed. Breaker handles only genuine runtime fails.
If err: Pre-call checks unavail or too much overhead → skip entirely. Step 3 breaker loop handles all fails → pre-call = opt, not req.
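The decision table can be sketched as below. `PreCall`, `score_tool`, and `apply_pre_call` are hypothetical names. The source does not say which check produces DEGRADED, so the `degraded` flag here is a stand-in the caller must supply.

```python
from enum import Enum


class PreCall(Enum):
    AVAILABLE = "AVAILABLE"
    DEGRADED = "DEGRADED"
    UNAVAILABLE = "UNAVAILABLE"


def score_tool(tool: str, allowed_tools: set[str], degraded: bool = False) -> PreCall:
    """Cheap advisory check: only the 'tool exists' row is implemented here."""
    if tool not in allowed_tools:
        return PreCall.UNAVAILABLE
    return PreCall.DEGRADED if degraded else PreCall.AVAILABLE


def apply_pre_call(score: PreCall, base_threshold: int = 3) -> tuple[str, int]:
    """Return (action, fail_threshold) for the decision table above."""
    if score is PreCall.UNAVAILABLE:
        return ("ROUTE_TO_ALTERNATIVE", base_threshold)  # no budget consumed
    if score is PreCall.DEGRADED:
        return ("ENTER_BREAKER_LOOP", max(1, base_threshold - 1))
    return ("ENTER_BREAKER_LOOP", base_threshold)
```

The score stays advisory: a tool scored AVAILABLE still goes through the Step 3 loop and can trip its breaker there.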
Check
- Map covers all tools w/ alts + fallbacks documented
- Breaker state table init'd all tools
- Transitions follow CLOSED → OPEN → HALF-OPEN → CLOSED
- Fail threshold declared explicit (not implicit)
- Alt routing tried before scope reduction
- Scope reduction doc'd w/ deferred + reasons
- Stale data labeled explicit — never fresh
- Fail budget enforced w/ pause-and-report on exhaust
- Expeditor logic does NOT exec tool calls or retry fails
- Status report: done work, incomplete work, tool health
- No silent fails → every skip, deferral, degradation doc'd
- Cascade fails detected when 3+ tools open together
- Systemic fail pauses all probes until infra confirmed recovered
- Pre-call checks (if used) don't consume budget on predictable fails
Traps
- Retry vs. circuit-break: Calling broken tool repeat wastes budget + ctx. 3 consecutive fails = pattern, not bad luck. OPEN it.
- Cooking in expeditor: Orch decides what, not how to fix broken. Expeditor crafting workaround cmds for Bash fails → crossed the boundary.
- Silent scope reduction: Dropping sub-tasks w/o doc → looks complete, isn't. Always report skipped.
- Treat stale as fresh: Cached/prev results maybe not current. Label uncertainty, don't ignore.
- Open circuits too eagerly: Single transient fail shouldn't open. Threshold (default: 3) filters noise from signal.
- Never probe post-open: Permanently open → agent never finds recovery. Half-open probes essential.
- Ignore fail budget: W/o budget → agent accumulates dozens of fails across tools while "progressing" on paper. Budget forces honest checkpoint.
- Cascade backoff multiply: Many tools in dep chain each apply own exp backoff → compound delay grows multiplicatively. Cap total aggregate backoff across chain, not just per tool.
- Stale discovery scores: Step 10 caches tool avail. No invalidation when conditions change → agent skips recovered tool or attempts unavail. Re-check scores after systemic fail event.
Related
- fail-early-pattern — complementary: fail-early validates input before work; circuit-breaker manages fails during work
- escalate-issues — budget exhausted / scope reduction significant → escalate to specialist or human
- write-incident-runbook — doc recurring fail patterns as runbooks
- assess-context — eval if current approach can adapt when many tools degraded; pairs w/ scope reduction
- du-dum — two-clock arch sep'ng observe from decide; complement for cutting observe cost in agent loops
