SKILL·D2B00E

circuit-breaker-pattern

Name: circuit-breaker-pattern
Author: pjt222

pjt222

更新于 1 month ago

9 次查看

元aiautomationdesign

关于

This skill implements the circuit breaker pattern for agentic workflows to prevent cascading tool failures. It tracks tool health, manages state transitions, and routes calls to alternatives using capability maps when failures occur. Use it to build fault-tolerant agents that gracefully handle unreliable tools and recover from outages mid-task.

快速安装

Claude Code

技能文档

Circuit Breaker Pattern

Graceful degradation when tools fail. Agent w/ 5 tools, 1 broken → don't fail whole → spot broken tool, stop calling, shrink scope to achievable, report honest what skipped. Codifies circuit breaker from distributed systems → agentic tool orchestration.

Core insight from kirapixelads' "Kitchen Fire Problem": expeditor (orch layer) must NOT cook. Separate what to attempt from how → orchestrator stays out of broken tool's retry loop.

Use When

Building agents w/ many tools, varying reliability
Fault-tolerant workflows → partial > total failure
Agent stuck in retry loop on broken tool, not moving forward
Mid-task tool outage → graceful recovery
Hardening existing agents vs. cascading failures
Stale/cached tool out being treated as fresh

In

Required: Tool list (names + purposes)
Required: Task to accomplish
Optional: Known reliability issues / past fail patterns
Optional: Fail threshold (default: 3 consecutive fails → open)
Optional: Fail budget per cycle (default: 5 total fails → pause)
Optional: Half-open probe interval (default: every 3rd attempt post-open)

Do

Step 1: Build Capability Map

Declare each tool's capability + alternatives. Map = foundation for scope reduction → w/o it, fail leaves agent guessing.

capability_map:
  - tool: Grep
    provides: content search across files
    alternatives:
      - tool: Bash
        method: "rg or grep command"
        degradation: "loses Grep's built-in output formatting"
      - tool: Read
        method: "read suspected files directly"
        degradation: "requires knowing which files to check; no broad search"
    fallback: "ask the user which files to examine"

  - tool: Bash
    provides: command execution, build tools, git operations
    alternatives: []
    fallback: "report commands that need to be run manually"

  - tool: Read
    provides: file content inspection
    alternatives:
      - tool: Bash
        method: "cat or head command"
        degradation: "loses line numbering and truncation safety"
    fallback: "ask the user to paste file contents"

  - tool: Write
    provides: file creation
    alternatives:
      - tool: Edit
        method: "create via full-file edit"
        degradation: "requires file to already exist for Edit"
      - tool: Bash
        method: "echo/cat heredoc"
        degradation: "loses Write's atomic file creation"
    fallback: "output file contents for the user to save manually"

  - tool: WebSearch
    provides: external information retrieval
    alternatives: []
    fallback: "state what information is needed; ask user to provide it"

Each tool, doc:

Capability (one line)
Alternative tools (w/ degradation notes)
Manual fallback when no tool alternative

→ Full map covers every tool agent uses. Each entry has fallback even if no tool alt. Map makes explicit what's implicit: critical tools (no alts) vs. substitutable.

If err: Tool list unclear → start w/ allowed-tools from skill frontmatter. Alts uncertain → mark degradation: "unknown — test before relying on this route" vs. omit.

Step 2: Initialize Circuit Breaker State

State tracker per tool. All tools start CLOSED (healthy).

Circuit Breaker State Table:
+------------+--------+-------------------+------------------+-----------------+
| Tool       | State  | Consecutive Fails | Last Failure     | Last Success    |
+------------+--------+-------------------+------------------+-----------------+
| Grep       | CLOSED | 0                 | —                | —               |
| Bash       | CLOSED | 0                 | —                | —               |
| Read       | CLOSED | 0                 | —                | —               |
| Write      | CLOSED | 0                 | —                | —               |
| Edit       | CLOSED | 0                 | —                | —               |
| WebSearch  | CLOSED | 0                 | —                | —               |
+------------+--------+-------------------+------------------+-----------------+

Failure budget: 0 / 5 consumed

State defs:

CLOSED — Tool healthy. Use normally. Track consecutive fails.
OPEN — Tool known-broken. Don't call. Route to alts or shrink scope.
HALF-OPEN — Tool was broken, maybe recovered. Single probe call. Success → CLOSED. Fail → OPEN.

Transitions:

CLOSED → OPEN: Consecutive fails ≥ threshold (default: 3)
OPEN → HALF-OPEN: After interval (e.g., every 3rd task step)
HALF-OPEN → CLOSED: Probe success
HALF-OPEN → OPEN: Probe fail

→ State table init'd all tools CLOSED, zero fails. Threshold + budget declared.

If err: Can't enumerate tools upfront (dynamic discovery) → init state on first use. Pattern still works → build table incrementally.

Step 3: Implement Call-and-Track Loop

Agent needs tool call → follow decision seq. This = expeditor logic → decides whether to call, not how to execute.

BEFORE each tool call:
  1. Check tool state in the circuit breaker table
  2. If OPEN:
     a. Check if it is time for a half-open probe
        - Yes → transition to HALF-OPEN, proceed with probe call
        - No  → skip this tool, route to alternative (Step 4)
  3. If HALF-OPEN:
     a. Make one probe call
     b. Success → transition to CLOSED, reset consecutive fails to 0
     c. Failure → transition to OPEN, increment failure budget
  4. If CLOSED:
     a. Make the call normally

AFTER each tool call:
  1. Success:
     - Reset consecutive fails to 0
     - Record last success timestamp
  2. Failure:
     - Increment consecutive fails
     - Record last failure timestamp and error message
     - Increment failure budget consumed
     - If consecutive fails >= threshold:
         transition to OPEN
         log: "Circuit OPENED for [tool]: [failure count] consecutive failures"
     - If failure budget exhausted:
         PAUSE — do not continue the task
         Report to user (Step 6)

Expeditor NEVER retries failed call immediately. Record fail, check thresholds, move on. Retries via HALF-OPEN probe at later step only.

→ Clear decision loop before + after every tool call. Tool health tracked continuous. Expeditor layer never blocks on failing tool.

If err: Tracking state across calls impractical (stateless exec) → degrade to simpler: count total fails, pause at budget. Three-state breaker = ideal; fail counter = min viable.

Step 4: Route to Alternatives on Open Circuit

Tool OPEN → consult capability map (Step 1), route to best alt.

Routing priority:

Tool alt, low degradation — Similar capability tool. Note degradation in out.
Tool alt, high degradation — Big capability loss. Label what's missing.
Manual fallback — Report what agent can't do, what user needs to provide.
Scope reduction — No alt + no fallback → drop dependent sub-task (Step 5).

Example routing decision:

Tool needed: Grep (circuit OPEN)
Task: find all files containing "API_KEY"

Route 1: Bash with rg command
  → Degradation: loses Grep's built-in formatting
  → Decision: ACCEPTABLE — use this route

If Bash also OPEN:
Route 2: Read suspected config files directly
  → Degradation: requires guessing which files; no broad search
  → Decision: PARTIAL — try known config paths only

If Read also OPEN:
Route 3: Ask user
  → "I need to find files containing 'API_KEY' but my search
     tools are unavailable. Can you run: grep -r 'API_KEY' ."
  → Decision: FALLBACK — user provides the information

If user unavailable:
Route 4: Scope reduction
  → Remove "find API key references" from task scope
  → Document: "SKIPPED: API key search — no tools available"

→ Tool circuit opens → agent transparently routes to alt or degrades scope. Decision + degradation documented in out → user knows what's affected.

If err: Map incomplete (no alts listed) → default scope reduction + report. NEVER silently skip → always doc what + why.

Step 5: Reduce Scope to Achievable Work

Tools OPEN + alts exhausted → shrink task to what working tools can do. Not failure → honest scope mgmt.

Scope reduction proc:

List remaining sub-tasks
Each sub-task → check tools required
All required tools CLOSED or viable alts → keep
Any required tool OPEN no alt → mark DEFERRED
Continue w/ reduced scope
Report deferred at end

Scope Reduction Report:

Original scope: 5 sub-tasks
  [x] 1. Read configuration files          (Read: CLOSED)
  [x] 2. Search for deprecated patterns    (Grep: CLOSED)
  [ ] 3. Run test suite                    (Bash: OPEN — no alternative)
  [x] 4. Update documentation             (Edit: CLOSED)
  [ ] 5. Deploy to staging                 (Bash: OPEN — no alternative)

Reduced scope: 3 sub-tasks achievable
Deferred: 2 sub-tasks require Bash (circuit OPEN)

Recommendation: Complete sub-tasks 1, 2, 4 now.
Sub-tasks 3 and 5 require Bash — will probe on next cycle
or user can run commands manually.

Do NOT attempt deferred. Do NOT retry open-circuited tools hoping they work. Breaker exists to prevent this → trust its state.

→ Clear partition of task → achievable + deferred. Agent finishes achievable, reports deferred w/ reason + unblock path.

If err: Scope reduction removes all sub-tasks (all tools broken) → skip to Step 6 pause-and-report. Agent w/ no working tools must not fake progress.

Step 6: Handle Staleness + Label Data Quality

Tool returns maybe-stale data (cached, old snapshot, prev fetched) → label explicit, not treat as fresh.

Staleness indicators:

Tool out matches prev call exactly (cache hit?)
Data timestamps older than current task
Tool doc mentions caching
Results contradict other recent observations

Labeling proc:

When presenting potentially stale data:

"[STALE DATA — retrieved at {timestamp}, may not reflect current state]
 File contents as of last successful Read:
 ..."

"[CACHED RESULT — Grep returned identical results to previous call;
 filesystem may have changed since]"

"[UNVERIFIED — WebSearch result from {date}; current status unknown]"

NEVER silently present stale as current. User / downstream agent must know data quality.

→ All maybe-stale outs labeled. Fresh not labeled (labels = uncertainty, not confirmation).

If err: Can't determine staleness (no timestamps, no baseline) → note: "[FRESHNESS UNKNOWN — no baseline for comparison]". Uncertainty = info.

Step 7: Enforce Failure Budget

Track total fails across all tools. Budget exhausted → pause + report vs. keep accumulating errs.

Failure Budget Enforcement:

Budget: 5 failures per cycle
Current: 4 / 5 consumed

  Failure 1: Bash — "permission denied" (step 3)
  Failure 2: Bash — "command not found" (step 3)
  Failure 3: Bash — "timeout after 120s" (step 4)
  Failure 4: WebSearch — "connection refused" (step 5)

Status: 1 failure remaining before mandatory pause

→ Next tool call proceeds with heightened caution
→ If it fails: PAUSE and generate status report

On budget exhaustion:

FAILURE BUDGET EXHAUSTED — PAUSING

Completed work:
  - Sub-task 1: Read configuration files (SUCCESS)
  - Sub-task 2: Search for deprecated patterns (SUCCESS)

Incomplete work:
  - Sub-task 3: Run test suite (FAILED — Bash circuit OPEN)
  - Sub-task 4: Update documentation (NOT ATTEMPTED — paused)
  - Sub-task 5: Deploy to staging (NOT ATTEMPTED — paused)

Tool health:
  Grep: CLOSED (healthy)
  Read: CLOSED (healthy)
  Edit: CLOSED (healthy)
  Bash: OPEN (3 consecutive failures — permission/command/timeout)
  WebSearch: OPEN (1 failure — connection refused)

Failures: 5 / 5 budget consumed

Recommendation:
  1. Investigate Bash failures — likely environment issue
  2. Check network connectivity for WebSearch
  3. Resume from sub-task 4 after resolution

Pause-and-report = breaker in electrical systems → prevents damage accumulating. Agent that keeps calling broken tools wastes ctx, confuses user w/ repeat errs, inconsistent partial results.

→ Agent stops clean on budget exhaust. Report covers done work, incomplete work, tool health, actionable next steps.

If err: Can't generate clean report (state tracking lost) → out whatever avail. Partial report > silent continuation.

Step 8: Separation of Concerns — Expeditor vs. Executor

Valid. orchestration logic (Steps 2-7) cleanly separated from tool exec.

Expeditor (orch) does:

Track tool health state
Decide call, skip, probe
Route to alts on open
Enforce fail budget
Generate status reports

Expeditor does NOT:

Retry failed calls immediately
Modify call params to work around errs
Catch + suppress tool errs
Assume why tool failed
Exec fallback logic that itself needs tools

Expeditor "cooking" (calling tools to work around other fails) → separation broken. Expeditor routes to alt or shrinks scope, NOT fixes broken tool.

→ Clean boundary orch decisions vs. tool exec. Expeditor described w/o ref to specific tool APIs or err types.

If err: Orch + exec entangled → refactor → extract decision logic to separate step before each tool call. Decision step outs: CALL, SKIP, PROBE, PAUSE. Exec step acts on out.

Step 9: Detect Cascading Failures

Many tools share infra (network, fs, perms) → single root cause trips many breakers at once. Detect correlated pattern vs. treat each breaker indep.

Cascade indicators:

3+ tools OPEN in same task step / narrow window
Fails share common err signature (e.g., "connection refused," "permission denied")
Tools w/ indep fail history suddenly fail together

Response proc:

Second breaker opens → check if fail category matches first
Correlated → flag systemic failure → pause all tool calls, not just broken
Report suspected root: "Multiple tools failing with [shared pattern] — likely [network/filesystem/permissions] issue"
Don't probe half-open during systemic fail → probes also fail, waste budget
Resume probing only after user confirms infra fixed

Backoff compounding: Cascade trips → exponential backoff for half-open probes: probe step 3, then 6, then 12. Cap max interval 20 steps → prevent permanent circuit lock. Stops rapid-fire probes overwhelming recovering system.

→ Correlated fails detected + treated as single systemic event, not N indep trips. Fail budget counts systemic event once, not N times.

If err: Correlation detection impractical (diff err sigs, shared cause) → fallback indep per-tool breakers. System still degrades → just consumes budget faster.

Step 10: Pre-Call Tool Selection Layer

Before circuit breaker loop (Step 3) → optionally valid. tool available + likely succeed. Cuts unnecessary trips from predictable fails.

Pre-call checks:

Check	Method	Action on failure
Tool exists	Verify tool is in the allowed-tools list	Skip — do not even attempt
MCP server health	Check server process/connection status	Route to alternative immediately
Resource availability	Verify target file/URL/endpoint exists	Route or degrade scope

Decision table:

Pre-call score:
  AVAILABLE  → proceed to circuit breaker loop (Step 3)
  DEGRADED   → proceed with caution, lower the failure threshold by 1
  UNAVAILABLE → skip tool, route to alternative (Step 4) without consuming budget

Pre-call = advisory, not authoritative. Tool passing pre-call can still fail during exec. Breaker = primary reliability mechanism.

→ Predictable fails (missing tools, unreachable servers) caught before budget consumed. Breaker handles only genuine runtime fails.

If err: Pre-call checks unavail or too much overhead → skip entirely. Step 3 breaker loop handles all fails → pre-call = opt, not req.

Check

Traps

Retry vs. circuit-break: Calling broken tool repeat wastes budget + ctx. 3 consecutive fails = pattern, not bad luck. OPEN it.
Cooking in expeditor: Orch decides what, not how to fix broken. Expeditor crafting workaround cmds for Bash fails → crossed the boundary.
Silent scope reduction: Dropping sub-tasks w/o doc → looks complete, isn't. Always report skipped.
Treat stale as fresh: Cached/prev results maybe not current. Label uncertainty, don't ignore.
Open circuits too eagerly: Single transient fail shouldn't open. Threshold (default: 3) filters noise from signal.
Never probe post-open: Permanently open → agent never finds recovery. Half-open probes essential.
Ignore fail budget: W/o budget → agent accumulates dozens of fails across tools while "progressing" on paper. Budget forces honest checkpoint.
Cascade backoff multiply: Many tools in dep chain each apply own exp backoff → compound delay grows multiplicatively. Cap total aggregate backoff across chain, not just per tool.
Stale discovery scores: Step 10 caches tool avail. No invalidation when conditions change → agent skips recovered tool or attempts unavail. Re-check scores after systemic fail event.

→

fail-early-pattern — complementary: fail-early validates in before work; circuit-breaker manages fails during work
escalate-issues — budget exhausted / scope reduction significant → escalate to specialist or human
write-incident-runbook — doc recurring fail patterns as runbooks
assess-context — eval if current approach can adapt when many tools degraded; pairs w/ scope reduction
du-dum — two-clock arch sep'ng observe from decide; complement for cutting observe cost in agent loops

GitHub 仓库

pjt222/agent-almanac

路径: i18n/caveman-ultra/skills/circuit-breaker-pattern

agentsagentskillsai-assisted-developmentclaude-codeskillsteams

FAQ

Frequently asked questions

What is the circuit-breaker-pattern skill?

circuit-breaker-pattern is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform circuit-breaker-pattern-related tasks without extra prompting.

How do I install circuit-breaker-pattern?

Use the install commands on this page: add circuit-breaker-pattern to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does circuit-breaker-pattern belong to?

circuit-breaker-pattern is in the Meta category, tagged ai, automation and design.

Is circuit-breaker-pattern free to use?

Yes. circuit-breaker-pattern is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.