circuit-breaker-pattern
О программе
Этот навык реализует паттерн "Предохранитель" для агентных рабочих процессов, чтобы предотвращать каскадные сбои инструментов. Он отслеживает работоспособность инструментов, управляет переходами состояний и перенаправляет вызовы к альтернативам с использованием карт возможностей при возникновении сбоев. Используйте его для создания отказоустойчивых агентов, которые корректно обрабатывают ненадёжные инструменты и восстанавливаются после сбоев в процессе выполнения задачи.
Быстрая установка
Claude Code
Рекомендуетсяnpx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/circuit-breaker-patternСкопируйте и вставьте эту команду в Claude Code для установки этого навыка
Документация
Circuit Breaker Pattern
Graceful degradation when tools fail. Agent w/ 5 tools, 1 broken → don't fail whole → spot broken tool, stop calling, shrink scope to achievable, report honest what skipped. Codifies circuit breaker from distributed systems → agentic tool orchestration.
Core insight from kirapixelads' "Kitchen Fire Problem": expeditor (orch layer) must NOT cook. Separate what to attempt from how → orchestrator stays out of broken tool's retry loop.
Use When
- Building agents w/ many tools, varying reliability
- Fault-tolerant workflows → partial > total failure
- Agent stuck in retry loop on broken tool, not moving forward
- Mid-task tool outage → graceful recovery
- Hardening existing agents vs. cascading failures
- Stale/cached tool out being treated as fresh
In
- Required: Tool list (names + purposes)
- Required: Task to accomplish
- Optional: Known reliability issues / past fail patterns
- Optional: Fail threshold (default: 3 consecutive fails → open)
- Optional: Fail budget per cycle (default: 5 total fails → pause)
- Optional: Half-open probe interval (default: every 3rd attempt post-open)
Do
Step 1: Build Capability Map
Declare each tool's capability + alternatives. Map = foundation for scope reduction → w/o it, fail leaves agent guessing.
capability_map:
- tool: Grep
provides: content search across files
alternatives:
- tool: Bash
method: "rg or grep command"
degradation: "loses Grep's built-in output formatting"
- tool: Read
method: "read suspected files directly"
degradation: "requires knowing which files to check; no broad search"
fallback: "ask the user which files to examine"
- tool: Bash
provides: command execution, build tools, git operations
alternatives: []
fallback: "report commands that need to be run manually"
- tool: Read
provides: file content inspection
alternatives:
- tool: Bash
method: "cat or head command"
degradation: "loses line numbering and truncation safety"
fallback: "ask the user to paste file contents"
- tool: Write
provides: file creation
alternatives:
- tool: Edit
method: "create via full-file edit"
degradation: "requires file to already exist for Edit"
- tool: Bash
method: "echo/cat heredoc"
degradation: "loses Write's atomic file creation"
fallback: "output file contents for the user to save manually"
- tool: WebSearch
provides: external information retrieval
alternatives: []
fallback: "state what information is needed; ask user to provide it"
Each tool, doc:
- Capability (one line)
- Alternative tools (w/ degradation notes)
- Manual fallback when no tool alternative
→ Full map covers every tool agent uses. Each entry has fallback even if no tool alt. Map makes explicit what's implicit: critical tools (no alts) vs. substitutable.
If err: Tool list unclear → start w/ allowed-tools from skill frontmatter. Alts uncertain → mark degradation: "unknown — test before relying on this route" vs. omit.
Step 2: Initialize Circuit Breaker State
State tracker per tool. All tools start CLOSED (healthy).
Circuit Breaker State Table:
+------------+--------+-------------------+------------------+-----------------+
| Tool | State | Consecutive Fails | Last Failure | Last Success |
+------------+--------+-------------------+------------------+-----------------+
| Grep | CLOSED | 0 | — | — |
| Bash | CLOSED | 0 | — | — |
| Read | CLOSED | 0 | — | — |
| Write | CLOSED | 0 | — | — |
| Edit | CLOSED | 0 | — | — |
| WebSearch | CLOSED | 0 | — | — |
+------------+--------+-------------------+------------------+-----------------+
Failure budget: 0 / 5 consumed
State defs:
- CLOSED — Tool healthy. Use normally. Track consecutive fails.
- OPEN — Tool known-broken. Don't call. Route to alts or shrink scope.
- HALF-OPEN — Tool was broken, maybe recovered. Single probe call. Success → CLOSED. Fail → OPEN.
Transitions:
- CLOSED → OPEN: Consecutive fails ≥ threshold (default: 3)
- OPEN → HALF-OPEN: After interval (e.g., every 3rd task step)
- HALF-OPEN → CLOSED: Probe success
- HALF-OPEN → OPEN: Probe fail
→ State table init'd all tools CLOSED, zero fails. Threshold + budget declared.
If err: Can't enumerate tools upfront (dynamic discovery) → init state on first use. Pattern still works → build table incrementally.
Step 3: Implement Call-and-Track Loop
Agent needs tool call → follow decision seq. This = expeditor logic → decides whether to call, not how to execute.
BEFORE each tool call:
1. Check tool state in the circuit breaker table
2. If OPEN:
a. Check if it is time for a half-open probe
- Yes → transition to HALF-OPEN, proceed with probe call
- No → skip this tool, route to alternative (Step 4)
3. If HALF-OPEN:
a. Make one probe call
b. Success → transition to CLOSED, reset consecutive fails to 0
c. Failure → transition to OPEN, increment failure budget
4. If CLOSED:
a. Make the call normally
AFTER each tool call:
1. Success:
- Reset consecutive fails to 0
- Record last success timestamp
2. Failure:
- Increment consecutive fails
- Record last failure timestamp and error message
- Increment failure budget consumed
- If consecutive fails >= threshold:
transition to OPEN
log: "Circuit OPENED for [tool]: [failure count] consecutive failures"
- If failure budget exhausted:
PAUSE — do not continue the task
Report to user (Step 6)
Expeditor NEVER retries failed call immediately. Record fail, check thresholds, move on. Retries via HALF-OPEN probe at later step only.
→ Clear decision loop before + after every tool call. Tool health tracked continuous. Expeditor layer never blocks on failing tool.
If err: Tracking state across calls impractical (stateless exec) → degrade to simpler: count total fails, pause at budget. Three-state breaker = ideal; fail counter = min viable.
Step 4: Route to Alternatives on Open Circuit
Tool OPEN → consult capability map (Step 1), route to best alt.
Routing priority:
- Tool alt, low degradation — Similar capability tool. Note degradation in out.
- Tool alt, high degradation — Big capability loss. Label what's missing.
- Manual fallback — Report what agent can't do, what user needs to provide.
- Scope reduction — No alt + no fallback → drop dependent sub-task (Step 5).
Example routing decision:
Tool needed: Grep (circuit OPEN)
Task: find all files containing "API_KEY"
Route 1: Bash with rg command
→ Degradation: loses Grep's built-in formatting
→ Decision: ACCEPTABLE — use this route
If Bash also OPEN:
Route 2: Read suspected config files directly
→ Degradation: requires guessing which files; no broad search
→ Decision: PARTIAL — try known config paths only
If Read also OPEN:
Route 3: Ask user
→ "I need to find files containing 'API_KEY' but my search
tools are unavailable. Can you run: grep -r 'API_KEY' ."
→ Decision: FALLBACK — user provides the information
If user unavailable:
Route 4: Scope reduction
→ Remove "find API key references" from task scope
→ Document: "SKIPPED: API key search — no tools available"
→ Tool circuit opens → agent transparently routes to alt or degrades scope. Decision + degradation documented in out → user knows what's affected.
If err: Map incomplete (no alts listed) → default scope reduction + report. NEVER silently skip → always doc what + why.
Step 5: Reduce Scope to Achievable Work
Tools OPEN + alts exhausted → shrink task to what working tools can do. Not failure → honest scope mgmt.
Scope reduction proc:
- List remaining sub-tasks
- Each sub-task → check tools required
- All required tools CLOSED or viable alts → keep
- Any required tool OPEN no alt → mark DEFERRED
- Continue w/ reduced scope
- Report deferred at end
Scope Reduction Report:
Original scope: 5 sub-tasks
[x] 1. Read configuration files (Read: CLOSED)
[x] 2. Search for deprecated patterns (Grep: CLOSED)
[ ] 3. Run test suite (Bash: OPEN — no alternative)
[x] 4. Update documentation (Edit: CLOSED)
[ ] 5. Deploy to staging (Bash: OPEN — no alternative)
Reduced scope: 3 sub-tasks achievable
Deferred: 2 sub-tasks require Bash (circuit OPEN)
Recommendation: Complete sub-tasks 1, 2, 4 now.
Sub-tasks 3 and 5 require Bash — will probe on next cycle
or user can run commands manually.
Do NOT attempt deferred. Do NOT retry open-circuited tools hoping they work. Breaker exists to prevent this → trust its state.
→ Clear partition of task → achievable + deferred. Agent finishes achievable, reports deferred w/ reason + unblock path.
If err: Scope reduction removes all sub-tasks (all tools broken) → skip to Step 6 pause-and-report. Agent w/ no working tools must not fake progress.
Step 6: Handle Staleness + Label Data Quality
Tool returns maybe-stale data (cached, old snapshot, prev fetched) → label explicit, not treat as fresh.
Staleness indicators:
- Tool out matches prev call exactly (cache hit?)
- Data timestamps older than current task
- Tool doc mentions caching
- Results contradict other recent observations
Labeling proc:
When presenting potentially stale data:
"[STALE DATA — retrieved at {timestamp}, may not reflect current state]
File contents as of last successful Read:
..."
"[CACHED RESULT — Grep returned identical results to previous call;
filesystem may have changed since]"
"[UNVERIFIED — WebSearch result from {date}; current status unknown]"
NEVER silently present stale as current. User / downstream agent must know data quality.
→ All maybe-stale outs labeled. Fresh not labeled (labels = uncertainty, not confirmation).
If err: Can't determine staleness (no timestamps, no baseline) → note: "[FRESHNESS UNKNOWN — no baseline for comparison]". Uncertainty = info.
Step 7: Enforce Failure Budget
Track total fails across all tools. Budget exhausted → pause + report vs. keep accumulating errs.
Failure Budget Enforcement:
Budget: 5 failures per cycle
Current: 4 / 5 consumed
Failure 1: Bash — "permission denied" (step 3)
Failure 2: Bash — "command not found" (step 3)
Failure 3: Bash — "timeout after 120s" (step 4)
Failure 4: WebSearch — "connection refused" (step 5)
Status: 1 failure remaining before mandatory pause
→ Next tool call proceeds with heightened caution
→ If it fails: PAUSE and generate status report
On budget exhaustion:
FAILURE BUDGET EXHAUSTED — PAUSING
Completed work:
- Sub-task 1: Read configuration files (SUCCESS)
- Sub-task 2: Search for deprecated patterns (SUCCESS)
Incomplete work:
- Sub-task 3: Run test suite (FAILED — Bash circuit OPEN)
- Sub-task 4: Update documentation (NOT ATTEMPTED — paused)
- Sub-task 5: Deploy to staging (NOT ATTEMPTED — paused)
Tool health:
Grep: CLOSED (healthy)
Read: CLOSED (healthy)
Edit: CLOSED (healthy)
Bash: OPEN (3 consecutive failures — permission/command/timeout)
WebSearch: OPEN (1 failure — connection refused)
Failures: 5 / 5 budget consumed
Recommendation:
1. Investigate Bash failures — likely environment issue
2. Check network connectivity for WebSearch
3. Resume from sub-task 4 after resolution
Pause-and-report = breaker in electrical systems → prevents damage accumulating. Agent that keeps calling broken tools wastes ctx, confuses user w/ repeat errs, inconsistent partial results.
→ Agent stops clean on budget exhaust. Report covers done work, incomplete work, tool health, actionable next steps.
If err: Can't generate clean report (state tracking lost) → out whatever avail. Partial report > silent continuation.
Step 8: Separation of Concerns — Expeditor vs. Executor
Valid. orchestration logic (Steps 2-7) cleanly separated from tool exec.
Expeditor (orch) does:
- Track tool health state
- Decide call, skip, probe
- Route to alts on open
- Enforce fail budget
- Generate status reports
Expeditor does NOT:
- Retry failed calls immediately
- Modify call params to work around errs
- Catch + suppress tool errs
- Assume why tool failed
- Exec fallback logic that itself needs tools
Expeditor "cooking" (calling tools to work around other fails) → separation broken. Expeditor routes to alt or shrinks scope, NOT fixes broken tool.
→ Clean boundary orch decisions vs. tool exec. Expeditor described w/o ref to specific tool APIs or err types.
If err: Orch + exec entangled → refactor → extract decision logic to separate step before each tool call. Decision step outs: CALL, SKIP, PROBE, PAUSE. Exec step acts on out.
Step 9: Detect Cascading Failures
Many tools share infra (network, fs, perms) → single root cause trips many breakers at once. Detect correlated pattern vs. treat each breaker indep.
Cascade indicators:
- 3+ tools OPEN in same task step / narrow window
- Fails share common err signature (e.g., "connection refused," "permission denied")
- Tools w/ indep fail history suddenly fail together
Response proc:
- Second breaker opens → check if fail category matches first
- Correlated → flag systemic failure → pause all tool calls, not just broken
- Report suspected root: "Multiple tools failing with [shared pattern] — likely [network/filesystem/permissions] issue"
- Don't probe half-open during systemic fail → probes also fail, waste budget
- Resume probing only after user confirms infra fixed
Backoff compounding: Cascade trips → exponential backoff for half-open probes: probe step 3, then 6, then 12. Cap max interval 20 steps → prevent permanent circuit lock. Stops rapid-fire probes overwhelming recovering system.
→ Correlated fails detected + treated as single systemic event, not N indep trips. Fail budget counts systemic event once, not N times.
If err: Correlation detection impractical (diff err sigs, shared cause) → fallback indep per-tool breakers. System still degrades → just consumes budget faster.
Step 10: Pre-Call Tool Selection Layer
Before circuit breaker loop (Step 3) → optionally valid. tool available + likely succeed. Cuts unnecessary trips from predictable fails.
Pre-call checks:
| Check | Method | Action on failure |
|---|---|---|
| Tool exists | Verify tool is in the allowed-tools list | Skip — do not even attempt |
| MCP server health | Check server process/connection status | Route to alternative immediately |
| Resource availability | Verify target file/URL/endpoint exists | Route or degrade scope |
Decision table:
Pre-call score:
AVAILABLE → proceed to circuit breaker loop (Step 3)
DEGRADED → proceed with caution, lower the failure threshold by 1
UNAVAILABLE → skip tool, route to alternative (Step 4) without consuming budget
Pre-call = advisory, not authoritative. Tool passing pre-call can still fail during exec. Breaker = primary reliability mechanism.
→ Predictable fails (missing tools, unreachable servers) caught before budget consumed. Breaker handles only genuine runtime fails.
If err: Pre-call checks unavail or too much overhead → skip entirely. Step 3 breaker loop handles all fails → pre-call = opt, not req.
Check
- Map covers all tools w/ alts + fallbacks documented
- Breaker state table init'd all tools
- Transitions follow CLOSED → OPEN → HALF-OPEN → CLOSED
- Fail threshold declared explicit (not implicit)
- Alt routing tried before scope reduction
- Scope reduction doc'd w/ deferred + reasons
- Stale data labeled explicit — never fresh
- Fail budget enforced w/ pause-and-report on exhaust
- Expeditor logic does NOT exec tool calls or retry fails
- Status report: done work, incomplete work, tool health
- No silent fails → every skip, deferral, degradation doc'd
- Cascade fails detected when 3+ tools open together
- Systemic fail pauses all probes until infra confirmed recovered
- Pre-call checks (if used) don't consume budget on predictable fails
Traps
- Retry vs. circuit-break: Calling broken tool repeat wastes budget + ctx. 3 consecutive fails = pattern, not bad luck. OPEN it.
- Cooking in expeditor: Orch decides what, not how to fix broken. Expeditor crafting workaround cmds for Bash fails → crossed the boundary.
- Silent scope reduction: Dropping sub-tasks w/o doc → looks complete, isn't. Always report skipped.
- Treat stale as fresh: Cached/prev results maybe not current. Label uncertainty, don't ignore.
- Open circuits too eagerly: Single transient fail shouldn't open. Threshold (default: 3) filters noise from signal.
- Never probe post-open: Permanently open → agent never finds recovery. Half-open probes essential.
- Ignore fail budget: W/o budget → agent accumulates dozens of fails across tools while "progressing" on paper. Budget forces honest checkpoint.
- Cascade backoff multiply: Many tools in dep chain each apply own exp backoff → compound delay grows multiplicatively. Cap total aggregate backoff across chain, not just per tool.
- Stale discovery scores: Step 10 caches tool avail. No invalidation when conditions change → agent skips recovered tool or attempts unavail. Re-check scores after systemic fail event.
→
fail-early-pattern— complementary: fail-early validates in before work; circuit-breaker manages fails during workescalate-issues— budget exhausted / scope reduction significant → escalate to specialist or humanwrite-incident-runbook— doc recurring fail patterns as runbooksassess-context— eval if current approach can adapt when many tools degraded; pairs w/ scope reductiondu-dum— two-clock arch sep'ng observe from decide; complement for cutting observe cost in agent loops
GitHub репозиторий
Похожие навыки
content-collections
МетаЭтот навык предоставляет проверенную в продакшене настройку для Content Collections — TypeScript-ориентированного инструмента, который преобразует файлы Markdown/MDX в типобезопасные коллекции данных с валидацией Zod. Используйте его при создании блогов, сайтов документации или контентных приложений на Vite + React для обеспечения типобезопасности и автоматической проверки содержимого. Он охватывает всё: от настройки плагина Vite и компиляции MDX до оптимизации развертывания и валидации схем.
polymarket
МетаЭтот навык позволяет разработчикам создавать приложения на платформе прогнозных рынков Polymarket, включая интеграцию с API для торговли и получения рыночных данных. Он также обеспечивает потоковую передачу данных в реальном времени через WebSocket для отслеживания текущих сделок и рыночной активности. Используйте его для реализации торговых стратегий или создания инструментов, обрабатывающих обновления рынка в реальном времени.
creating-opencode-plugins
МетаЭтот навык помогает разработчикам создавать плагины OpenCode, которые подключаются к более чем 25 типам событий, таким как команды, файлы и операции LSP. Он предоставляет структуру плагина, спецификации API событий и шаблоны реализации для модулей на JavaScript/TypeScript. Используйте его, когда вам нужно перехватывать, отслеживать или расширять жизненный цикл ассистента OpenCode AI с помощью пользовательской событийно-ориентированной логики.
sglang
МетаSGLang — это высокопроизводительный фреймворк для обслуживания больших языковых моделей (LLM), специализирующийся на быстрой структурированной генерации JSON, regex и рабочих процессов агентов с использованием кэширования префиксов RadixAttention. Он обеспечивает значительно более высокую скорость вывода, особенно для задач с повторяющимися префиксами, что делает его идеальным для сложных структурированных результатов и многократных диалогов. Выбирайте SGLang вместо альтернатив, таких как vLLM, когда вам требуется ограниченное декодирование или вы создаете приложения с интенсивным совместным использованием префиксов.
