circuit-breaker-pattern
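About
This skill implements the circuit breaker pattern to prevent cascading tool failures in agent workflows. It tracks tool health, manages state transitions, and routes calls to alternatives through a capability map when a tool fails. This lets you build fault-tolerant agents that handle unreliable tools gracefully and recover when a task is interrupted.
Quick install
Claude Code
Recommended: npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/circuit-breaker-pattern
Copy and paste one of these commands to install the skill in Claude Code.
Documentation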
Circuit Breaker Pattern
Graceful degradation when tools fail. Agent w/ 5 tools, 1 broken → don't fail whole → spot broken tool, stop calling, shrink scope to achievable, report honest what skipped. Codifies circuit breaker from distributed systems → agentic tool orchestration.
Core insight from kirapixelads' "Kitchen Fire Problem": expeditor (orch layer) must NOT cook. Separate what to attempt from how → orchestrator stays out of broken tool's retry loop.
Use When
- Building agents w/ many tools, varying reliability
- Fault-tolerant workflows → partial > total failure
- Agent stuck in retry loop on broken tool, not moving forward
- Mid-task tool outage → graceful recovery
- Hardening existing agents vs. cascading failures
- Stale/cached tool output being treated as fresh
Inputs
- Required: Tool list (names + purposes)
- Required: Task to accomplish
- Optional: Known reliability issues / past fail patterns
- Optional: Fail threshold (default: 3 consecutive fails → open)
- Optional: Fail budget per cycle (default: 5 total fails → pause)
- Optional: Half-open probe interval (default: every 3rd attempt post-open)
Do
Step 1: Build Capability Map
Declare each tool's capability + alternatives. Map = foundation for scope reduction → w/o it, fail leaves agent guessing.
capability_map:
  - tool: Grep
    provides: content search across files
    alternatives:
      - tool: Bash
        method: "rg or grep command"
        degradation: "loses Grep's built-in output formatting"
      - tool: Read
        method: "read suspected files directly"
        degradation: "requires knowing which files to check; no broad search"
    fallback: "ask the user which files to examine"
  - tool: Bash
    provides: command execution, build tools, git operations
    alternatives: []
    fallback: "report commands that need to be run manually"
  - tool: Read
    provides: file content inspection
    alternatives:
      - tool: Bash
        method: "cat or head command"
        degradation: "loses line numbering and truncation safety"
    fallback: "ask the user to paste file contents"
  - tool: Write
    provides: file creation
    alternatives:
      - tool: Edit
        method: "create via full-file edit"
        degradation: "requires file to already exist for Edit"
      - tool: Bash
        method: "echo/cat heredoc"
        degradation: "loses Write's atomic file creation"
    fallback: "output file contents for the user to save manually"
  - tool: WebSearch
    provides: external information retrieval
    alternatives: []
    fallback: "state what information is needed; ask user to provide it"
For each tool, document:
- Capability (one line)
- Alternative tools (w/ degradation notes)
- Manual fallback when no tool alternative
→ Full map covers every tool agent uses. Each entry has fallback even if no tool alt. Map makes explicit what's implicit: critical tools (no alts) vs. substitutable.
If err: Tool list unclear → start w/ allowed-tools from skill frontmatter. Alts uncertain → mark degradation: "unknown — test before relying on this route" vs. omit.
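A minimal sketch of the capability map as typed data, assuming Python as the host language. The class names `Alternative` and `Capability` and the helper `critical_tools` are hypothetical, not part of the skill; only two entries from the map above are shown.

```python
from dataclasses import dataclass, field


@dataclass
class Alternative:
    tool: str
    method: str
    degradation: str


@dataclass
class Capability:
    provides: str
    fallback: str  # manual fallback; every entry has one, even with no tool alt
    alternatives: list = field(default_factory=list)


# Two entries copied from the map above; the other tools follow the same shape.
CAPABILITY_MAP = {
    "Grep": Capability(
        provides="content search across files",
        fallback="ask the user which files to examine",
        alternatives=[
            Alternative("Bash", "rg or grep command",
                        "loses Grep's built-in output formatting"),
            Alternative("Read", "read suspected files directly",
                        "requires knowing which files to check; no broad search"),
        ],
    ),
    "Bash": Capability(
        provides="command execution, build tools, git operations",
        fallback="report commands that need to be run manually",
    ),
}


def critical_tools(cap_map):
    """Tools with no tool alternative: losing one forces fallback or scope reduction."""
    return sorted(name for name, cap in cap_map.items() if not cap.alternatives)
```

Calling `critical_tools(CAPABILITY_MAP)` surfaces the critical vs. substitutable split that the map is meant to make explicit.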
Step 2: Initialize Circuit Breaker State
State tracker per tool. All tools start CLOSED (healthy).
Circuit Breaker State Table:
+------------+--------+-------------------+------------------+-----------------+
| Tool | State | Consecutive Fails | Last Failure | Last Success |
+------------+--------+-------------------+------------------+-----------------+
| Grep | CLOSED | 0 | — | — |
| Bash | CLOSED | 0 | — | — |
| Read | CLOSED | 0 | — | — |
| Write | CLOSED | 0 | — | — |
| Edit | CLOSED | 0 | — | — |
| WebSearch | CLOSED | 0 | — | — |
+------------+--------+-------------------+------------------+-----------------+
Failure budget: 0 / 5 consumed
State defs:
- CLOSED — Tool healthy. Use normally. Track consecutive fails.
- OPEN — Tool known-broken. Don't call. Route to alts or shrink scope.
- HALF-OPEN — Tool was broken, maybe recovered. Single probe call. Success → CLOSED. Fail → OPEN.
Transitions:
- CLOSED → OPEN: Consecutive fails ≥ threshold (default: 3)
- OPEN → HALF-OPEN: After interval (e.g., every 3rd task step)
- HALF-OPEN → CLOSED: Probe success
- HALF-OPEN → OPEN: Probe fail
→ State table init'd all tools CLOSED, zero fails. Threshold + budget declared.
If err: Can't enumerate tools upfront (dynamic discovery) → init state on first use. Pattern still works → build table incrementally.
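The state definitions and transitions above can be sketched as a small lookup table. This is an illustration under assumptions: the event labels (`threshold_reached`, `probe_due`, `probe_success`, `probe_fail`) and the function names are hypothetical.

```python
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF-OPEN"


# (current state, event) -> next state. Event names are hypothetical labels
# for the four transitions listed above.
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.OPEN, "probe_due"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_success"): State.CLOSED,
    (State.HALF_OPEN, "probe_fail"): State.OPEN,
}


def next_state(state, event):
    """Return the next state; pairs not in the table leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def init_table(tools):
    """Every tool starts CLOSED with zero consecutive fails and no timestamps."""
    return {
        tool: {"state": State.CLOSED, "consecutive_fails": 0,
               "last_failure": None, "last_success": None}
        for tool in tools
    }
```

For dynamic discovery, the same row shape can be added on first use instead of calling `init_table` upfront.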
Step 3: Implement Call-and-Track Loop
Agent needs tool call → follow decision seq. This = expeditor logic → decides whether to call, not how to execute.
BEFORE each tool call:
  1. Check tool state in the circuit breaker table
  2. If OPEN:
     a. Check if it is time for a half-open probe
        - Yes → transition to HALF-OPEN, proceed with probe call
        - No → skip this tool, route to alternative (Step 4)
  3. If HALF-OPEN:
     a. Make one probe call
     b. Success → transition to CLOSED, reset consecutive fails to 0
     c. Failure → transition to OPEN, increment failure budget
  4. If CLOSED:
     a. Make the call normally
AFTER each tool call:
  1. Success:
     - Reset consecutive fails to 0
     - Record last success timestamp
  2. Failure:
     - Increment consecutive fails
     - Record last failure timestamp and error message
     - Increment failure budget consumed
     - If consecutive fails >= threshold:
         transition to OPEN
         log: "Circuit OPENED for [tool]: [failure count] consecutive failures"
     - If failure budget exhausted:
         PAUSE: do not continue the task
         Report to user (Step 7)
Expeditor NEVER retries failed call immediately. Record fail, check thresholds, move on. Retries via HALF-OPEN probe at later step only.
→ Clear decision loop before + after every tool call. Tool health tracked continuous. Expeditor layer never blocks on failing tool.
If err: Tracking state across calls impractical (stateless exec) → degrade to simpler: count total fails, pause at budget. Three-state breaker = ideal; fail counter = min viable.
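One possible shape for the before/after loop, as a sketch rather than a reference implementation. `Expeditor`, `ToolState`, and the `skips_since_open` counter are hypothetical names; the probe interval is assumed to count skipped attempts on an OPEN tool, matching the "every 3rd attempt post-open" default.

```python
from dataclasses import dataclass, field

THRESHOLD, BUDGET, PROBE_EVERY = 3, 5, 3  # defaults from the inputs list


@dataclass
class ToolState:
    state: str = "CLOSED"
    consecutive_fails: int = 0
    skips_since_open: int = 0  # hypothetical counter used to schedule probes


@dataclass
class Expeditor:
    tools: dict = field(default_factory=dict)
    budget_used: int = 0

    def decide(self, name):
        """BEFORE a call: return CALL, PROBE, SKIP, or PAUSE. Never runs the tool."""
        if self.budget_used >= BUDGET:
            return "PAUSE"
        t = self.tools.setdefault(name, ToolState())
        if t.state == "CLOSED":
            return "CALL"
        if t.state == "HALF-OPEN":
            return "PROBE"
        t.skips_since_open += 1
        if t.skips_since_open >= PROBE_EVERY:
            t.state, t.skips_since_open = "HALF-OPEN", 0
            return "PROBE"
        return "SKIP"

    def record(self, name, ok):
        """AFTER a call: update health state. No immediate retry happens here."""
        t = self.tools.setdefault(name, ToolState())
        if ok:
            t.state, t.consecutive_fails = "CLOSED", 0
            return
        t.consecutive_fails += 1
        self.budget_used += 1
        if t.state == "HALF-OPEN" or t.consecutive_fails >= THRESHOLD:
            t.state = "OPEN"
```

The executor calls `decide` first, acts on the returned verdict, then reports the outcome through `record`. In a stateless environment, keeping only `budget_used` gives the minimum viable fail counter.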
Step 4: Route to Alternatives on Open Circuit
Tool OPEN → consult capability map (Step 1), route to best alt.
Routing priority:
- Tool alt, low degradation: Similar capability tool. Note degradation in output.
- Tool alt, high degradation: Big capability loss. Label what's missing.
- Manual fallback: Report what agent can't do, what user needs to provide.
- Scope reduction: No alt + no fallback → drop dependent sub-task (Step 5).
Example routing decision:
Tool needed: Grep (circuit OPEN)
Task: find all files containing "API_KEY"
Route 1: Bash with rg command
→ Degradation: loses Grep's built-in formatting
→ Decision: ACCEPTABLE — use this route
If Bash also OPEN:
Route 2: Read suspected config files directly
→ Degradation: requires guessing which files; no broad search
→ Decision: PARTIAL — try known config paths only
If Read also OPEN:
Route 3: Ask user
→ "I need to find files containing 'API_KEY' but my search
tools are unavailable. Can you run: grep -r 'API_KEY' ."
→ Decision: FALLBACK — user provides the information
If user unavailable:
Route 4: Scope reduction
→ Remove "find API key references" from task scope
→ Document: "SKIPPED: API key search — no tools available"
→ Tool circuit opens → agent transparently routes to alt or degrades scope. Decision + degradation documented in output → user knows what's affected.
If err: Map incomplete (no alts listed) → default scope reduction + report. NEVER silently skip → always doc what + why.
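The routing priority can be sketched as a single function. Assumptions: the map is a plain dict, alternatives are listed from lowest to highest degradation, and the names `route` and `GREP_ENTRY` are hypothetical.

```python
def route(tool, cap_map, open_tools, user_available=True):
    """Pick a route for a tool whose circuit is OPEN.

    Returns (kind, target, note). Alternatives are tried in listed order,
    which is assumed to run from lowest to highest degradation.
    """
    entry = cap_map.get(tool, {})
    for alt in entry.get("alternatives", []):
        if alt["tool"] not in open_tools:
            return ("ALTERNATIVE", alt["tool"], alt["degradation"])
    if user_available and entry.get("fallback"):
        return ("FALLBACK", "user", entry["fallback"])
    # No alternative and no fallback: never skip silently, always say what and why.
    return ("SCOPE_REDUCTION", None, f"SKIPPED: {tool} unavailable, no route left")


# Grep entry from the capability map in Step 1, as a plain dict.
GREP_ENTRY = {
    "alternatives": [
        {"tool": "Bash", "degradation": "loses Grep's built-in output formatting"},
        {"tool": "Read", "degradation": "requires knowing which files to check; no broad search"},
    ],
    "fallback": "ask the user which files to examine",
}
```

A tool missing from the map falls straight through to `SCOPE_REDUCTION` with a note, which matches the default for an incomplete map.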
Step 5: Reduce Scope to Achievable Work
Tools OPEN + alts exhausted → shrink task to what working tools can do. Not failure → honest scope mgmt.
Scope reduction proc:
- List remaining sub-tasks
- Each sub-task → check tools required
- All required tools CLOSED or viable alts → keep
- Any required tool OPEN no alt → mark DEFERRED
- Continue w/ reduced scope
- Report deferred at end
Scope Reduction Report:
Original scope: 5 sub-tasks
[x] 1. Read configuration files (Read: CLOSED)
[x] 2. Search for deprecated patterns (Grep: CLOSED)
[ ] 3. Run test suite (Bash: OPEN — no alternative)
[x] 4. Update documentation (Edit: CLOSED)
[ ] 5. Deploy to staging (Bash: OPEN — no alternative)
Reduced scope: 3 sub-tasks achievable
Deferred: 2 sub-tasks require Bash (circuit OPEN)
Recommendation: Complete sub-tasks 1, 2, 4 now.
Sub-tasks 3 and 5 require Bash — will probe on next cycle
or user can run commands manually.
Do NOT attempt deferred. Do NOT retry open-circuited tools hoping they work. Breaker exists to prevent this → trust its state.
→ Clear partition of task → achievable + deferred. Agent finishes achievable, reports deferred w/ reason + unblock path.
If err: Scope reduction removes all sub-tasks (all tools broken) → skip to Step 7 pause-and-report. Agent w/ no working tools must not fake progress.
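A sketch of the scope reduction procedure, using the five sub-tasks from the report above. `partition_scope` and its argument shapes are hypothetical; a required tool is treated as usable if it is not OPEN or has at least one alternative that is not OPEN.

```python
def partition_scope(subtasks, open_tools, alternatives):
    """Split sub-tasks into (achievable, deferred).

    subtasks: list of (name, [required tools]).
    alternatives: tool name -> list of alternative tool names.
    """
    def usable(tool):
        return tool not in open_tools or any(
            alt not in open_tools for alt in alternatives.get(tool, []))

    achievable, deferred = [], []
    for name, required in subtasks:
        blocked = [tool for tool in required if not usable(tool)]
        if blocked:
            deferred.append((name, blocked))  # keep the reason for the report
        else:
            achievable.append(name)
    return achievable, deferred


SUBTASKS = [
    ("Read configuration files", ["Read"]),
    ("Search for deprecated patterns", ["Grep"]),
    ("Run test suite", ["Bash"]),
    ("Update documentation", ["Edit"]),
    ("Deploy to staging", ["Bash"]),
]
achievable, deferred = partition_scope(
    SUBTASKS, open_tools={"Bash"}, alternatives={"Bash": []})
```

An empty `achievable` list is the signal to stop and report instead of continuing.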
Step 6: Handle Staleness + Label Data Quality
Tool returns maybe-stale data (cached, old snapshot, prev fetched) → label explicit, not treat as fresh.
Staleness indicators:
- Tool output matches prev call exactly (possible cache hit)
- Data timestamps older than current task
- Tool doc mentions caching
- Results contradict other recent observations
Labeling proc:
When presenting potentially stale data:
"[STALE DATA — retrieved at {timestamp}, may not reflect current state]
File contents as of last successful Read:
..."
"[CACHED RESULT — Grep returned identical results to previous call;
filesystem may have changed since]"
"[UNVERIFIED — WebSearch result from {date}; current status unknown]"
NEVER silently present stale as current. User / downstream agent must know data quality.
→ All maybe-stale outputs labeled. Fresh data not labeled (labels = uncertainty, not confirmation).
If err: Can't determine staleness (no timestamps, no baseline) → note: "[FRESHNESS UNKNOWN — no baseline for comparison]". Uncertainty = info.
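A sketch of the labeling procedure covering three of the cases above. `label_output` and its parameters are hypothetical, and the label wording is adapted from the templates rather than copied; the check order (unknown, then stale, then cached) is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def label_output(output: str, retrieved_at: Optional[datetime],
                 task_started: datetime,
                 previous_output: Optional[str] = None) -> str:
    """Prefix output with a data-quality label; fresh data is returned unlabeled."""
    if retrieved_at is None:
        return "[FRESHNESS UNKNOWN - no baseline for comparison]\n" + output
    if retrieved_at < task_started:
        stamp = retrieved_at.isoformat()
        return (f"[STALE DATA - retrieved at {stamp}, "
                "may not reflect current state]\n" + output)
    if previous_output is not None and output == previous_output:
        return ("[CACHED RESULT - identical to previous call; "
                "source may have changed since]\n" + output)
    return output


# Example: data fetched before the task started gets the STALE DATA label.
started = datetime(2024, 1, 2, tzinfo=timezone.utc)
fetched = datetime(2024, 1, 1, tzinfo=timezone.utc)
labeled = label_output("port: 8080", fetched, started)
```

Contradiction with other recent observations is not covered here, since it needs task-specific comparison logic.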
Step 7: Enforce Failure Budget
Track total fails across all tools. Budget exhausted → pause + report vs. keep accumulating errs.
Failure Budget Enforcement:
Budget: 5 failures per cycle
Current: 4 / 5 consumed
Failure 1: Bash — "permission denied" (step 3)
Failure 2: Bash — "command not found" (step 3)
Failure 3: Bash — "timeout after 120s" (step 4)
Failure 4: WebSearch — "connection refused" (step 5)
Status: 1 failure remaining before mandatory pause
→ Next tool call proceeds with heightened caution
→ If it fails: PAUSE and generate status report
On budget exhaustion:
FAILURE BUDGET EXHAUSTED — PAUSING
Completed work:
- Sub-task 1: Read configuration files (SUCCESS)
- Sub-task 2: Search for deprecated patterns (SUCCESS)
Incomplete work:
- Sub-task 3: Run test suite (FAILED — Bash circuit OPEN)
- Sub-task 4: Update documentation (NOT ATTEMPTED — paused)
- Sub-task 5: Deploy to staging (NOT ATTEMPTED — paused)
Tool health:
Grep: CLOSED (healthy)
Read: CLOSED (healthy)
Edit: CLOSED (healthy)
Bash: OPEN (3 consecutive failures — permission/command/timeout)
WebSearch: OPEN (1 failure — connection refused)
Failures: 5 / 5 budget consumed
Recommendation:
1. Investigate Bash failures — likely environment issue
2. Check network connectivity for WebSearch
3. Resume from sub-task 4 after resolution
Pause-and-report = breaker in electrical systems → prevents damage accumulating. Agent that keeps calling broken tools wastes ctx, confuses user w/ repeat errs, inconsistent partial results.
→ Agent stops clean on budget exhaust. Report covers done work, incomplete work, tool health, actionable next steps.
If err: Can't generate clean report (state tracking lost) → output whatever state is available. Partial report > silent continuation.
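Budget enforcement can be sketched as a small counter that remembers each failure for the final report. `FailureBudget` and its method names are hypothetical; the four recorded failures are the ones listed in the enforcement example above.

```python
from dataclasses import dataclass, field


@dataclass
class FailureBudget:
    limit: int = 5                                # default: 5 total fails per cycle
    failures: list = field(default_factory=list)  # (tool, error, step) tuples

    @property
    def exhausted(self) -> bool:
        return len(self.failures) >= self.limit

    def record(self, tool: str, error: str, step: int) -> bool:
        """Log one failure; True means the caller must pause and report."""
        self.failures.append((tool, error, step))
        return self.exhausted

    def status(self) -> str:
        used = len(self.failures)
        if self.exhausted:
            return f"FAILURE BUDGET EXHAUSTED: {used} / {self.limit} consumed, PAUSING"
        left = self.limit - used
        return f"{used} / {self.limit} consumed, {left} remaining before mandatory pause"


budget = FailureBudget()
for tool, err, step in [("Bash", "permission denied", 3),
                        ("Bash", "command not found", 3),
                        ("Bash", "timeout after 120s", 4),
                        ("WebSearch", "connection refused", 5)]:
    budget.record(tool, err, step)
```

Because `failures` keeps the tool, error, and step, the pause report can be assembled from it even if other state tracking is lost.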
Step 8: Separation of Concerns — Expeditor vs. Executor
Validate that orchestration logic (Steps 2-7) is cleanly separated from tool exec.
Expeditor (orch) does:
- Track tool health state
- Decide call, skip, probe
- Route to alts on open
- Enforce fail budget
- Generate status reports
Expeditor does NOT:
- Retry failed calls immediately
- Modify call params to work around errs
- Catch + suppress tool errs
- Assume why tool failed
- Exec fallback logic that itself needs tools
Expeditor "cooking" (calling tools to work around other fails) → separation broken. Expeditor routes to alt or shrinks scope, NOT fixes broken tool.
→ Clean boundary orch decisions vs. tool exec. Expeditor described w/o ref to specific tool APIs or err types.
If err: Orch + exec entangled → refactor → extract decision logic to separate step before each tool call. Decision step outputs one of: CALL, SKIP, PROBE, PAUSE. Exec step acts on that output.
Step 9: Detect Cascading Failures
Many tools share infra (network, fs, perms) → single root cause trips many breakers at once. Detect correlated pattern vs. treat each breaker indep.
Cascade indicators:
- 3+ tools OPEN in same task step / narrow window
- Fails share common err signature (e.g., "connection refused," "permission denied")
- Tools w/ indep fail history suddenly fail together
Response proc:
- Second breaker opens → check if fail category matches first
- Correlated → flag systemic failure → pause all tool calls, not just broken
- Report suspected root: "Multiple tools failing with [shared pattern] — likely [network/filesystem/permissions] issue"
- Don't probe half-open during systemic fail → probes also fail, waste budget
- Resume probing only after user confirms infra fixed
Backoff compounding: Cascade trips → exponential backoff for half-open probes: probe step 3, then 6, then 12. Cap max interval 20 steps → prevent permanent circuit lock. Stops rapid-fire probes overwhelming recovering system.
→ Correlated fails detected + treated as single systemic event, not N indep trips. Fail budget counts systemic event once, not N times.
If err: Correlation detection impractical (diff err sigs, shared cause) → fallback indep per-tool breakers. System still degrades → just consumes budget faster.
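A sketch of two pieces from this step: the capped doubling probe schedule and a simple correlation check on error signatures. Both function names are hypothetical. The sketch assumes the "3, 6, 12" sequence describes intervals between probes, and that two distinct tools opening with the same signature within `window` steps counts as correlated.

```python
from collections import defaultdict


def probe_schedule(first=3, cap=20, count=5):
    """Intervals (in task steps) before each successive probe: doubling, capped."""
    intervals, current = [], first
    for _ in range(count):
        intervals.append(min(current, cap))
        current *= 2
    return intervals


def detect_systemic(open_events, window=1):
    """open_events: (tool, step, error_signature) for each circuit that opened.

    Returns the shared signature if two distinct tools opened with it within
    `window` steps of each other (comparing time-adjacent events), else None.
    """
    by_signature = defaultdict(list)
    for tool, step, signature in open_events:
        by_signature[signature].append((step, tool))
    for signature, hits in by_signature.items():
        hits.sort()
        for (step_a, tool_a), (step_b, tool_b) in zip(hits, hits[1:]):
            if tool_a != tool_b and step_b - step_a <= window:
                return signature
    return None

# probe_schedule() -> [3, 6, 12, 20, 20]
```

When `detect_systemic` returns a signature, the caller would pause all probing and count the event once against the budget, as described above.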
Step 10: Pre-Call Tool Selection Layer
Before circuit breaker loop (Step 3) → optionally validate that tool is available + likely to succeed. Cuts unnecessary trips from predictable fails.
Pre-call checks:
| Check | Method | Action on failure |
|---|---|---|
| Tool exists | Verify tool is in the allowed-tools list | Skip — do not even attempt |
| MCP server health | Check server process/connection status | Route to alternative immediately |
| Resource availability | Verify target file/URL/endpoint exists | Route or degrade scope |
Decision table:
Pre-call score:
AVAILABLE → proceed to circuit breaker loop (Step 3)
DEGRADED → proceed with caution, lower the failure threshold by 1
UNAVAILABLE → skip tool, route to alternative (Step 4) without consuming budget
Pre-call = advisory, not authoritative. Tool passing pre-call can still fail during exec. Breaker = primary reliability mechanism.
→ Predictable fails (missing tools, unreachable servers) caught before budget consumed. Breaker handles only genuine runtime fails.
If err: Pre-call checks unavail or too much overhead → skip entirely. Step 3 breaker loop handles all fails → pre-call = opt, not req.
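The pre-call layer can be sketched as a scoring function plus a mapping to the next action. The source does not say what produces a DEGRADED score, so this sketch assumes: any failed check gives UNAVAILABLE, any inconclusive check gives DEGRADED, and all checks passing gives AVAILABLE. Function names and check keys are hypothetical.

```python
def pre_call_score(checks):
    """checks: check name -> True (pass), False (fail), None (inconclusive).

    Assumed mapping: any fail -> UNAVAILABLE, any inconclusive -> DEGRADED.
    """
    results = list(checks.values())
    if any(result is False for result in results):
        return "UNAVAILABLE"
    if any(result is None for result in results):
        return "DEGRADED"
    return "AVAILABLE"


def next_action(score, base_threshold=3):
    """Map a score to (action, failure threshold for the breaker loop)."""
    if score == "UNAVAILABLE":
        # Skip the tool and route to an alternative; no budget is consumed.
        return ("ROUTE_TO_ALTERNATIVE", None)
    if score == "DEGRADED":
        # Proceed with caution: lower the threshold by 1, never below 1.
        return ("PROCEED", max(1, base_threshold - 1))
    return ("PROCEED", base_threshold)


score = pre_call_score({"tool_exists": True, "server_health": None,
                        "resource_available": True})
```

The score stays advisory: a tool that scores AVAILABLE still goes through the Step 3 loop and can still fail there.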
Check
- Map covers all tools w/ alts + fallbacks documented
- Breaker state table init'd all tools
- Transitions follow CLOSED → OPEN → HALF-OPEN → CLOSED
- Fail threshold declared explicit (not implicit)
- Alt routing tried before scope reduction
- Scope reduction doc'd w/ deferred + reasons
- Stale data labeled explicitly, never presented as fresh
- Fail budget enforced w/ pause-and-report on exhaust
- Expeditor logic does NOT exec tool calls or retry fails
- Status report: done work, incomplete work, tool health
- No silent fails → every skip, deferral, degradation doc'd
- Cascade fails detected when 3+ tools open together
- Systemic fail pauses all probes until infra confirmed recovered
- Pre-call checks (if used) don't consume budget on predictable fails
Traps
- Retry vs. circuit-break: Calling broken tool repeat wastes budget + ctx. 3 consecutive fails = pattern, not bad luck. OPEN it.
- Cooking in expeditor: Orch decides what, not how to fix broken. Expeditor crafting workaround cmds for Bash fails → crossed the boundary.
- Silent scope reduction: Dropping sub-tasks w/o doc → looks complete, isn't. Always report skipped.
- Treat stale as fresh: Cached/prev results maybe not current. Label uncertainty, don't ignore.
- Open circuits too eagerly: Single transient fail shouldn't open. Threshold (default: 3) filters noise from signal.
- Never probe post-open: Permanently open → agent never finds recovery. Half-open probes essential.
- Ignore fail budget: W/o budget → agent accumulates dozens of fails across tools while "progressing" on paper. Budget forces honest checkpoint.
- Cascade backoff multiply: Many tools in dep chain each apply own exp backoff → compound delay grows multiplicatively. Cap total aggregate backoff across chain, not just per tool.
- Stale discovery scores: Step 10 caches tool avail. No invalidation when conditions change → agent skips recovered tool or attempts unavail. Re-check scores after systemic fail event.
Related Skills
- fail-early-pattern: complementary; fail-early validates input before work, circuit-breaker manages fails during work
- escalate-issues: budget exhausted / scope reduction significant → escalate to specialist or human
- write-incident-runbook: doc recurring fail patterns as runbooks
- assess-context: eval if current approach can adapt when many tools degraded; pairs w/ scope reduction
- du-dum: two-clock arch separating observe from decide; complement for cutting observe cost in agent loops
