manage-token-budget
About
This skill provides tools to monitor and control token usage in long-running AI agent workflows. It implements cost tracking, budget enforcement, and emergency context pruning to prevent excessive accumulation. Use it to add cost guardrails and prevent API spending spikes in autonomous agent systems.
Quick Install
Claude Code
Recommended: npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/manage-token-budget
Copy and paste the command into Claude Code to install the skill.
Skill Documentation
Manage Token Budget
Control cost + context footprint of agentic systems. Track tokens per cycle, audit what consumes context, enforce budget caps, prune low-value context under pressure, route through metadata before loading full procedures. Core principle: every token in context window must earn its place. Tokens informing decisions stay; tokens occupying space without influencing output get pruned.
Community evidence: 37-hour autonomous session cost $13.74 from 30-minute heartbeat interval combined with verbose system instructions + unchecked context accumulation. Fix: rewrote heartbeat to 4-hour intervals, switched to notification-only mode, eliminated feed browsing from loop. This skill codifies patterns preventing such incidents.
When to Use
- Running long-lived agent loops (heartbeats, polling, autonomous workflows) where costs compound
- Context windows growing unpredictably between cycles
- API costs spiked beyond baselines, post-mortem needed
- Designing new agentic workflow, want cost guardrails built in from start
- After cost incident, audit what went wrong, prevent recurrence
- System prompts, memory files, tool schemas grown large enough to dominate context window
Inputs
- Required: Agentic system or workflow to budget (running or planned)
- Required: Budget ceiling (dollar per period, or token limit per cycle)
- Optional: Current cost data (API logs, billing dashboard exports)
- Optional: Context window size of target model (default: check model docs)
- Optional: Acceptable degradation policy (what can drop when limits hit)
Steps
Step 1: Set Per-Cycle Cost Tracking
Instrument agent loop to log token usage at every execution boundary.
For each cycle (heartbeat, poll, task), capture:
- Input tokens: system prompt + memory + tool schemas + conversation history + new user/system content
- Output tokens: model response including tool calls
- Total cost: input tokens x input price + output tokens x output price
- Cycle timestamp: when cycle ran
- Cycle trigger: what initiated (timer, event, user action)
Store these in structured log (JSON lines, CSV, database) — not in context window itself:
{"cycle": 47, "ts": "2026-03-12T14:30:00Z", "trigger": "heartbeat",
"input_tokens": 18420, "output_tokens": 2105, "cost_usd": 0.0891,
"cumulative_cost_usd": 3.42}
If system has no instrumentation, estimate from API billing:
- Total cost / number of cycles = average cost per cycle
- Compare against expected baseline (model pricing x expected context size)
Got: Log shows per-cycle token counts + costs, with granularity to identify expensive cycles + why. Log lives outside context window.
If fail: Exact token counts unavailable (some APIs do not return usage metadata)? Use billing dashboard for averages. Even coarse tracking (daily cost / daily cycle count) reveals trends. No tracking possible? Go to Step 2, work from context audit — estimate costs from context size.
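The per-cycle capture in this step can be sketched in Python as below. This is a minimal illustration, not a reference implementation: the prices are placeholder values, and `log_cycle`, `cycle_cost_usd`, and the log path are hypothetical names. It assumes your API client already reports input and output token counts for each call.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Placeholder prices in USD per million tokens (assumption); use your model's real rates.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def cycle_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Total cost = input tokens x input price + output tokens x output price."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


def log_cycle(log_path: Path, cycle: int, trigger: str,
              input_tokens: int, output_tokens: int,
              cumulative_cost_usd: float) -> dict:
    """Append one JSON line for this cycle to a file outside the context window."""
    cost = cycle_cost_usd(input_tokens, output_tokens)
    record = {
        "cycle": cycle,
        "ts": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "trigger": trigger,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 4),
        "cumulative_cost_usd": round(cumulative_cost_usd + cost, 4),
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

The caller keeps the running total (the previous record's `cumulative_cost_usd`) and passes it in, so the agent never has to re-read the log to know where it stands.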
Step 2: Audit Context Window
Measure what occupies context window, rank consumers by size.
Decompose context into components, measure each:
- System prompt: base instructions, CLAUDE.md content, personality directives
- Memory: MEMORY.md, topic files loaded via auto-memory
- Tool schemas: MCP server tool definitions, function calling schemas
- Skill procedures: full SKILL.md content loaded for active skills
- Conversation history: prior turns in current session
- Dynamic content: tool outputs, file contents, search results from current cycle
Produce context budget table:
Context Budget Audit:
+------------------------+--------+------+-----------------------------------+
| Component              | Tokens | %    | Notes                             |
+------------------------+--------+------+-----------------------------------+
| System prompt          | 4,200  | 21%  | Includes CLAUDE.md chain          |
| Memory (auto-loaded)   | 3,800  | 19%  | MEMORY.md + 4 topic files         |
| Tool schemas           | 2,600  | 13%  | 3 MCP servers, 47 tools           |
| Active skill procedure | 1,900  | 9%   | Full SKILL.md loaded              |
| Conversation history   | 5,100  | 25%  | 12 prior turns                    |
| Current cycle content  | 2,400  | 12%  | Tool outputs from this cycle      |
+------------------------+--------+------+-----------------------------------+
| TOTAL                  | 20,000 | 100% | Model limit: 200,000              |
| Remaining headroom     |180,000 |      |                                   |
+------------------------+--------+------+-----------------------------------+
Flag components disproportionately large vs decision-making value. 4,000-token memory file current task never references is pure overhead.
Got: Ranked table shows each context consumer, size, percent of window. At least one component stands out as candidate for reduction — most often conversation history or verbose tool outputs.
If fail: Exact token counts per component hard to get? Use character count / 4 as rough approximation for English. For structured data (JSON, YAML), character count / 3. Goal is relative ranking, not exact measurement.
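A rough audit along these lines can be sketched in Python. It uses only the character-count approximations stated above (chars / 4 for English prose, chars / 3 for structured data), so the output is good for relative ranking only. The function names and the shape of the `components` input are assumptions made for illustration.

```python
def estimate_tokens(text: str, structured: bool = False) -> int:
    """Rough estimate: chars / 4 for English prose, chars / 3 for JSON or YAML."""
    return len(text) // (3 if structured else 4)


def audit_context(components: dict[str, tuple[str, bool]]) -> list[tuple[str, int, float]]:
    """Return (name, tokens, percent of total) rows, largest consumer first.

    components maps a component name to (text, is_structured).
    """
    sizes = {name: estimate_tokens(text, structured)
             for name, (text, structured) in components.items()}
    total = sum(sizes.values()) or 1  # avoid dividing by zero on empty input
    rows = [(name, tokens, 100.0 * tokens / total) for name, tokens in sizes.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Feed it the actual strings that get assembled into the prompt (system prompt, memory files, serialized tool schemas, history), then read the top rows as pruning candidates.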
Step 3: Set Budget Caps with Enforcement Policies
Define hard + soft limits, specify what happens when each reached.
- Soft limit (warning threshold): typically 60-75% of hard limit. When hit:
  - Log warning with current usage + remaining budget
  - Begin voluntary pruning (Step 4) on lowest-value context
  - Reduce cycle frequency if applicable (e.g., heartbeat interval 30min → 2h)
  - Continue with degraded context
- Hard limit (stop threshold): absolute max spend or context size. When hit:
  - Halt autonomous operation immediately
  - Send alert to human operator (notification, email, log entry)
  - Preserve summary of current state for resumption
  - Do not start another cycle until human reviews + authorizes
- Per-cycle cap: max tokens or cost for any single cycle. Prevents single runaway cycle from consuming whole budget:
  - If cycle would exceed cap, truncate tool outputs or skip low-priority actions
  - Log truncation for post-mortem
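One possible way to apply the per-cycle cap to a single tool output is sketched below. It is an illustration under stated assumptions: `truncate_to_cap` is a hypothetical helper, token counts are approximated at 4 characters per token, and the caller is expected to log the truncation when the returned flag is true.

```python
def truncate_to_cap(tool_output: str, tokens_used: int, per_cycle_cap: int,
                    chars_per_token: int = 4) -> tuple[str, bool]:
    """Trim tool_output to fit the tokens left under per_cycle_cap.

    Returns (possibly truncated text, whether truncation happened).
    """
    remaining_tokens = max(per_cycle_cap - tokens_used, 0)
    max_chars = remaining_tokens * chars_per_token
    if len(tool_output) <= max_chars:
        return tool_output, False
    marker = "\n[TRUNCATED: per-cycle cap reached]"
    keep = max_chars - len(marker)
    if keep <= 0:
        # Not even the marker fits: drop the output entirely.
        return "", True
    return tool_output[:keep] + marker, True
```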
Document caps in workflow config:
token_budget:
  soft_limit_usd: 5.00         # warn and begin pruning
  hard_limit_usd: 10.00        # halt and alert
  per_cycle_cap_usd: 0.50      # max per individual cycle
  soft_limit_pct: 70           # % of context window triggering pruning
  hard_limit_pct: 90           # % of context window triggering halt
  enforcement: strict          # strict = halt on hard limit; advisory = log only
  alert_channel: notification  # how to notify the operator
Got: Documented budget caps at three levels (soft, hard, per-cycle) with explicit enforcement actions. Policy answers "what happens when we hit the limit?" before limit hit.
If fail: Setting precise dollar limits premature (new workflow, unknown cost profile)? Start with context-percentage limits only (soft 70%, hard 90%), add dollar limits after 24-48 hours of cost tracking. Advisory mode (log not halt) acceptable during calibration.
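The soft/hard decision can be wired into the loop with a check like the following sketch. The field names mirror a subset of the config keys above. The `TokenBudget` class, the `check_budget` function, and the "ok" / "warn" / "halt" return values are assumptions for illustration, not part of any existing library.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    soft_limit_usd: float = 5.00
    hard_limit_usd: float = 10.00
    soft_limit_pct: float = 70.0
    hard_limit_pct: float = 90.0
    enforcement: str = "strict"  # "strict" halts on hard limit; "advisory" only logs


def check_budget(budget: TokenBudget, cumulative_cost_usd: float,
                 context_tokens: int, context_window: int) -> str:
    """Return "ok", "warn" (soft limit: prune, slow down) or "halt" (hard limit)."""
    context_pct = 100.0 * context_tokens / context_window
    over_hard = (cumulative_cost_usd >= budget.hard_limit_usd
                 or context_pct >= budget.hard_limit_pct)
    over_soft = (cumulative_cost_usd >= budget.soft_limit_usd
                 or context_pct >= budget.soft_limit_pct)
    if over_hard:
        # Advisory mode downgrades a halt to a warning, e.g. during calibration.
        return "halt" if budget.enforcement == "strict" else "warn"
    if over_soft:
        return "warn"
    return "ok"
```

Call it before starting each cycle. On "halt", alert the operator and do not schedule the next cycle; on "warn", run pruning (Step 4) first.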
Step 4: Implement Emergency Pruning
When approaching limits, drop low-value context to stay within budget.
Pruning priority order (drop lowest-value first):
1. Old tool outputs: verbose search results, file contents, API responses from previous cycles informing decisions already made. Decision persists; evidence can go.
2. Redundant conversation turns: early turns superseded by later corrections or refinements. Turn 3 asked X, turn 7 revised to Y, turn 3 redundant.
3. Verbose formatting: tables, ASCII art, decorative headers in tool outputs. Summarize with one-line description of what output contained.
4. Completed sub-task context: for multi-step tasks, context from sub-tasks fully complete with outputs captured in summary or file.
5. Inactive skill procedures: if skill loaded for previous step but no longer being followed, full procedure text can drop.
6. Memory sections irrelevant to current task: auto-loaded memory about unrelated projects or past sessions.
For each pruned item, preserve one-line tombstone:
[PRUNED: 2,400 tokens of npm audit output from cycle 12 — 3 vulnerabilities found, all patched]
Tombstone costs ~20 tokens but preserves decision-relevant conclusion.
Got: Context window usage drops below soft limit after pruning. Each pruned item has tombstone preserving conclusion. No decision-critical info lost — only evidence behind already-made decisions.
If fail: Pruning to priority level 4 still leaves usage above soft limit? Workflow fundamentally too context-heavy for current cycle frequency. Escalate to human: "Context usage at N% after pruning. Options: (a) increase cycle interval, (b) reduce scope per cycle, (c) split into sub-workflows, (d) accept higher cost."
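The priority-ordered pruning with tombstones can be sketched as below. It assumes each context item has already been tagged with its priority level (1 drops first) and a one-line conclusion. `ContextItem` and `prune` are hypothetical names, and the 20-token tombstone cost is the estimate from this step.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    label: str       # e.g. "npm audit output from cycle 12"
    tokens: int
    priority: int    # 1 = drop first (old tool outputs) ... 6 = drop last
    conclusion: str  # decision-relevant summary kept in the tombstone


def prune(items: list[ContextItem], total_tokens: int, soft_limit_tokens: int,
          tombstone_tokens: int = 20) -> tuple[list[ContextItem], list[str], int]:
    """Drop items in priority order until usage is at or below the soft limit.

    Returns (kept items, tombstones, new token total).
    """
    kept, tombstones = [], []
    for item in sorted(items, key=lambda i: i.priority):
        # Skip items smaller than a tombstone: replacing them saves nothing.
        if total_tokens > soft_limit_tokens and item.tokens > tombstone_tokens:
            tombstones.append(
                f"[PRUNED: {item.tokens:,} tokens of {item.label} - {item.conclusion}]")
            total_tokens -= item.tokens - tombstone_tokens
        else:
            kept.append(item)
    return kept, tombstones, total_tokens
```

If the returned total is still above the soft limit after every item has been considered, escalate to the human as described above.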
Step 5: Integrate Progressive Disclosure for Skill Loading
Route through registry metadata before loading full skill procedures — spend tokens on routing, not on reading.
Pattern:
- Route first: When task requires skill, read skill's registry entry (id, description, domain, complexity, tags) from _registry.yml (roughly 3-5 lines, ~50 tokens)
- Confirm relevance: Does registry description match current need? If not, check next candidate. Costs ~50 tokens per miss vs ~500-2000 tokens for loading wrong SKILL.md
- Load on match: Only when registry entry confirms relevance, load full SKILL.md procedure
- Unload after use: Once skill procedure done, full text can be pruned (Step 4, priority 5) — keep only summary of what was done
Apply same pattern to other large context payloads:
- Memory files: Read MEMORY.md index lines first; load topic files only when topic relevant
- Tool documentation: Use tool names + one-line descriptions for routing; load full schemas only for tools being called
- File contents: Read file listings + function signatures first; load full file contents only for functions being modified
Without progressive disclosure:
Load 5 candidate skills → 5 × 1,500 tokens = 7,500 tokens → use 1 skill
With progressive disclosure:
Route through 5 registry entries → 5 × 50 tokens = 250 tokens
Load 1 matched skill → 1 × 1,500 tokens = 1,500 tokens
Total: 1,750 tokens (77% reduction)
Got: Skill loading follows two-phase pattern: lightweight routing via metadata, then full loading only on confirmed match. Same pattern applied to memory, tool schemas, file contents where applicable.
If fail: Registry metadata insufficient for routing (descriptions too vague, tags missing)? Improve registry entries rather than abandoning progressive disclosure. Fix is better metadata, not more context loading.
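The two-phase pattern can be sketched as follows. This is illustrative only: it assumes the registry has already been parsed into a list of dicts with `id`, `description`, and `tags` keys, and it uses naive keyword overlap as a stand-in for the relevance check, which in a real agent is usually the model's own judgment. `route_skill` and `load_full` are hypothetical names.

```python
from typing import Callable, Optional


def route_skill(task_keywords: set[str], registry: list[dict],
                load_full: Callable[[str], str]) -> Optional[str]:
    """Check lightweight registry entries first; load a full procedure only on a match."""
    for entry in registry:
        metadata_words = set(entry["description"].lower().split()) | set(entry["tags"])
        if task_keywords & metadata_words:
            return load_full(entry["id"])  # the only expensive read
    return None  # no match: nothing heavy was loaded
```

The point of the design is that `load_full` runs at most once per routing decision, so a miss costs only the metadata read.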
Step 6: Design Cost-Aware Cycle Intervals
Set execution intervals based on cost data, not arbitrary schedules.
- Calculate cost-per-hour at current cycle interval:
  cost_per_hour = avg_cost_per_cycle × cycles_per_hour
  - Example: $0.09/cycle at 2 cycles/hour = $0.18/hour = $4.32/day
- Compare against budget:
  hours_until_hard_limit = (hard_limit - cumulative_cost) / cost_per_hour
  - If hours_until_hard_limit < intended runtime, extend cycle interval
- Determine minimum effective interval:
  - Fastest rate of change in monitored system? If data source updates every 4 hours, polling every 30 minutes wastes 7 of 8 cycles
  - Match cycle interval to data refresh rate, not anxiety about missing events
  - Event-driven systems: replace polling with webhooks or push notifications where possible
- Apply interval:
Before: 30-minute heartbeat, verbose processing
→ 48 cycles/day × $0.09/cycle = $4.32/day
After: 4-hour heartbeat, notification-only
→ 6 cycles/day × $0.04/cycle = $0.24/day
→ 94% cost reduction
Got: Cycle interval justified by cost data, matches monitored system refresh rate. Interval-cost tradeoff documented so future adjustments have baseline.
If fail: System requires low-latency response, cannot tolerate longer intervals? Reduce per-cycle cost instead (smaller system prompts, fewer tool schemas loaded, summarized history). Budget equation has two levers: frequency + cost-per-cycle.
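The interval arithmetic from this step, written out as a Python sketch. The two formulas come straight from the text. `choose_interval_hours` is a hypothetical helper that takes the longer of the data refresh period and the interval the remaining budget can sustain over the intended runtime. It assumes cost per cycle stays roughly constant.

```python
def cost_per_hour(avg_cost_per_cycle: float, interval_hours: float) -> float:
    """cost_per_hour = avg_cost_per_cycle x cycles_per_hour."""
    return avg_cost_per_cycle / interval_hours


def hours_until_hard_limit(hard_limit: float, cumulative_cost: float,
                           avg_cost_per_cycle: float, interval_hours: float) -> float:
    """Hours of runtime left before the hard limit at the current burn rate."""
    return (hard_limit - cumulative_cost) / cost_per_hour(avg_cost_per_cycle, interval_hours)


def choose_interval_hours(hard_limit: float, cumulative_cost: float,
                          avg_cost_per_cycle: float, intended_runtime_hours: float,
                          data_refresh_hours: float) -> float:
    """Pick the longer of the data refresh period and the budget-sustainable interval."""
    remaining = hard_limit - cumulative_cost
    if remaining <= 0:
        raise ValueError("hard limit already reached; halt instead of rescheduling")
    affordable_cycles = remaining / avg_cost_per_cycle
    budget_interval = intended_runtime_hours / affordable_cycles
    return max(data_refresh_hours, budget_interval)
```

With the example figures ($0.09 per cycle, 30-minute interval), `cost_per_hour(0.09, 0.5)` reproduces the $0.18/hour above.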
Step 7: Validate Budget Controls
Confirm all controls working, system operates within budget.
- Tracking validation: Run 3-5 cycles, verify per-cycle logs written with accurate token counts
- Soft limit test: Temporarily lower soft limit, verify warning fires + pruning begins
- Hard limit test: Temporarily lower hard limit, verify system halts + alerts
- Per-cycle cap test: Inject large tool output, verify it gets truncated rather than blowing cap
- Progressive disclosure test: Trace skill-loading sequence, confirm it routes through registry before loading full SKILL.md
- Cost projection: From validation data, project:
- Daily cost at current settings
- Days until hard limit at current burn rate
- Expected monthly cost
Budget Validation Report:
+-----------------------+----------+--------+
| Check                 | Expected | Actual |
+-----------------------+----------+--------+
| Per-cycle logging     | Present  |        |
| Soft limit warning    | Fires    |        |
| Hard limit halt       | Halts    |        |
| Per-cycle cap         | Truncates|        |
| Progressive disclosure| Routes   |        |
| Daily cost projection | < $X.XX  |        |
+-----------------------+----------+--------+
Got: All five controls (tracking, soft limit, hard limit, per-cycle cap, progressive disclosure) verified working. Cost projection within intended budget.
If fail: Controls not firing? Check enforcement mechanism wired into actual execution loop, not just documented. Configuration without enforcement is plan, not control. Cost projection exceeds budget? Return to Step 6, adjust cycle interval or per-cycle cost.
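The cost projection from the validation run can be computed with a sketch like this. It assumes the 3-5 sample cycles are representative and treats a month as 30 days. `project_costs` and its return keys are hypothetical names.

```python
def project_costs(cycle_costs_usd: list[float], cycles_per_day: float,
                  hard_limit_usd: float, cumulative_cost_usd: float) -> dict:
    """Project daily cost, days until the hard limit, and 30-day cost from sample cycles."""
    avg = sum(cycle_costs_usd) / len(cycle_costs_usd)
    daily = avg * cycles_per_day
    return {
        "daily_cost_usd": daily,
        "days_until_hard_limit": (hard_limit_usd - cumulative_cost_usd) / daily,
        "monthly_cost_usd": daily * 30,
    }
```

Compare `daily_cost_usd` against the "< $X.XX" row of the report. If it exceeds the target, return to Step 6.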
Checks
- Per-cycle cost tracking logs input tokens, output tokens, cost, timestamp every cycle
- Context window audit identifies all consumers with approximate token counts + percentages
- Budget caps defined at three levels: soft limit, hard limit, per-cycle cap
- Each cap has explicit enforcement action (warn, prune, halt, alert)
- Emergency pruning follows priority order, preserves tombstones
- Progressive disclosure routes through metadata before loading full content
- Cycle interval justified by cost data, matches monitored system refresh rate
- Validation tests confirm all controls fire correctly
- Cost projection within defined budget
- Post-incident: root cause identified, specific prevention measure in place
Pitfalls
- Tracking in context window: Storing per-cycle logs inside conversation history inflates the very thing you are trying to control. Log externally (file, DB, API), keep only current summary in context.
- Soft limits without enforcement: Warning nobody sees is not a control. Soft limits must trigger visible action — pruning, interval extension, operator notification. If system can silently exceed soft limit, it will.
- Pruning evidence before the decision: Dropping tool outputs before decisions made loses information. Prune evidence AFTER decision it informed, not before. Tombstone pattern preserves conclusions while dropping evidence.
- Matching cycle interval to anxiety, not data refresh: Polling source every 30 minutes when it updates every 4 hours wastes 87.5% of cycles. Measure data source actual refresh rate before setting interval.
- Loading full skills for routing: Reading 400-line SKILL.md to decide "is this right skill?" costs 10-20x more than reading 3-line registry entry. Route through metadata first; load procedure only on confirmed match.
- Ignoring system prompt: System prompts, CLAUDE.md chains, auto-loaded memory are invisible costs — paid every single cycle. 5,000-token system prompt in 48-cycle/day loop costs 240,000 input tokens/day just for instructions. Audit + trim these first.
- Budget caps without human escalation: Autonomous systems hitting budget limits + silently degrading (instead of alerting human) can accumulate damage. Hard limits must include human notification channel.
See Also
- assess-context: evaluate reasoning context for structural health; complements context window audit in Step 2
- metal: extract conceptual essence from codebases; progressive disclosure pattern applies to metal's prospect phase
- chrysopoeia: value extraction + dead weight elimination; applies same value-per-token thinking at code level
- manage-memory: organize + prune persistent memory files; directly reduces memory component of context budgets
