SKILL·599E6F

manage-token-budget

Name: manage-token-budget
Author: pjt222

pjt222

更新于 1 month ago

9 次查看

设计aiapiautomationdesign

关于

This skill provides tools to monitor and control token usage in long-running AI agent workflows. It implements cost tracking, budget enforcement, and emergency context pruning to prevent excessive accumulation. Use it to add cost guardrails and prevent API spending spikes in autonomous agent systems.

快速安装

Claude Code

技能文档

Manage Token Budget

Control cost + context footprint of agentic systems. Track tokens per cycle, audit what consumes context, enforce budget caps, prune low-value context under pressure, route through metadata before loading full procedures. Core principle: every token in context window must earn its place. Tokens informing decisions stay; tokens occupying space without influencing output get pruned.

Community evidence: 37-hour autonomous session cost $13.74 from 30-minute heartbeat interval combined with verbose system instructions + unchecked context accumulation. Fix: rewrote heartbeat to 4-hour intervals, switched to notification-only mode, eliminated feed browsing from loop. This skill codifies patterns preventing such incidents.

When Use

Running long-lived agent loops (heartbeats, polling, autonomous workflows) where costs compound
Context windows growing unpredictably between cycles
API costs spiked beyond baselines, post-mortem needed
Designing new agentic workflow, want cost guardrails built in from start
After cost incident, audit what went wrong, prevent recurrence
System prompts, memory files, tool schemas grown large enough to dominate context window

Inputs

Required: Agentic system or workflow to budget (running or planned)
Required: Budget ceiling (dollar per period, or token limit per cycle)
Optional: Current cost data (API logs, billing dashboard exports)
Optional: Context window size of target model (default: check model docs)
Optional: Acceptable degradation policy (what can drop when limits hit)

Steps

Step 1: Set Per-Cycle Cost Tracking

Instrument agent loop to log token usage at every execution boundary.

For each cycle (heartbeat, poll, task), capture:

Input tokens: system prompt + memory + tool schemas + conversation history + new user/system content
Output tokens: model response including tool calls
Total cost: input tokens x input price + output tokens x output price
Cycle timestamp: when cycle ran
Cycle trigger: what initiated (timer, event, user action)

Store these in structured log (JSON lines, CSV, database) — not in context window itself:

{"cycle": 47, "ts": "2026-03-12T14:30:00Z", "trigger": "heartbeat",
 "input_tokens": 18420, "output_tokens": 2105, "cost_usd": 0.0891,
 "cumulative_cost_usd": 3.42}

If system has no instrumentation, estimate from API billing:

Total cost / number of cycles = average cost per cycle
Compare against expected baseline (model pricing x expected context size)

Got: Log shows per-cycle token counts + costs, with granularity to identify expensive cycles + why. Log lives outside context window.

If fail: Exact token counts unavailable (some APIs do not return usage metadata)? Use billing dashboard for averages. Even coarse tracking (daily cost / daily cycle count) reveals trends. No tracking possible? Go to Step 2, work from context audit — estimate costs from context size.

Step 2: Audit Context Window

Measure what occupies context window, rank consumers by size.

Decompose context into components, measure each:

System prompt: base instructions, CLAUDE.md content, personality directives
Memory: MEMORY.md, topic files loaded via auto-memory
Tool schemas: MCP server tool definitions, function calling schemas
Skill procedures: full SKILL.md content loaded for active skills
Conversation history: prior turns in current session
Dynamic content: tool outputs, file contents, search results from current cycle

Produce context budget table:

Context Budget Audit:
+------------------------+--------+------+-----------------------------------+
| Component              | Tokens | %    | Notes                             |
+------------------------+--------+------+-----------------------------------+
| System prompt          | 4,200  | 21%  | Includes CLAUDE.md chain          |
| Memory (auto-loaded)   | 3,800  | 19%  | MEMORY.md + 4 topic files         |
| Tool schemas           | 2,600  | 13%  | 3 MCP servers, 47 tools           |
| Active skill procedure | 1,900  |  9%  | Full SKILL.md loaded              |
| Conversation history   | 5,100  | 25%  | 12 prior turns                    |
| Current cycle content  | 2,400  | 12%  | Tool outputs from this cycle      |
+------------------------+--------+------+-----------------------------------+
| TOTAL                  | 20,000 | 100% | Model limit: 200,000             |
| Remaining headroom     |180,000 |      |                                   |
+------------------------+--------+------+-----------------------------------+

Flag components disproportionately large vs decision-making value. 4,000-token memory file current task never references is pure overhead.

Got: Ranked table shows each context consumer, size, percent of window. At least one component stands out as candidate for reduction — most often conversation history or verbose tool outputs.

If fail: Exact token counts per component hard to get? Use character count / 4 as rough approximation for English. For structured data (JSON, YAML), character count / 3. Goal is relative ranking, not exact measurement.

Step 3: Set Budget Caps with Enforcement Policies

Define hard + soft limits, specify what happens when each reached.

Soft limit (warning threshold): typically 60-75% of hard limit. When hit:
- Log warning with current usage + remaining budget
- Begin voluntary pruning (Step 4) on lowest-value context
- Reduce cycle frequency if applicable (e.g., heartbeat interval 30min → 2h)
- Continue with degraded context
Hard limit (stop threshold): absolute max spend or context size. When hit:
- Halt autonomous operation immediately
- Send alert to human operator (notification, email, log entry)
- Preserve summary of current state for resumption
- Do not start another cycle until human reviews + authorizes
Per-cycle cap: max tokens or cost for any single cycle. Prevents single runaway cycle from consuming whole budget:
- If cycle would exceed cap, truncate tool outputs or skip low-priority actions
- Log truncation for post-mortem

Document caps in workflow config:

token_budget:
  soft_limit_usd: 5.00        # warn and begin pruning
  hard_limit_usd: 10.00       # halt and alert
  per_cycle_cap_usd: 0.50     # max per individual cycle
  soft_limit_pct: 70           # % of context window triggering pruning
  hard_limit_pct: 90           # % of context window triggering halt
  enforcement: strict          # strict = halt on hard limit; advisory = log only
  alert_channel: notification  # how to notify the operator

Got: Documented budget caps at three levels (soft, hard, per-cycle) with explicit enforcement actions. Policy answers "what happens when we hit the limit?" before limit hit.

If fail: Setting precise dollar limits premature (new workflow, unknown cost profile)? Start with context-percentage limits only (soft 70%, hard 90%), add dollar limits after 24-48 hours of cost tracking. Advisory mode (log not halt) acceptable during calibration.

Step 4: Implement Emergency Pruning

When approaching limits, drop low-value context to stay within budget.

Pruning priority order (drop lowest-value first):

Old tool outputs: verbose search results, file contents, API responses from previous cycles informing decisions already made. Decision persists; evidence can go.
Redundant conversation turns: early turns superseded by later corrections or refinements. Turn 3 asked X, turn 7 revised to Y, turn 3 redundant.
Verbose formatting: tables, ASCII art, decorative headers in tool outputs. Summarize with one-line description of what output contained.
Completed sub-task context: for multi-step tasks, context from sub-tasks fully complete with outputs captured in summary or file.
Inactive skill procedures: if skill loaded for previous step but no longer being followed, full procedure text can drop.
Memory sections irrelevant to current task: auto-loaded memory about unrelated projects or past sessions.

For each pruned item, preserve one-line tombstone:

[PRUNED: 2,400 tokens of npm audit output from cycle 12 — 3 vulnerabilities found, all patched]

Tombstone costs ~20 tokens but preserves decision-relevant conclusion.

Got: Context window usage drops below soft limit after pruning. Each pruned item has tombstone preserving conclusion. No decision-critical info lost — only evidence behind already-made decisions.

If fail: Pruning to priority level 4 still leaves usage above soft limit? Workflow fundamentally too context-heavy for current cycle frequency. Escalate to human: "Context usage at N% after pruning. Options: (a) increase cycle interval, (b) reduce scope per cycle, (c) split into sub-workflows, (d) accept higher cost."

Step 5: Integrate Progressive Disclosure for Skill Loading

Route through registry metadata before loading full skill procedures — spend tokens on routing, not on reading.

Pattern:

Route first: When task requires skill, read skill's registry entry (id, description, domain, complexity, tags) from _registry.yml — roughly 3-5 lines, ~50 tokens
Confirm relevance: Does registry description match current need? If not, check next candidate. Costs ~50 tokens per miss vs ~500-2000 tokens for loading wrong SKILL.md
Load on match: Only when registry entry confirms relevance, load full SKILL.md procedure
Unload after use: Once skill procedure done, full text can be pruned (Step 4, priority 5) — keep only summary of what was done

Apply same pattern to other large context payloads:

Memory files: Read MEMORY.md index lines first; load topic files only when topic relevant
Tool documentation: Use tool names + one-line descriptions for routing; load full schemas only for tools being called
File contents: Read file listings + function signatures first; load full file contents only for functions being modified

Without progressive disclosure:
  Load 5 candidate skills → 5 × 1,500 tokens = 7,500 tokens → use 1 skill

With progressive disclosure:
  Route through 5 registry entries → 5 × 50 tokens = 250 tokens
  Load 1 matched skill → 1 × 1,500 tokens = 1,500 tokens
  Total: 1,750 tokens (77% reduction)

Got: Skill loading follows two-phase pattern: lightweight routing via metadata, then full loading only on confirmed match. Same pattern applied to memory, tool schemas, file contents where applicable.

If fail: Registry metadata insufficient for routing (descriptions too vague, tags missing)? Improve registry entries rather than abandoning progressive disclosure. Fix is better metadata, not more context loading.

Step 6: Design Cost-Aware Cycle Intervals

Set execution intervals based on cost data, not arbitrary schedules.

Calculate cost-per-hour at current cycle interval:
- cost_per_hour = avg_cost_per_cycle × cycles_per_hour
- Example: $0.09/cycle at 2 cycles/hour = $0.18/hour = $4.32/day
Compare against budget:
- hours_until_hard_limit = (hard_limit - cumulative_cost) / cost_per_hour
- If hours_until_hard_limit < intended runtime, extend cycle interval
Determine minimum effective interval:
- Fastest rate of change in monitored system? If data source updates every 4 hours, polling every 30 minutes wastes 7 of 8 cycles
- Match cycle interval to data refresh rate, not anxiety about missing events
- Event-driven systems: replace polling with webhooks or push notifications where possible
Apply interval:

Before: 30-minute heartbeat, verbose processing
  → 48 cycles/day × $0.09/cycle = $4.32/day

After: 4-hour heartbeat, notification-only
  → 6 cycles/day × $0.04/cycle = $0.24/day
  → 94% cost reduction

Got: Cycle interval justified by cost data, matches monitored system refresh rate. Interval-cost tradeoff documented so future adjustments have baseline.

If fail: System requires low-latency response, cannot tolerate longer intervals? Reduce per-cycle cost instead (smaller system prompts, fewer tool schemas loaded, summarized history). Budget equation has two levers: frequency + cost-per-cycle.

Step 7: Validate Budget Controls

Confirm all controls working, system operates within budget.

Tracking validation: Run 3-5 cycles, verify per-cycle logs written with accurate token counts
Soft limit test: Temporarily lower soft limit, verify warning fires + pruning begins
Hard limit test: Temporarily lower hard limit, verify system halts + alerts
Per-cycle cap test: Inject large tool output, verify it gets truncated rather than blowing cap
Progressive disclosure test: Trace skill-loading sequence, confirm it routes through registry before loading full SKILL.md
Cost projection: From validation data, project:
- Daily cost at current settings
- Days until hard limit at current burn rate
- Expected monthly cost

Budget Validation Report:
+-----------------------+----------+--------+
| Check                 | Expected | Actual |
+-----------------------+----------+--------+
| Per-cycle logging     | Present  |        |
| Soft limit warning    | Fires    |        |
| Hard limit halt       | Halts    |        |
| Per-cycle cap         | Truncates|        |
| Progressive disclosure| Routes   |        |
| Daily cost projection | < $X.XX  |        |
+-----------------------+----------+--------+

Got: All five controls (tracking, soft limit, hard limit, per-cycle cap, progressive disclosure) verified working. Cost projection within intended budget.

If fail: Controls not firing? Check enforcement mechanism wired into actual execution loop, not just documented. Configuration without enforcement is plan, not control. Cost projection exceeds budget? Return to Step 6, adjust cycle interval or per-cycle cost.

Checks

Pitfalls

Tracking in context window: Storing per-cycle logs inside conversation history inflates the very thing you trying to control. Log externally (file, DB, API), keep only current summary in context.
Soft limits without enforcement: Warning nobody sees is not a control. Soft limits must trigger visible action — pruning, interval extension, operator notification. If system can silently exceed soft limit, it will.
Pruning decisions over data: Dropping tool outputs before decisions made loses information. Prune evidence AFTER decision it informed, not before. Tombstone pattern preserves conclusions while dropping evidence.
Matching cycle interval to anxiety, not data refresh: Polling source every 30 minutes when it updates every 4 hours wastes 87.5% of cycles. Measure data source actual refresh rate before setting interval.
Loading full skills for routing: Reading 400-line SKILL.md to decide "is this right skill?" costs 10-20x more than reading 3-line registry entry. Route through metadata first; load procedure only on confirmed match.
Ignoring system prompt: System prompts, CLAUDE.md chains, auto-loaded memory are invisible costs — paid every single cycle. 5,000-token system prompt in 48-cycle/day loop costs 240,000 input tokens/day just for instructions. Audit + trim these first.
Budget caps without human escalation: Autonomous systems hitting budget limits + silently degrading (instead of alerting human) can accumulate damage. Hard limits must include human notification channel.

GitHub 仓库

pjt222/agent-almanac

路径: i18n/caveman/skills/manage-token-budget

agentsagentskillsai-assisted-developmentclaude-codeskillsteams

FAQ

Frequently asked questions

What is the manage-token-budget skill?

manage-token-budget is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform manage-token-budget-related tasks without extra prompting.

How do I install manage-token-budget?

Use the install commands on this page: add manage-token-budget to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does manage-token-budget belong to?

manage-token-budget is in the Design category, tagged ai, api, automation and design.

Is manage-token-budget free to use?

Yes. manage-token-budget is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.