monitor-binary-version-baselines
について
このスキルは、APIやフラグなどの分類されたマーカーを検出し、重み付けスコアリングを用いてベースラインを確立することで、バージョン間のCLIバイナリ変更を追跡します。開発者が機能ライフサイクルを監視し、隠れた機能を発見し、過去のバイナリに対するスキャンツールを検証するのに役立ちます。リリース間のバイナリコンテンツ進化の縦断的分析にご活用ください。
クイックインストール
Claude Code
推奨npx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/monitor-binary-version-baselinesこのコマンドをClaude Codeにコピー&ペーストしてスキルをインストールします
ドキュメント
Monitor Binary Version Baselines
Build + maintain comparable, version-keyed records of feature-system markers in CLI harness binary → additions/removals/dark-launched detected mechanically across releases.
Use When
- Track feature lifecycle across releases of closed-source CLI harness
- Probe dark-launched (shipped but gated off) or quietly-removed
- Verify scanner still detects known-good markers on old binaries (regression-test scanner)
- Phase 1 substrate consumed by later phases (flag discovery, dark-launch detection, wire capture)
- Ad-hoc
grepanswers "X present today" but you need "how has system X+Y+Z moved across versions"
In
- Required: ≥1 installed binary versions of same CLI harness (or extracted bundles)
- Required: Working catalog file for marker defs (created on first run, extended)
- Optional: Prior baseline file (extended in place, never rewritten)
- Optional: List of never-published versions (skipped, withdrawn)
- Optional: List of feature-systems already tracked, to extend not re-discover
Do
Step 1: Markers by Category
Pick stable semantically meaningful identifiers — not minified names bundler will rename next release.
Six categories:
- API — endpoint paths, method names exposed in network surface
- Identity — internal product names, codenames, version sentinels
- Config — recognized keys in user-facing config files
- Telemetry — event names emitted to analytics pipeline
- Flag — feature-gate keys consumed by gate predicates
- Function — well-known string constants in handlers (err msgs, log labels)
Avoid: short minified-looking (_a1, bX, two-letter+digits), inline literals that change w/ text revision, anything matching bundler's internal naming.
→ Each candidate has category tag + short justification ("appears in user-facing docs," "stable across N prior releases"). Typical first pass: 20-50 markers per system.
If err: markers vanish across consecutive minor versions → catalog captured rebuild-volatile not stable. Drop entries, broaden to longer semantically anchored substrings.
Step 2: Group by Feature-System
Bundle markers per system table for independently-evolving capability. "System" = coherent marker set whose presence/absence moves together (shared lifecycle).
Why: per-system scoring prevents cross-contamination. Absence of one system's markers must not suppress detection of another. Aggregate counts across unrelated systems = uninformative.
Working catalog shape (pseudocode):
catalog:
acme_widget_v3:
markers:
- { id: "acme_widget_v3_init", category: function, weight: 10 }
- { id: "acme.widget.v3.dialog.open", category: telemetry, weight: 5 }
- { id: "ACME_WIDGET_V3_DISABLE", category: flag, weight: 10 }
acme_other_system:
markers:
- ...
→ Each system has own marker list; no marker in two systems. New system = new top-level entry, never move retroactively.
If err: hard to assign (overlap, ambiguity) → defs too coarse. Split system or accept "shared substrate", exclude from per-system scoring.
Step 3: Weight by Signal Strength
Per marker:
- 10 = diagnostic-alone — unique enough alone confirms (long, system-specific, no other code path emits)
- 3-5 = corroborating only — too generic alone, contributes to aggregate (short telemetry suffix reused)
Convention not specific numbers. Spread "diagnostic" vs. "corroborating" matters more than exact integers. Thresholds Step 5 must distinguish "one strong signal" from "many weak signals."
→ Each marker weighted. Distribution skews toward corroborating (3-5), small number diagnostic-alone (10) per system.
If err: every marker = 10 → scoring loses resolution, partial-presence impossible. Demote markers recurring across systems or in unrelated handlers.
Step 4: Per-Version Baselines
Per scanned version, record both present + absent keyed by version. Both = evidence: absent in N is informative when N+1 reintroduces.
Shape:
baselines:
"1.4.0":
acme_widget_v3:
present: ["acme_widget_v3_init", "ACME_WIDGET_V3_DISABLE"]
absent: ["acme.widget.v3.dialog.open"]
score: 20
"1.5.0":
acme_widget_v3:
present: ["acme_widget_v3_init", "ACME_WIDGET_V3_DISABLE", "acme.widget.v3.dialog.open"]
absent: []
score: 25
"1.4.1":
_annotation: "never-published; skipped from upstream release timeline"
Never-published get explicit annotation, not silent omission. Silent skips look like data loss to next reader.
→ Every version has one record per tracked system: present, absent, score populated, or explicit _annotation if never-published.
If err: scan yields zero markers for previously-present system → don't assume removal until you confirm binary path correct, strings cmd produced output, marker IDs match catalog exactly. False zeroes corrupt longitudinal record.
Step 5: Thresholds for Full + Partial Detection
Two gates per system on aggregate score:
full— score above which system = present-and-active in this versionpartial— score above which system = shipped-but-incomplete (some markers, belowfull)
Below partial = absent (or not-yet-present, depending on direction).
thresholds:
acme_widget_v3:
full: 25
partial: 10
Choose: full = sum of weights healthy install would emit; partial = one diagnostic + corroborating signal. Re-tune w/ several versions evidence.
→ Each scan: per-system finding full | partial | absent. partial warrants investigation — dark-launch + removal candidates.
If err: every system reports partial across every version → thresholds too sensitive (set higher than markers can sum). Recalibrate vs. known-good version where system verifiably live.
Step 6: Scan w/ strings -n 8
strings w/ min length filter as extraction primitive. -n 8 floor filters most noise (short fragments, padding, address-table junk) w/o losing meaningful identifiers (almost always >8 chars).
strings -n 8 path/to/binary > /tmp/binary-strings.txt
Then catalog match vs. /tmp/binary-strings.txt (any line-oriented matcher: grep -F -f markers.txt, ripgrep, small script).
Caveats:
- Lower (
-n 4,-n 6) → flood w/ binary garbage + minified-symbol noise; diagnostic/corroborating collapses - Higher (
-n 12+) → miss short flag identifiers + config keys - Some bundlers compress/encode → near-empty output → may need bundle-extraction first (out of scope)
→ Line-per-string out 1k-100k lines depending on binary size. Manual inspection reveals recognizable identifiers in first 100 lines.
If err: empty/unrecognizable → binary likely packed, encrypted, or bytecode format strings can't read. Stop, resolve at extraction layer. Don't record baseline from unreadable scan.
Step 7: Extend Forward, Don't Rewrite Past Records
New system or marker added to catalog → only forward versions scanned. Past records remain as originally written.
Why: prior baselines = empirical evidence of what was scanned at the time, not current model of what past version contained. Retroactive rewriting conflates "what we know now" w/ "what we observed then." Both useful, only one in baseline file.
Retroactive scan genuinely needed (test if new marker was in N-3) → record as separate addendum:
addenda:
"1.4.0":
scan_date: "2026-04-15"
catalog_revision: "v7"
findings:
acme_new_system:
present: ["..."]
Original baselines["1.4.0"] untouched. Reader sees both original + later retroactive, w/ respective catalog revisions.
→ Baseline grows monotonically forward; past records append-only w/ optional addenda. Catalog revisions versioned so each scan tied back to catalog state used.
If err: urge to edit past version's present list directly → stop. Add addendum. Mutating past records loses ability to detect scanner regressions (later scanner-validation relies on historical record being immutable).
Check
- Catalog has explicit category tags every marker (API/identity/config/telemetry/flag/function)
- Every marker assigned to exactly one system; no duplicates
- Weights span real range (some 10s, some 3-5); not all identical
- Each scanned version:
present+absent+scoreper tracked system - Never-published explicitly annotated, not silent omission
- Each system:
full+partialthresholds; findings labeled -
strings -n 8extraction primitive (or documented equivalent for non-text binaries) - Past version records unchanged by latest scan; new findings in addenda if retroactive
Traps
- Recording specific findings as catalog: Catalog should describe marker categories + shapes, not enumerate version-pinned literals. Finding-shaped entries decay fast + highest leak risk if published.
- Capturing minified identifiers:
_p3a,q9Xrename every rebuild. Match today = noise tomorrow. Stay w/ semantically meaningful. - Conflating telemetry vs. flags: Share naming conventions in many harnesses but different roles. Tag by category (Step 1) so per-category analysis stays clean.
- Silent skip never-published: Gap w/ no annotation looks like missed scan. Annotate:
_annotation: "never-published". - Thresholds before baseline data: First scan establishes empirical weight totals; tune vs. that, not in advance.
- Rewriting prior records when catalog grows: Past records = evidence; addenda = supported pattern for retroactive.
- Trust empty scan output: Zero found ≠ "absent." Confirm binary readable + catalog IDs match exactly before declaring removal.
strings -n 4more thorough than-n 8: Lower mins add noise faster than signal. Diagnostic markers essentially always 8+ chars.
→
security-audit-codebase— shared discipline; both treat marker presence as finding, different downstream consumersaudit-dependency-versions— extends same version-tracking rigor to external dep manifests; this applies to binary artifactsprobe-feature-flag-state— Phase 2-3 follow-up; consumes baselines to classify flag rollout state (live/opt-in/dark/removed)conduct-empirical-wire-capture— Phase 4 follow-up; validates inferred behavior vs. actual harness trafficredact-for-public-disclosure— Phase 5 follow-up; governs which findings can leave private workspace
GitHub リポジトリ
関連スキル
qmd
開発qmdは、BM25、ベクトル埋め込み、およびリランキングを組み合わせたハイブリッド検索を用いて、ローカルファイルのインデックス作成と検索を可能にするローカル検索・インデックス作成CLIツールです。コマンドラインでの使用と、Claudeとの統合のためのMCP(Model Context Protocol)モードの両方をサポートしています。このツールは埋め込みにOllamaを使用し、インデックスをローカルに保存するため、ターミナルから直接ドキュメントやコードベースを検索するのに最適です。
subagent-driven-development
開発このスキルは、各独立したタスクに対して新規のサブエージェントを起動し、タスク間でコードレビューを実施しながら実装計画を実行します。レビュープロセスを通じて品質基準を維持しつつ、迅速な反復を可能にします。同一セッション内で主に独立したタスクに取り組む際に本スキルをご利用いただくことで、組み込まれた品質チェックを伴う継続的な進捗を確保できます。
mcporter
開発mcporterスキルは、開発者がClaudeから直接Model Context Protocol(MCP)サーバーを管理および呼び出せるようにします。このスキルは、利用可能なサーバーの一覧表示、引数を指定したツールの呼び出し、認証およびデーモンのライフサイクル管理を行うコマンドを提供します。開発ワークフローにおいてMCPサーバーの機能を統合およびテストする際に、このスキルをご利用ください。
adk-deployment-specialist
開発このスキルは、A2Aプロトコルを使用してVertex AI ADKエージェントをデプロイおよびオーケストレーションし、AgentCardの発見、タスク送信、およびコード実行サンドボックスやメモリバンクなどのサポートツールを管理します。Python、Java、またはGoで、順次、並列、またはループのオーケストレーションパターンを用いたマルチエージェントシステムの構築を可能にします。Google Cloud上でADKエージェントのデプロイやエージェントワークフローのオーケストレーションを求められた際にご利用ください。
