monitor-binary-version-baselines
정보
이 스킬은 API 엔드포인트와 기능 플래그 같은 분류된 마커들을 스캔하여 CLI 바이너리의 버전별 기준선을 설정합니다. 가중치 점수와 임계값을 사용해 릴리스 간 기능 존재를 감지하며, 기능의 라이프사이클을 추적하거나 숨겨진 기능을 발견합니다. 개발자는 이를 통해 스캔 도구를 검증하거나 시간에 따른 바이너리 내용의 변화를 모니터링할 수 있습니다.
빠른 설치
Claude Code
추천npx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/monitor-binary-version-baselinesClaude Code에서 이 명령을 복사하여 붙여넣어 스킬을 설치하세요
문서
Monitor Binary Version Baselines
Build + maintain comparable, version-keyed records of which feature-system markers appear in CLI harness binary, so additions, removals, dark-launched capabilities can be detected mechanically across releases.
When Use
- Tracking feature's lifecycle across multiple releases of closed-source CLI harness
- Probing for dark-launched capabilities (shipped but gated off) or quietly-removed ones
- Verifying marker scanner still detects known-good markers on old binaries (regression-testing scanner itself)
- Building Phase 1 substrate that later phases (flag discovery, dark-launch detection, wire capture) consume
- Any context where ad-hoc
grepanswers "is X present today" but you actually need "how has system composed of X, Y, Z moved across versions"
Inputs
- Required: One or more installed binary versions of same CLI harness (or extracted bundles)
- Required: Working catalog file for marker definitions (created on first run, extended across versions)
- Optional: Previously-recorded baseline file from prior runs (extended in place, never rewritten)
- Optional: List of versions known to be never-published (skipped releases, withdrawn builds)
- Optional: List of feature-systems already under tracking, to extend rather than re-discover
Steps
Step 1: Select Markers by Category
Choose strings surviving rebuilds. Pick stable, semantically meaningful identifiers — not minified names bundler will rename next release.
Six recommended categories:
- API — endpoint paths, method names exposed in harness's network surface
- Identity — internal product names, codenames, version sentinels
- Config — recognized keys in user-facing configuration files
- Telemetry — event names emitted to analytics pipeline
- Flag — feature-gate keys consumed by gate predicates
- Function — well-known string constants used inside specific handlers (error messages, log labels)
Avoid: short identifiers looking minified (e.g., _a1, bX, two-letter names followed by digits), inline literals changing with any text revision, anything matching bundler's own internal naming convention.
Got: Each candidate marker has category tag + short justification ("appears in user-facing docs," "stable across N prior releases," etc.). Typical first pass yields 20-50 markers per system.
If fail: Markers vanish across consecutive minor versions? Catalog has captured rebuild-volatile strings rather than stable identifiers. Drop those entries; broaden to longer, more semantically anchored substrings.
Step 2: Group Markers by Feature-System
Bundle markers into one system table per independently-evolving capability. "System" is coherent set of markers whose presence/absence moves together because they share feature lifecycle (e.g., all markers belonging to hypothetical acme_widget_v3 capability).
Why grouping matters: per-system scoring prevents cross-contamination. Absence of one system's markers must not suppress detection of another, + aggregate counts across unrelated systems are uninformative.
Working catalog shape (pseudocode):
catalog:
acme_widget_v3:
markers:
- { id: "acme_widget_v3_init", category: function, weight: 10 }
- { id: "acme.widget.v3.dialog.open", category: telemetry, weight: 5 }
- { id: "ACME_WIDGET_V3_DISABLE", category: flag, weight: 10 }
acme_other_system:
markers:
- ...
Got: Each system has its own marker list; no marker appears in two systems. Adding new system means adding new top-level entry — never moving markers between systems retroactively.
If fail: Markers hard to assign to one system (overlap, ambiguity)? System definitions too coarse. Split system, or accept some markers are "shared substrate" + exclude them from per-system scoring.
Step 3: Weight Markers by Signal Strength
Assign each marker weight reflecting how much its presence alone confirms system:
- 10 = diagnostic-alone — unique enough that finding this marker, by itself, is sufficient to confirm system is present (e.g., long, system-specific string no other code path would emit)
- 3-5 = corroborating only — too generic to confirm alone, but contributes to aggregate score (e.g., short telemetry suffix harness reuses across features)
Teach convention, not specific numbers. Spread between "diagnostic" + "corroborating" matters more than exact integers chosen — what counts is thresholds in step 5 can distinguish "one strong signal" from "many weak signals."
Got: Each marker has weight. Catalog's weight distribution skews toward corroborating markers (3-5), with small number of diagnostic-alone markers (10) per system.
If fail: Every marker weighted 10? Scoring loses resolution — partial-presence findings become impossible. Demote markers recurring across multiple systems or appearing in unrelated handlers.
Step 4: Record Per-Version Baselines
For each version scanned, record both present and absent markers, keyed by version. Both are evidence: absent marker in version N is just as informative as present one when version N+1 reintroduces it.
Baseline shape:
baselines:
"1.4.0":
acme_widget_v3:
present: ["acme_widget_v3_init", "ACME_WIDGET_V3_DISABLE"]
absent: ["acme.widget.v3.dialog.open"]
score: 20
"1.5.0":
acme_widget_v3:
present: ["acme_widget_v3_init", "ACME_WIDGET_V3_DISABLE", "acme.widget.v3.dialog.open"]
absent: []
score: 25
"1.4.1":
_annotation: "never-published; skipped from upstream release timeline"
Never-published versions get explicit annotation rather than silent omission. Silently-skipped versions look like data loss to next reader.
Got: Every version produces one record per tracked system, with present, absent, score populated, or explicit _annotation if never-published.
If fail: Baseline scan yields zero markers for system previously present? Do not assume removal until you confirm binary path correct, strings command produced output, marker IDs match catalog exactly. False zeroes corrupt longitudinal record.
Step 5: Set Thresholds for Full + Partial Detection
Define two gates per system applied to aggregate score:
full— score above which system considered present-and-active in this versionpartial— score above which system considered shipped-but-incomplete (some markers present, but belowfullthreshold)
Below partial = absent (or not-yet-present, depending on direction of travel).
thresholds:
acme_widget_v3:
full: 25
partial: 10
Choosing thresholds: set full to sum of weights you would expect healthy install to emit; set partial to one diagnostic marker plus corroborating signal. Re-tune when you have several versions of evidence.
Got: Each scan produces labeled finding per system: full | partial | absent. Findings with partial warrant investigation — they are dark-launch + removal candidates.
If fail: Every system reports partial across every version? Thresholds too sensitive (likely set higher than markers can ever sum to). Recalibrate against known-good version where system verifiably live.
Step 6: Scan with strings -n 8
Use strings with minimum length filter as extraction primitive. -n 8 floor filters most noise (short fragments, padding, address-table junk) without losing meaningful identifiers, which are almost always longer than 8 characters.
strings -n 8 path/to/binary > /tmp/binary-strings.txt
Then run catalog match against /tmp/binary-strings.txt (any line-oriented matcher: grep -F -f markers.txt, ripgrep, or small script).
Caveats:
- Lower minimums (
-n 4,-n 6) flood output with binary garbage + minified-symbol noise; diagnostic-to-corroborating distinction collapses - Higher minimums (
-n 12+) miss short flag identifiers + config keys - Some bundlers compress or encode strings; if
stringsreturns near-empty output, binary may need bundle-extraction first (out of scope for this skill)
Got: Line-per-string output of 1k-100k lines, depending on binary size. Manual inspection should reveal recognizable identifiers in first 100 lines.
If fail: Output empty or unrecognizable? Binary probably packed, encrypted, or shipped as bytecode format strings cannot read. Stop + resolve at extraction layer; do not record baseline from unreadable scan.
Step 7: Extend Baselines Forward Without Rewriting Past Records
When new system or marker added to catalog, only forward versions are scanned for it. Past version records remain as originally written.
Why: prior-version baselines are empirical evidence of what was scanned at time, not current model of what past version contained. Retroactively rewriting them with newly-discovered markers conflates "what we know now" with "what we observed then." Both useful; only one should live in baseline file.
If retroactive scan genuinely needed (e.g., to test whether new marker was present in version N-3), record as separate addendum:
addenda:
"1.4.0":
scan_date: "2026-04-15"
catalog_revision: "v7"
findings:
acme_new_system:
present: ["..."]
Original baselines["1.4.0"] entry untouched. Reader can see both original record + later retroactive scan, with respective catalog revisions.
Got: Baseline file grows monotonically forward; past records are append-only with optional addenda blocks. Catalog revisions are versioned so each scan can be tied back to catalog state it used.
If fail: Ever feel urge to edit past version's present list directly? Stop. Add addendum instead. Mutating past records loses ability to detect scanner regressions (Step 8 of any later scanner-validation pass relies on historical record being immutable).
Checks
- Catalog has explicit category tags on every marker (one of API / identity / config / telemetry / flag / function)
- Every marker assigned to exactly one system; no marker appears in two systems
- Weights span real range (some 10s, some 3-5s); weights not all identical
- Each scanned version has record with
present,absent,scoreper tracked system - Never-published versions explicitly annotated, not silently omitted
- Each system has both
full+partialthresholds; findings labeled accordingly -
strings -n 8is extraction primitive (or documented equivalent for non-text binaries) - Past version records unchanged by latest scan; new findings live in addenda blocks if retroactive
Pitfalls
- Recording specific findings as the catalog. Catalog should describe marker categories + shapes, not enumerate version-pinned literals. Catalogs full of finding-shaped entries decay fast + are highest leak risk if accidentally published.
- Capturing minified identifiers. Names like
_p3aorq9Xrename on every rebuild. Even if they match today, they are noise tomorrow. Stay with semantically meaningful identifiers. - Conflating telemetry events with feature flags. They share naming conventions in many harnesses but play different roles. Tag them by category (Step 1) so per-category analysis stays clean.
- Silently skipping never-published versions. Gap in version sequence with no annotation looks like missed scan. Annotate explicitly:
_annotation: "never-published". - Setting thresholds before any baseline data exists. First scan establishes empirical weight totals; tune thresholds against that, not in advance.
- Rewriting prior version records when catalog grows. Past records are evidence; addenda are supported pattern for retroactive scans.
- Trusting empty scan output. Zero markers found does not always mean "absent." Confirm binary readable + catalog IDs match exactly before declaring removal.
- Treating
strings -n 4as more thorough than-n 8. Lower minimums add noise faster than signal. Diagnostic markers are essentially always 8+ characters.
See Also
security-audit-codebase— shared discipline; both pipelines treat marker presence as finding, with different downstream consumersaudit-dependency-versions— extends same version-tracking rigor to external dependency manifests; this skill applies it to binary artifactsprobe-feature-flag-state— Phase 2-3 follow-up; consumes baselines to classify flag rollout state (live / opt-in / dark / removed)conduct-empirical-wire-capture— Phase 4 follow-up; validates inferred behavior against actual harness trafficredact-for-public-disclosure— Phase 5 follow-up; governs which findings can leave private workspace
GitHub 저장소
연관 스킬
qmd
개발qmd는 BM25, 벡터 임베딩, 재순위화를 결합한 하이브리드 검색을 통해 로컬 파일을 색인화하고 검색할 수 있는 로컬 검색 및 색인화 CLI 도구입니다. 명령줄 사용과 Claude 통합을 위한 MCP(Model Context Protocol) 모드를 모두 지원합니다. 이 도구는 임베딩에 Ollama를 사용하고 색인을 로컬에 저장하여 터미널에서 직접 문서나 코드베이스를 검색하는 데 이상적입니다.
subagent-driven-development
개발이 스킬은 각 독립적인 작업마다 새로운 하위 에이전트를 배치하고 작업 사이에 코드 리뷰를 진행하여 구현 계획을 실행합니다. 이 리뷰 프로세스를 통해 품질 게이트를 유지하면서 빠른 반복 작업을 가능하게 합니다. 동일한 세션 내에서 대부분 독립적인 작업을 진행할 때 내장된 품질 검증과 함께 지속적인 진행을 보장하기 위해 사용하세요.
mcporter
개발mcporter 스킬은 개발자가 Claude에서 직접 Model Context Protocol(MCP) 서버를 관리하고 호출할 수 있도록 합니다. 이 스킬은 사용 가능한 서버를 나열하고, 인수를 사용해 해당 서버의 도구를 호출하며, 인증 및 데몬 생명주기를 처리하는 명령어를 제공합니다. 개발 워크플로우에서 MCP 서버 기능을 통합하고 테스트할 때 이 스킬을 사용하세요.
adk-deployment-specialist
개발이 스킬은 A2A 프로토콜을 사용하여 Vertex AI ADK 에이전트를 배포하고 오케스트레이션하며, AgentCard 검색, 작업 제출, 코드 실행 샌드박스 및 메모리 뱅크와 같은 지원 도구를 관리합니다. Python, Java 또는 Go 언어로 순차, 병렬 또는 루프 오케스트레이션 패턴을 갖춘 다중 에이전트 시스템 구축을 가능하게 합니다. Google Cloud에서 ADK 에이전트 배포 또는 에이전트 워크플로우 오케스트레이션을 요청받았을 때 사용하세요.
