MCP HubMCP Hub
스킬 목록으로 돌아가기

monitor-binary-version-baselines

pjt222
업데이트됨 2 days ago
5 조회
17
2
17
GitHub에서 보기
개발aiapi

정보

이 스킬은 API와 플래그 같은 분류된 마커를 감지하여 가중치 점수 체계로 기준선을 설정함으로써 CLI 바이너리의 버전별 변화를 추적합니다. 개발자들이 기능 라이프사이클을 모니터링하고, 숨겨진 기능을 발견하며, 과거 바이너리 대비 스캐닝 도구를 검증하는 데 도움을 줍니다. 릴리스 간 바이너리 콘텐츠 진화에 대한 종단 분석에 활용하세요.

빠른 설치

Claude Code

추천
기본
npx skills add pjt222/agent-almanac -a claude-code
플러그인 명령대체
/plugin add https://github.com/pjt222/agent-almanac
Git 클론대체
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/monitor-binary-version-baselines

Claude Code에서 이 명령을 복사하여 붙여넣어 스킬을 설치하세요

문서

Monitor Binary Version Baselines

Build + maintain comparable, version-keyed records of feature-system markers in CLI harness binary → additions/removals/dark-launched detected mechanically across releases.

Use When

  • Track feature lifecycle across releases of closed-source CLI harness
  • Probe dark-launched (shipped but gated off) or quietly-removed
  • Verify scanner still detects known-good markers on old binaries (regression-test scanner)
  • Phase 1 substrate consumed by later phases (flag discovery, dark-launch detection, wire capture)
  • Ad-hoc grep answers "X present today" but you need "how has system X+Y+Z moved across versions"

In

  • Required: ≥1 installed binary versions of same CLI harness (or extracted bundles)
  • Required: Working catalog file for marker defs (created on first run, extended)
  • Optional: Prior baseline file (extended in place, never rewritten)
  • Optional: List of never-published versions (skipped, withdrawn)
  • Optional: List of feature-systems already tracked, to extend not re-discover

Do

Step 1: Markers by Category

Pick stable semantically meaningful identifiers — not minified names bundler will rename next release.

Six categories:

  • API — endpoint paths, method names exposed in network surface
  • Identity — internal product names, codenames, version sentinels
  • Config — recognized keys in user-facing config files
  • Telemetry — event names emitted to analytics pipeline
  • Flag — feature-gate keys consumed by gate predicates
  • Function — well-known string constants in handlers (err msgs, log labels)

Avoid: short minified-looking (_a1, bX, two-letter+digits), inline literals that change w/ text revision, anything matching bundler's internal naming.

→ Each candidate has category tag + short justification ("appears in user-facing docs," "stable across N prior releases"). Typical first pass: 20-50 markers per system.

If err: markers vanish across consecutive minor versions → catalog captured rebuild-volatile not stable. Drop entries, broaden to longer semantically anchored substrings.

Step 2: Group by Feature-System

Bundle markers per system table for independently-evolving capability. "System" = coherent marker set whose presence/absence moves together (shared lifecycle).

Why: per-system scoring prevents cross-contamination. Absence of one system's markers must not suppress detection of another. Aggregate counts across unrelated systems = uninformative.

Working catalog shape (pseudocode):

catalog:
  acme_widget_v3:
    markers:
      - { id: "acme_widget_v3_init",         category: function, weight: 10 }
      - { id: "acme.widget.v3.dialog.open",  category: telemetry, weight: 5 }
      - { id: "ACME_WIDGET_V3_DISABLE",      category: flag,     weight: 10 }
  acme_other_system:
    markers:
      - ...

→ Each system has own marker list; no marker in two systems. New system = new top-level entry, never move retroactively.

If err: hard to assign (overlap, ambiguity) → defs too coarse. Split system or accept "shared substrate", exclude from per-system scoring.

Step 3: Weight by Signal Strength

Per marker:

  • 10 = diagnostic-alone — unique enough alone confirms (long, system-specific, no other code path emits)
  • 3-5 = corroborating only — too generic alone, contributes to aggregate (short telemetry suffix reused)

Convention not specific numbers. Spread "diagnostic" vs. "corroborating" matters more than exact integers. Thresholds Step 5 must distinguish "one strong signal" from "many weak signals."

→ Each marker weighted. Distribution skews toward corroborating (3-5), small number diagnostic-alone (10) per system.

If err: every marker = 10 → scoring loses resolution, partial-presence impossible. Demote markers recurring across systems or in unrelated handlers.

Step 4: Per-Version Baselines

Per scanned version, record both present + absent keyed by version. Both = evidence: absent in N is informative when N+1 reintroduces.

Shape:

baselines:
  "1.4.0":
    acme_widget_v3:
      present: ["acme_widget_v3_init", "ACME_WIDGET_V3_DISABLE"]
      absent:  ["acme.widget.v3.dialog.open"]
      score:   20
  "1.5.0":
    acme_widget_v3:
      present: ["acme_widget_v3_init", "ACME_WIDGET_V3_DISABLE", "acme.widget.v3.dialog.open"]
      absent:  []
      score:   25
  "1.4.1":
    _annotation: "never-published; skipped from upstream release timeline"

Never-published get explicit annotation, not silent omission. Silent skips look like data loss to next reader.

→ Every version has one record per tracked system: present, absent, score populated, or explicit _annotation if never-published.

If err: scan yields zero markers for previously-present system → don't assume removal until you confirm binary path correct, strings cmd produced output, marker IDs match catalog exactly. False zeroes corrupt longitudinal record.

Step 5: Thresholds for Full + Partial Detection

Two gates per system on aggregate score:

  • full — score above which system = present-and-active in this version
  • partial — score above which system = shipped-but-incomplete (some markers, below full)

Below partial = absent (or not-yet-present, depending on direction).

thresholds:
  acme_widget_v3:
    full:    25
    partial: 10

Choose: full = sum of weights healthy install would emit; partial = one diagnostic + corroborating signal. Re-tune w/ several versions evidence.

→ Each scan: per-system finding full | partial | absent. partial warrants investigation — dark-launch + removal candidates.

If err: every system reports partial across every version → thresholds too sensitive (set higher than markers can sum). Recalibrate vs. known-good version where system verifiably live.

Step 6: Scan w/ strings -n 8

strings w/ min length filter as extraction primitive. -n 8 floor filters most noise (short fragments, padding, address-table junk) w/o losing meaningful identifiers (almost always >8 chars).

strings -n 8 path/to/binary > /tmp/binary-strings.txt

Then catalog match vs. /tmp/binary-strings.txt (any line-oriented matcher: grep -F -f markers.txt, ripgrep, small script).

Caveats:

  • Lower (-n 4, -n 6) → flood w/ binary garbage + minified-symbol noise; diagnostic/corroborating collapses
  • Higher (-n 12+) → miss short flag identifiers + config keys
  • Some bundlers compress/encode → near-empty output → may need bundle-extraction first (out of scope)

→ Line-per-string out 1k-100k lines depending on binary size. Manual inspection reveals recognizable identifiers in first 100 lines.

If err: empty/unrecognizable → binary likely packed, encrypted, or bytecode format strings can't read. Stop, resolve at extraction layer. Don't record baseline from unreadable scan.

Step 7: Extend Forward, Don't Rewrite Past Records

New system or marker added to catalog → only forward versions scanned. Past records remain as originally written.

Why: prior baselines = empirical evidence of what was scanned at the time, not current model of what past version contained. Retroactive rewriting conflates "what we know now" w/ "what we observed then." Both useful, only one in baseline file.

Retroactive scan genuinely needed (test if new marker was in N-3) → record as separate addendum:

addenda:
  "1.4.0":
    scan_date: "2026-04-15"
    catalog_revision: "v7"
    findings:
      acme_new_system:
        present: ["..."]

Original baselines["1.4.0"] untouched. Reader sees both original + later retroactive, w/ respective catalog revisions.

→ Baseline grows monotonically forward; past records append-only w/ optional addenda. Catalog revisions versioned so each scan tied back to catalog state used.

If err: urge to edit past version's present list directly → stop. Add addendum. Mutating past records loses ability to detect scanner regressions (later scanner-validation relies on historical record being immutable).

Check

  • Catalog has explicit category tags every marker (API/identity/config/telemetry/flag/function)
  • Every marker assigned to exactly one system; no duplicates
  • Weights span real range (some 10s, some 3-5); not all identical
  • Each scanned version: present + absent + score per tracked system
  • Never-published explicitly annotated, not silent omission
  • Each system: full + partial thresholds; findings labeled
  • strings -n 8 extraction primitive (or documented equivalent for non-text binaries)
  • Past version records unchanged by latest scan; new findings in addenda if retroactive

Traps

  • Recording specific findings as catalog: Catalog should describe marker categories + shapes, not enumerate version-pinned literals. Finding-shaped entries decay fast + highest leak risk if published.
  • Capturing minified identifiers: _p3a, q9X rename every rebuild. Match today = noise tomorrow. Stay w/ semantically meaningful.
  • Conflating telemetry vs. flags: Share naming conventions in many harnesses but different roles. Tag by category (Step 1) so per-category analysis stays clean.
  • Silent skip never-published: Gap w/ no annotation looks like missed scan. Annotate: _annotation: "never-published".
  • Thresholds before baseline data: First scan establishes empirical weight totals; tune vs. that, not in advance.
  • Rewriting prior records when catalog grows: Past records = evidence; addenda = supported pattern for retroactive.
  • Trust empty scan output: Zero found ≠ "absent." Confirm binary readable + catalog IDs match exactly before declaring removal.
  • strings -n 4 more thorough than -n 8: Lower mins add noise faster than signal. Diagnostic markers essentially always 8+ chars.

  • security-audit-codebase — shared discipline; both treat marker presence as finding, different downstream consumers
  • audit-dependency-versions — extends same version-tracking rigor to external dep manifests; this applies to binary artifacts
  • probe-feature-flag-state — Phase 2-3 follow-up; consumes baselines to classify flag rollout state (live/opt-in/dark/removed)
  • conduct-empirical-wire-capture — Phase 4 follow-up; validates inferred behavior vs. actual harness traffic
  • redact-for-public-disclosure — Phase 5 follow-up; governs which findings can leave private workspace

GitHub 저장소

pjt222/agent-almanac
경로: i18n/caveman-ultra/skills/monitor-binary-version-baselines
0
agentsagentskillsai-assisted-developmentclaude-codeskillsteams

연관 스킬

qmd

개발

qmd는 BM25, 벡터 임베딩, 재순위화를 결합한 하이브리드 검색을 통해 로컬 파일을 색인화하고 검색할 수 있는 로컬 검색 및 색인화 CLI 도구입니다. 명령줄 사용과 Claude 통합을 위한 MCP(Model Context Protocol) 모드를 모두 지원합니다. 이 도구는 임베딩에 Ollama를 사용하고 색인을 로컬에 저장하여 터미널에서 직접 문서나 코드베이스를 검색하는 데 이상적입니다.

스킬 보기

subagent-driven-development

개발

이 스킬은 각 독립적인 작업마다 새로운 하위 에이전트를 배치하고 작업 사이에 코드 리뷰를 진행하여 구현 계획을 실행합니다. 이 리뷰 프로세스를 통해 품질 게이트를 유지하면서 빠른 반복 작업을 가능하게 합니다. 동일한 세션 내에서 대부분 독립적인 작업을 진행할 때 내장된 품질 검증과 함께 지속적인 진행을 보장하기 위해 사용하세요.

스킬 보기

mcporter

개발

mcporter 스킬은 개발자가 Claude에서 직접 Model Context Protocol(MCP) 서버를 관리하고 호출할 수 있도록 합니다. 이 스킬은 사용 가능한 서버를 나열하고, 인수를 사용해 해당 서버의 도구를 호출하며, 인증 및 데몬 생명주기를 처리하는 명령어를 제공합니다. 개발 워크플로우에서 MCP 서버 기능을 통합하고 테스트할 때 이 스킬을 사용하세요.

스킬 보기

adk-deployment-specialist

개발

이 스킬은 A2A 프로토콜을 사용하여 Vertex AI ADK 에이전트를 배포하고 오케스트레이션하며, AgentCard 검색, 작업 제출, 코드 실행 샌드박스 및 메모리 뱅크와 같은 지원 도구를 관리합니다. Python, Java 또는 Go 언어로 순차, 병렬 또는 루프 오케스트레이션 패턴을 갖춘 다중 에이전트 시스템 구축을 가능하게 합니다. Google Cloud에서 ADK 에이전트 배포 또는 에이전트 워크플로우 오케스트레이션을 요청받았을 때 사용하세요.

스킬 보기