MCP Hub

ml-training-debugger

DNYoussef
Updated: Today
53 views
View on GitHub
Test, AI

About

The ml-training-debugger skill diagnoses and fixes unstable or failing machine-learning training runs, addressing issues such as metric divergence, NaNs, and data problems. Through analysis that leverages tools such as Bash, Grep, and Task, it delivers validated fixes backed by traceable evidence. Use this specialist skill for triaging training incidents; do not use it for new model development or lightweight prototyping.

Quick Install

Claude Code

Recommended
Plugin command (recommended)
/plugin add https://github.com/DNYoussef/context-cascade
Git clone (alternative)
git clone https://github.com/DNYoussef/context-cascade.git ~/.claude/skills/ml-training-debugger

Copy and paste this command into Claude Code to install the skill.

Documentation

STANDARD OPERATING PROCEDURE

Purpose

Rapidly triage ML training incidents (instability, divergence, degraded metrics) and deliver validated remediations with traceable evidence.

Triggers

  • Positive: Failing/unstable training runs, unexplained metric drops, NaNs/exploding gradients, data/label issues, reproducibility gaps.
  • Negative: New model development without incident (route to ml-expert) or lightweight prototyping (route to ml).

Guardrails

  • Structure-first: ensure SKILL.md, README, examples/, tests/, resources/, and agents/ exist; create missing docs before work.
  • Constraint extraction: clarify environment (hardware/framework), data provenance, metric targets, and incident timeline.
  • Validation discipline: reproduce the issue, isolate variables (data/model/optim), run minimal-change tests; adversarially probe for leakage and nondeterminism (a leakage-probe sketch follows this list).
  • Confidence ceiling enforced (inference/report 0.70; research 0.85; observation/definition 0.95) with evidence per finding.
  • Safety: preserve checkpoints/logs; avoid destructive changes; keep rollback path ready.
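
A low-cost way to adversarially probe for leakage is to fingerprint every example and check for exact train/validation overlap. Below is a minimal Python sketch, assuming in-memory datasets of hashable records; all names are illustrative and not prescribed by the skill.

# Minimal leakage probe: flag validation examples whose exact content also
# appears in the training set. Near-duplicate detection would need more work.
import hashlib

def fingerprint(example) -> str:
    """Stable hash of a single example's serialized content."""
    return hashlib.sha256(repr(example).encode("utf-8")).hexdigest()

def leakage_report(train_examples, val_examples) -> dict:
    """Count validation examples that also occur verbatim in training data."""
    train_hashes = {fingerprint(x) for x in train_examples}
    leaked = [i for i, x in enumerate(val_examples) if fingerprint(x) in train_hashes]
    return {"val_size": len(val_examples), "leaked": len(leaked), "leaked_indices": leaked[:20]}

if __name__ == "__main__":
    train = [("the cat sat", 0), ("dogs bark", 1)]
    val = [("dogs bark", 1), ("fish swim", 0)]
    print(leakage_report(train, val))  # expects 1 leaked example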

Execution Phases

  1. Intake & Evidence Gathering
    • Collect logs, metrics, configs, seeds, hardware info, and recent code changes.
    • Confirm baseline vs expected behavior and incident start time.
  2. Reproduction & Isolation
    • Reproduce on the smallest dataset slice; fix seeds; disable sources of randomness (see the determinism sketch after this list).
    • Binary-search variables: data batches, preprocessing, model changes, optimizer settings.
  3. Hypothesis & Experiment Plan
    • Form hypotheses (data corruption, label leakage, optimizer instability, precision issues).
    • Plan targeted experiments with success/fail criteria.
  4. Fix & Validation
    • Implement minimal fixes; run controlled tests (train/val curves, gradient norms, loss stats); see the monitoring sketch after this list.
    • Validate against performance/latency targets; ensure no regression on baseline metrics.
  5. Handoff & Prevention
    • Document root cause, applied fixes, and remaining risks.
    • Add monitors/tests to prevent recurrence; package rollback instructions.
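
For Phase 2, reproduction usually starts by pinning every seed and switching off nondeterministic kernels. A minimal sketch, assuming a PyTorch workflow (other frameworks expose analogous switches); exact flags vary by version, so verify against your environment rather than treating this as the skill's mandated procedure.

# Determinism setup before attempting to reproduce the incident on a small data slice.
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    """Fix seeds and disable common sources of nondeterminism."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required for deterministic cuBLAS matmuls on some CUDA versions.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False

make_deterministic(seed=1234)
# Then rerun training on the smallest dataset slice that still reproduces the failure.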
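
For Phase 4, controlled tests typically log the loss, the global gradient norm, and any non-finite values at each step. A minimal monitoring sketch, again assuming PyTorch; the helper name is hypothetical and not part of the skill.

# Health check to call after loss.backward() and before optimizer.step().
import math

import torch

def training_health(model: torch.nn.Module, loss: torch.Tensor) -> dict:
    """Report loss value, global gradient norm, and parameters with non-finite gradients."""
    grad_sq = 0.0
    nonfinite = []
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if not torch.isfinite(p.grad).all():
            nonfinite.append(name)
        grad_sq += p.grad.detach().float().pow(2).sum().item()
    return {
        "loss": loss.item(),
        "loss_finite": math.isfinite(loss.item()),
        "grad_norm": math.sqrt(grad_sq),
        "nonfinite_grads": nonfinite,
    }

# Usage inside the training loop:
#   stats = training_health(model, loss)
#   if not stats["loss_finite"] or stats["nonfinite_grads"]:
#       raise RuntimeError(f"training health check failed: {stats}")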

Output Format

  • Incident summary and constraints.
  • Reproduction steps, hypotheses, and experiments run.
  • Fixes applied with before/after metrics.
  • Prevention plan and owners.
  • Confidence statement with ceiling.

Validation Checklist

  • Issue reproduced with fixed seeds and minimal data slice.
  • Hypotheses tested; experiments documented.
  • Metrics reported per split; gradients/loss inspected where relevant.
  • Regression checks executed; rollback path documented.
  • Confidence ceiling stated.

VCL COMPLIANCE APPENDIX (Internal)

[[HON:teineigo]] [[MOR:root:H-T-A]] [[COM:Hata+Teshis+Analiz]] [[CLS:ge_skill]] [[EVD:-DI<gozlem>]] [[ASP:nesov.]] [[SPC:path:/skills/specialists/ml-training-debugger]]

[[HON:teineigo]] [[MOR:root:E-P-S]] [[COM:Epistemik+Tavan]] [[CLS:ge_rule]] [[EVD:-DI<gozlem>]] [[ASP:nesov.]] [[SPC:coord:EVD-CONF]]

Confidence: 0.74 (ceiling: inference 0.70) - SOP rebuilt with prompt-architect constraint discipline and skill-forge structure/validation rules.

GitHub Repository

DNYoussef/context-cascade
Path: skills/specialists/ml-training-debugger

Related Skills

evaluating-llms-harness

Testing

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.
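
For a feel of the workflow, here is a minimal sketch of the harness's Python entrypoint, assuming lm-eval is installed; the model, task, and limit values are illustrative, and argument names can shift between harness versions.

# Quick smoke-run of one benchmark through the harness's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace backend; vLLM also supported
    model_args="pretrained=EleutherAI/pythia-160m",  # small illustrative model
    tasks=["gsm8k"],
    num_fewshot=5,
    limit=50,                                        # cap examples per task for a fast check
)
print(results["results"])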

View skill

sglang

Meta

SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
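
As a rough sketch, a running SGLang server can be queried through its OpenAI-compatible endpoint; the launch command, port, and model name below are assumptions that may differ between SGLang versions, so check the project's docs before relying on them.

# Assumes an SGLang server was started separately, e.g.:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Repeated system prompts across calls are where RadixAttention prefix caching pays off.
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "Answer strictly in JSON."},
        {"role": "user", "content": "List three failure modes of ML training."},
    ],
    temperature=0,
)
print(response.choices[0].message.content)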

View skill

cloudflare-turnstile

Meta

This skill provides comprehensive guidance for implementing Cloudflare Turnstile as a CAPTCHA-alternative bot protection system. It covers integration for forms, login pages, API endpoints, and frameworks like React/Next.js/Hono, while handling invisible challenges that maintain user experience. Use it when migrating from reCAPTCHA, debugging error codes, or implementing token validation and E2E tests.
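
The server-side half of that token validation is a single POST to Cloudflare's siteverify endpoint. A minimal Python sketch; the endpoint and field names follow Cloudflare's documented API, while the secret key and the way the token reaches your backend are placeholders.

# Validate a client-side Turnstile token on the server before trusting the request.
import requests

SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"

def verify_turnstile(token: str, secret_key: str, remote_ip: str | None = None) -> bool:
    """Return True only if Cloudflare confirms the token."""
    payload = {"secret": secret_key, "response": token}
    if remote_ip:
        payload["remoteip"] = remote_ip
    result = requests.post(SITEVERIFY_URL, data=payload, timeout=10).json()
    return bool(result.get("success"))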

View skill

langchain

Meta

LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
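
A minimal chain in LangChain's expression language, assuming langchain-core and langchain-openai are installed and an OpenAI key is configured; the provider and model choice are illustrative, not a recommendation of this skill.

# Prompt -> model -> parser pipeline composed with the | operator (LCEL).
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise assistant."),
    ("human", "Summarize in one sentence: {text}"),
])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

chain = prompt | llm | StrOutputParser()
print(chain.invoke({"text": "LangChain composes prompts, models, and parsers into pipelines."}))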

View skill