MCP Hub

ml-training-debugger

DNYoussef
Updated: Today
53 views
View on GitHub
Test, AI

About

The ml-training-debugger skill diagnoses and fixes unstable or failing machine-learning training runs, addressing issues such as metric divergence, NaNs, and data problems. Through analysis that leverages tools such as Bash, Grep, and Task, it delivers validated fixes backed by traceable evidence. Use this specialist skill for triaging training incidents; do not use it for new model development or lightweight prototyping.

Quick Install

Claude Code

Recommended
Plugin command (recommended)
/plugin add https://github.com/DNYoussef/context-cascade
Git clone (alternative)
git clone https://github.com/DNYoussef/context-cascade.git ~/.claude/skills/ml-training-debugger

Copy and paste this command into Claude Code to install the skill.

Documentation

STANDARD OPERATING PROCEDURE

Purpose

Rapidly triage ML training incidents (instability, divergence, degraded metrics) and deliver validated remediations with traceable evidence.

Triggers

  • Positive: Failing/unstable training runs, unexplained metric drops, NaNs/exploding gradients, data/label issues, reproducibility gaps.
  • Negative: New model development without incident (route to ml-expert) or lightweight prototyping (route to ml).

Guardrails

  • Structure-first: ensure SKILL.md, README, examples/, tests/, resources/, and agents/ exist; create missing docs before work.
  • Constraint extraction: clarify environment (hardware/framework), data provenance, metric targets, and incident timeline.
  • Validation discipline: reproduce the issue, isolate variables (data/model/optim), run minimal-change tests; adversarially probe for leakage and nondeterminism (a leakage-probe sketch follows this list).
  • Confidence ceiling enforced (inference/report 0.70; research 0.85; observation/definition 0.95) with evidence per finding.
  • Safety: preserve checkpoints/logs; avoid destructive changes; keep rollback path ready.
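
A low-cost way to adversarially probe for leakage is to fingerprint every example and check for exact train/validation overlap. Below is a minimal Python sketch, assuming in-memory datasets of hashable records; all names are illustrative and not prescribed by the skill.

# Minimal leakage probe: flag validation examples whose exact content also
# appears in the training set. Near-duplicate detection would need more work.
import hashlib

def fingerprint(example) -> str:
    """Stable hash of a single example's serialized content."""
    return hashlib.sha256(repr(example).encode("utf-8")).hexdigest()

def leakage_report(train_examples, val_examples) -> dict:
    """Count validation examples that also occur verbatim in training data."""
    train_hashes = {fingerprint(x) for x in train_examples}
    leaked = [i for i, x in enumerate(val_examples) if fingerprint(x) in train_hashes]
    return {"val_size": len(val_examples), "leaked": len(leaked), "leaked_indices": leaked[:20]}

if __name__ == "__main__":
    train = [("the cat sat", 0), ("dogs bark", 1)]
    val = [("dogs bark", 1), ("fish swim", 0)]
    print(leakage_report(train, val))  # expects 1 leaked example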

Execution Phases

  1. Intake & Evidence Gathering
    • Collect logs, metrics, configs, seeds, hardware info, and recent code changes.
    • Confirm baseline vs expected behavior and incident start time.
  2. Reproduction & Isolation
    • Reproduce on the smallest dataset slice; fix seeds; disable sources of randomness (see the determinism sketch after this list).
    • Binary-search variables: data batches, preprocessing, model changes, optimizer settings.
  3. Hypothesis & Experiment Plan
    • Form hypotheses (data corruption, label leakage, optimizer instability, precision issues).
    • Plan targeted experiments with success/fail criteria.
  4. Fix & Validation
    • Implement minimal fixes; run controlled tests (train/val curves, gradient norms, loss stats); see the monitoring sketch after this list.
    • Validate against performance/latency targets; ensure no regression on baseline metrics.
  5. Handoff & Prevention
    • Document root cause, applied fixes, and remaining risks.
    • Add monitors/tests to prevent recurrence; package rollback instructions.
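
For Phase 2, reproduction usually starts by pinning every seed and switching off nondeterministic kernels. A minimal sketch, assuming a PyTorch workflow (other frameworks expose analogous switches); exact flags vary by version, so verify against your environment rather than treating this as the skill's mandated procedure.

# Determinism setup before attempting to reproduce the incident on a small data slice.
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    """Fix seeds and disable common sources of nondeterminism."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required for deterministic cuBLAS matmuls on some CUDA versions.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False

make_deterministic(seed=1234)
# Then rerun training on the smallest dataset slice that still reproduces the failure.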
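
For Phase 4, controlled tests typically log the loss, the global gradient norm, and any non-finite values at each step. A minimal monitoring sketch, again assuming PyTorch; the helper name is hypothetical and not part of the skill.

# Health check to call after loss.backward() and before optimizer.step().
import math

import torch

def training_health(model: torch.nn.Module, loss: torch.Tensor) -> dict:
    """Report loss value, global gradient norm, and parameters with non-finite gradients."""
    grad_sq = 0.0
    nonfinite = []
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if not torch.isfinite(p.grad).all():
            nonfinite.append(name)
        grad_sq += p.grad.detach().float().pow(2).sum().item()
    return {
        "loss": loss.item(),
        "loss_finite": math.isfinite(loss.item()),
        "grad_norm": math.sqrt(grad_sq),
        "nonfinite_grads": nonfinite,
    }

# Usage inside the training loop:
#   stats = training_health(model, loss)
#   if not stats["loss_finite"] or stats["nonfinite_grads"]:
#       raise RuntimeError(f"training health check failed: {stats}")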

Output Format

  • Incident summary and constraints.
  • Reproduction steps, hypotheses, and experiments run.
  • Fixes applied with before/after metrics.
  • Prevention plan and owners.
  • Confidence statement with ceiling.

Validation Checklist

  • Issue reproduced with fixed seeds and minimal data slice.
  • Hypotheses tested; experiments documented.
  • Metrics reported per split; gradients/loss inspected where relevant.
  • Regression checks executed; rollback path documented.
  • Confidence ceiling stated.

VCL COMPLIANCE APPENDIX (Internal)

[[HON:teineigo]] [[MOR:root:H-T-A]] [[COM:Hata+Teshis+Analiz]] [[CLS:ge_skill]] [[EVD:-DI<gozlem>]] [[ASP:nesov.]] [[SPC:path:/skills/specialists/ml-training-debugger]]

[[HON:teineigo]] [[MOR:root:E-P-S]] [[COM:Epistemik+Tavan]] [[CLS:ge_rule]] [[EVD:-DI<gozlem>]] [[ASP:nesov.]] [[SPC:coord:EVD-CONF]]

Confidence: 0.74 (ceiling: inference 0.70) - SOP rebuilt with prompt-architect constraint discipline and skill-forge structure/validation rules.

GitHub Repository

DNYoussef/context-cascade
Path: skills/specialists/ml-training-debugger

Related Skills

evaluating-llms-harness

Testing

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.
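
For a feel of the workflow, here is a minimal sketch of the harness's Python entrypoint, assuming lm-eval is installed; the model, task, and limit values are illustrative, and argument names can shift between harness versions.

# Quick smoke-run of one benchmark through the harness's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace backend; vLLM also supported
    model_args="pretrained=EleutherAI/pythia-160m",  # small illustrative model
    tasks=["gsm8k"],
    num_fewshot=5,
    limit=50,                                        # cap examples per task for a fast check
)
print(results["results"])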

View skill

sglang

Meta

SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
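
As a rough sketch, a running SGLang server can be queried through its OpenAI-compatible endpoint; the launch command, port, and model name below are assumptions that may differ between SGLang versions, so check the project's docs before relying on them.

# Assumes an SGLang server was started separately, e.g.:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Repeated system prompts across calls are where RadixAttention prefix caching pays off.
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "Answer strictly in JSON."},
        {"role": "user", "content": "List three failure modes of ML training."},
    ],
    temperature=0,
)
print(response.choices[0].message.content)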

View skill

cloudflare-turnstile

Meta

This skill provides comprehensive guidance for implementing Cloudflare Turnstile as a CAPTCHA-alternative bot protection system. It covers integration for forms, login pages, API endpoints, and frameworks like React/Next.js/Hono, while handling invisible challenges that maintain user experience. Use it when migrating from reCAPTCHA, debugging error codes, or implementing token validation and E2E tests.
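
The server-side half of that token validation is a single POST to Cloudflare's siteverify endpoint. A minimal Python sketch; the endpoint and field names follow Cloudflare's documented API, while the secret key and the way the token reaches your backend are placeholders.

# Validate a client-side Turnstile token on the server before trusting the request.
import requests

SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"

def verify_turnstile(token: str, secret_key: str, remote_ip: str | None = None) -> bool:
    """Return True only if Cloudflare confirms the token."""
    payload = {"secret": secret_key, "response": token}
    if remote_ip:
        payload["remoteip"] = remote_ip
    result = requests.post(SITEVERIFY_URL, data=payload, timeout=10).json()
    return bool(result.get("success"))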

View skill

langchain

Meta

LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
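
A minimal chain in LangChain's expression language, assuming langchain-core and langchain-openai are installed and an OpenAI key is configured; the provider and model choice are illustrative, not a recommendation of this skill.

# Prompt -> model -> parser pipeline composed with the | operator (LCEL).
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise assistant."),
    ("human", "Summarize in one sentence: {text}"),
])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

chain = prompt | llm | StrOutputParser()
print(chain.invoke({"text": "LangChain composes prompts, models, and parsers into pipelines."}))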

View skill