ml-training-debugger
About
The ml-training-debugger skill diagnoses and fixes unstable or failing machine learning training runs, handling issues like metric divergence, NaNs, and data problems. It provides validated remediations with traceable evidence, using tools like Bash, Grep, and Task for analysis. Use this specialist skill for triaging training incidents, not for new model development or lightweight prototyping.
Quick Install
Claude Code
Recommended: /plugin add https://github.com/DNYoussef/context-cascade
Or clone directly into your skills directory: git clone https://github.com/DNYoussef/context-cascade.git ~/.claude/skills/ml-training-debugger
Copy and paste this command in Claude Code to install this skill.
Documentation
STANDARD OPERATING PROCEDURE
Purpose
Rapidly triage ML training incidents (instability, divergence, degraded metrics) and deliver validated remediations with traceable evidence.
Triggers
- Positive: Failing/unstable training runs, unexplained metric drops, NaNs/exploding gradients, data/label issues, reproducibility gaps.
- Negative: New model development without an incident (route to ml-expert) or lightweight prototyping (route to ml).
Guardrails
- Structure-first: ensure SKILL.md, README, examples/, tests/, resources/, and agents/ exist; create missing docs before work.
- Constraint extraction: clarify environment (hardware/framework), data provenance, metric targets, and incident timeline.
- Validation discipline: reproduce issue, isolate variables (data/model/optim), run minimal change tests; adversarially probe for leakage and nondeterminism.
- Confidence ceiling enforced (inference/report 0.70; research 0.85; observation/definition 0.95) with evidence per finding.
- Safety: preserve checkpoints/logs; avoid destructive changes; keep rollback path ready.
Execution Phases
- Intake & Evidence Gathering
- Collect logs, metrics, configs, seeds, hardware info, and recent code changes (see the log-scan sketch below).
- Confirm baseline vs expected behavior and incident start time.
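Evidence gathering usually starts by pinpointing when the run went bad. The sketch below is a hypothetical, standard-library-only helper for locating the first nan/inf mention in plain-text training logs; the logs/ directory, *.log glob, and regex are assumptions to adapt to the actual logger and format.

```python
"""Find the first nan/inf mention in plain-text training logs (hypothetical layout)."""
import re
from pathlib import Path

NAN_PATTERN = re.compile(r"\b(nan|inf)\b", re.IGNORECASE)

def first_anomaly(log_dir: str = "logs") -> None:
    # Walk logs in name order and report the first line that mentions nan/inf,
    # which usually brackets the incident start time.
    for log_file in sorted(Path(log_dir).glob("*.log")):
        for lineno, line in enumerate(log_file.read_text(errors="replace").splitlines(), start=1):
            if NAN_PATTERN.search(line):
                print(f"{log_file}:{lineno}: {line.strip()}")
                return
    print("No nan/inf found under", log_dir)

if __name__ == "__main__":
    first_anomaly()
```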
- Reproduction & Isolation
- Reproduce on the smallest dataset slice; fix seeds; disable randomness (see the determinism sketch below).
- Binary-search variables: data batches, preprocessing, model changes, optimizer settings.
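Before bisecting variables, pin every source of randomness. A minimal sketch, assuming a PyTorch stack (the skill itself is framework-agnostic); the seed value and slice size are placeholders.

```python
"""Pin RNGs and force deterministic kernels before reproducing on a small slice."""
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 1234) -> None:
    # Seed every RNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required by some CUDA ops when deterministic algorithms are enforced.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

# Example: reproduce on the smallest slice that still shows the failure.
# make_deterministic()
# tiny = torch.utils.data.Subset(full_dataset, range(256))
# loader = torch.utils.data.DataLoader(tiny, batch_size=32, shuffle=False)
```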
- Hypothesis & Experiment Plan
- Form hypotheses (data corruption, label leakage, optimizer instability, precision issues).
- Plan targeted experiments with success/fail criteria (see the plan sketch below).
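One lightweight way to keep success/fail criteria explicit is to record each planned experiment as structured data before anything is changed. A sketch only; the fields and example values are illustrative, not prescribed by the skill.

```python
"""Record hypotheses and planned experiments with explicit success criteria."""
from dataclasses import dataclass

@dataclass
class Experiment:
    hypothesis: str          # what is believed to cause the incident
    change: str              # the single variable being modified
    success_criterion: str   # measurable outcome that confirms or refutes it
    result: str = "pending"  # filled in after the run

plan = [
    Experiment(
        hypothesis="peak learning rate too high after the warmup change",
        change="halve peak LR, keep schedule shape",
        success_criterion="global gradient norm stays below 10 for 500 steps",
    ),
]
```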
- Fix & Validation
- Implement minimal fixes; run controlled tests (train/val curves, gradient norms, loss stats); see the monitoring sketch below.
- Validate against performance/latency targets; ensure no regression on baseline metrics.
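Controlled tests are easier to compare before and after a fix when the step itself reports gradient norms and flags non-finite losses. A minimal sketch, assuming PyTorch; model, optimizer, loss_fn, and batch stand in for whatever the incident run actually uses. Calling clip_grad_norm_ with an infinite max_norm measures the global norm without actually clipping.

```python
"""Instrumented training step: log loss and global gradient norm, fail fast on non-finite values."""
import math
import torch

def training_step(model, optimizer, loss_fn, batch, step: int) -> float:
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Global gradient norm; spikes or inf/nan here typically precede divergence.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
    if not (math.isfinite(loss.item()) and math.isfinite(grad_norm.item())):
        raise RuntimeError(f"step {step}: non-finite loss={loss.item()} grad_norm={grad_norm.item()}")
    print(f"step {step}: loss={loss.item():.4f} grad_norm={grad_norm.item():.4f}")
    optimizer.step()
    return loss.item()
```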
- Handoff & Prevention
- Document root cause, applied fixes, and remaining risks.
- Add monitors/tests to prevent recurrence (see the guard-test sketch below); package rollback instructions.
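A cheap recurrence guard is a smoke test that runs a short, seeded slice of training and fails on non-finite or non-improving loss. A sketch only, assuming pytest and the instrumented training_step above; the fixtures (tiny_loader, model, optimizer, loss_fn), step count, and thresholds are placeholders for the project's own.

```python
"""Smoke test: a short seeded run must keep loss finite and trending down."""
import math

# Hypothetical location of the instrumented step from the monitoring sketch.
from training_utils import training_step

def test_short_run_stays_finite(tiny_loader, model, optimizer, loss_fn):
    losses = []
    for step, batch in enumerate(tiny_loader):
        loss = training_step(model, optimizer, loss_fn, batch, step)
        assert math.isfinite(loss), f"non-finite loss at step {step}"
        losses.append(loss)
        if step >= 50:  # keep the guard fast enough for CI
            break
    # Loose sanity check: the last 10 losses should not exceed the first 10 on average.
    assert sum(losses[-10:]) / 10 <= sum(losses[:10]) / 10, "loss did not trend down"
```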
Output Format
- Incident summary and constraints.
- Reproduction steps, hypotheses, and experiments run.
- Fixes applied with before/after metrics.
- Prevention plan and owners.
- Confidence statement with ceiling.
Validation Checklist
- Issue reproduced with fixed seeds and minimal data slice.
- Hypotheses tested; experiments documented.
- Metrics reported per split; gradients/loss inspected where relevant.
- Regression checks executed; rollback path documented.
- Confidence ceiling stated.
VCL COMPLIANCE APPENDIX (Internal)
[[HON:teineigo]] [[MOR:root:H-T-A]] [[COM:Hata+Teshis+Analiz]] [[CLS:ge_skill]] [[EVD:-DI<gozlem>]] [[ASP:nesov.]] [[SPC:path:/skills/specialists/ml-training-debugger]]
[[HON:teineigo]] [[MOR:root:E-P-S]] [[COM:Epistemik+Tavan]] [[CLS:ge_rule]] [[EVD:-DI<gozlem>]] [[ASP:nesov.]] [[SPC:coord:EVD-CONF]]
Confidence: 0.74 (ceiling: inference 0.70) - SOP rebuilt with prompt-architect constraint discipline and skill-forge structure/validation rules.
GitHub Repository
Related Skills
sglang
SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
evaluating-llms-harness
This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.
llamaguard
LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.
langchain
LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
