ml-training-debugger
About
The ml-training-debugger skill diagnoses and fixes unstable or failing machine learning training runs, handling issues like metric divergence, NaNs, and data problems. It provides validated remediations with traceable evidence, using tools like Bash, Grep, and Task for analysis. Use this specialist skill for triaging training incidents, not for new model development or lightweight prototyping.
Quick Install
Claude Code
Recommended: /plugin add https://github.com/DNYoussef/context-cascade
Or clone directly into your skills directory: git clone https://github.com/DNYoussef/context-cascade.git ~/.claude/skills/ml-training-debugger
Copy and paste this command in Claude Code to install this skill.
Documentation
STANDARD OPERATING PROCEDURE
Purpose
Rapidly triage ML training incidents (instability, divergence, degraded metrics) and deliver validated remediations with traceable evidence.
Triggers
- Positive: Failing/unstable training runs, unexplained metric drops, NaNs/exploding gradients, data/label issues, reproducibility gaps.
- Negative: New model development without an incident (route to ml-expert) or lightweight prototyping (route to ml).
Guardrails
- Structure-first: ensure SKILL.md, README, examples/, tests/, resources/, and agents/ exist; create missing docs before work.
- Constraint extraction: clarify environment (hardware/framework), data provenance, metric targets, and incident timeline.
- Validation discipline: reproduce issue, isolate variables (data/model/optim), run minimal change tests; adversarially probe for leakage and nondeterminism.
- Confidence ceiling enforced (inference/report 0.70; research 0.85; observation/definition 0.95) with evidence per finding.
- Safety: preserve checkpoints/logs; avoid destructive changes; keep rollback path ready.
Execution Phases
- Intake & Evidence Gathering
- Collect logs, metrics, configs, seeds, hardware info, and recent code changes (see the log-scan sketch below).
- Confirm baseline vs expected behavior and incident start time.
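Evidence gathering usually starts by pinpointing when the run went bad. The sketch below is a hypothetical, standard-library-only helper for locating the first nan/inf mention in plain-text training logs; the logs/ directory, *.log glob, and regex are assumptions to adapt to the actual logger and format.

```python
"""Find the first nan/inf mention in plain-text training logs (hypothetical layout)."""
import re
from pathlib import Path

NAN_PATTERN = re.compile(r"\b(nan|inf)\b", re.IGNORECASE)

def first_anomaly(log_dir: str = "logs") -> None:
    # Walk logs in name order and report the first line that mentions nan/inf,
    # which usually brackets the incident start time.
    for log_file in sorted(Path(log_dir).glob("*.log")):
        for lineno, line in enumerate(log_file.read_text(errors="replace").splitlines(), start=1):
            if NAN_PATTERN.search(line):
                print(f"{log_file}:{lineno}: {line.strip()}")
                return
    print("No nan/inf found under", log_dir)

if __name__ == "__main__":
    first_anomaly()
```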
- Reproduction & Isolation
- Reproduce on the smallest dataset slice; fix seeds; disable randomness (see the determinism sketch below).
- Binary-search variables: data batches, preprocessing, model changes, optimizer settings.
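Before bisecting variables, pin every source of randomness. A minimal sketch, assuming a PyTorch stack (the skill itself is framework-agnostic); the seed value and slice size are placeholders.

```python
"""Pin RNGs and force deterministic kernels before reproducing on a small slice."""
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 1234) -> None:
    # Seed every RNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required by some CUDA ops when deterministic algorithms are enforced.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

# Example: reproduce on the smallest slice that still shows the failure.
# make_deterministic()
# tiny = torch.utils.data.Subset(full_dataset, range(256))
# loader = torch.utils.data.DataLoader(tiny, batch_size=32, shuffle=False)
```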
- Hypothesis & Experiment Plan
- Form hypotheses (data corruption, label leakage, optimizer instability, precision issues).
- Plan targeted experiments with success/fail criteria (see the plan sketch below).
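One lightweight way to keep success/fail criteria explicit is to record each planned experiment as structured data before anything is changed. A sketch only; the fields and example values are illustrative, not prescribed by the skill.

```python
"""Record hypotheses and planned experiments with explicit success criteria."""
from dataclasses import dataclass

@dataclass
class Experiment:
    hypothesis: str          # what is believed to cause the incident
    change: str              # the single variable being modified
    success_criterion: str   # measurable outcome that confirms or refutes it
    result: str = "pending"  # filled in after the run

plan = [
    Experiment(
        hypothesis="peak learning rate too high after the warmup change",
        change="halve peak LR, keep schedule shape",
        success_criterion="global gradient norm stays below 10 for 500 steps",
    ),
]
```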
- Fix & Validation
- Implement minimal fixes; run controlled tests (train/val curves, gradient norms, loss stats); see the monitoring sketch below.
- Validate against performance/latency targets; ensure no regression on baseline metrics.
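Controlled tests are easier to compare before and after a fix when the step itself reports gradient norms and flags non-finite losses. A minimal sketch, assuming PyTorch; model, optimizer, loss_fn, and batch stand in for whatever the incident run actually uses. Calling clip_grad_norm_ with an infinite max_norm measures the global norm without actually clipping.

```python
"""Instrumented training step: log loss and global gradient norm, fail fast on non-finite values."""
import math
import torch

def training_step(model, optimizer, loss_fn, batch, step: int) -> float:
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Global gradient norm; spikes or inf/nan here typically precede divergence.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
    if not (math.isfinite(loss.item()) and math.isfinite(grad_norm.item())):
        raise RuntimeError(f"step {step}: non-finite loss={loss.item()} grad_norm={grad_norm.item()}")
    print(f"step {step}: loss={loss.item():.4f} grad_norm={grad_norm.item():.4f}")
    optimizer.step()
    return loss.item()
```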
- Handoff & Prevention
- Document root cause, applied fixes, and remaining risks.
- Add monitors/tests to prevent recurrence (see the guard-test sketch below); package rollback instructions.
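A cheap recurrence guard is a smoke test that runs a short, seeded slice of training and fails on non-finite or non-improving loss. A sketch only, assuming pytest and the instrumented training_step above; the fixtures (tiny_loader, model, optimizer, loss_fn), step count, and thresholds are placeholders for the project's own.

```python
"""Smoke test: a short seeded run must keep loss finite and trending down."""
import math

# Hypothetical location of the instrumented step from the monitoring sketch.
from training_utils import training_step

def test_short_run_stays_finite(tiny_loader, model, optimizer, loss_fn):
    losses = []
    for step, batch in enumerate(tiny_loader):
        loss = training_step(model, optimizer, loss_fn, batch, step)
        assert math.isfinite(loss), f"non-finite loss at step {step}"
        losses.append(loss)
        if step >= 50:  # keep the guard fast enough for CI
            break
    # Loose sanity check: the last 10 losses should not exceed the first 10 on average.
    assert sum(losses[-10:]) / 10 <= sum(losses[:10]) / 10, "loss did not trend down"
```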
Output Format
- Incident summary and constraints.
- Reproduction steps, hypotheses, and experiments run.
- Fixes applied with before/after metrics.
- Prevention plan and owners.
- Confidence statement with ceiling.
Validation Checklist
- Issue reproduced with fixed seeds and minimal data slice.
- Hypotheses tested; experiments documented.
- Metrics reported per split; gradients/loss inspected where relevant.
- Regression checks executed; rollback path documented.
- Confidence ceiling stated.
VCL COMPLIANCE APPENDIX (Internal)
[[HON:teineigo]] [[MOR:root:H-T-A]] [[COM:Hata+Teshis+Analiz]] [[CLS:ge_skill]] [[EVD:-DI<gozlem>]] [[ASP:nesov.]] [[SPC:path:/skills/specialists/ml-training-debugger]]
[[HON:teineigo]] [[MOR:root:E-P-S]] [[COM:Epistemik+Tavan]] [[CLS:ge_rule]] [[EVD:-DI<gozlem>]] [[ASP:nesov.]] [[SPC:coord:EVD-CONF]]
Confidence: 0.74 (ceiling: inference 0.70) - SOP rebuilt with prompt-architect constraint discipline and skill-forge structure/validation rules.
GitHub Repository
Related Skills
sglang
SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
evaluating-llms-harness
This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.
llamaguard
LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.
langchain
LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
