verify-training-pipeline
About
This skill verifies CVlization training pipelines by checking their structure, build process, training execution, and metric logging. It's designed for validating new implementations or debugging training issues. The skill includes GPU environment awareness for shared systems, advising on resource checks before intensive runs.
Quick Install
Claude Code
Copy and paste one of these commands into Claude Code to install this skill:
# Recommended
/plugin add https://github.com/kungfuai/CVlization
# Or install manually
git clone https://github.com/kungfuai/CVlization.git ~/.claude/skills/verify-training-pipeline
Documentation
Verify Training Pipeline
Systematically verify that a CVlization training example is complete, properly structured, and functional.
When to Use
- Validating a new or modified training example
- Debugging training pipeline issues
- Ensuring example completeness before commits
- Verifying example works after CVlization updates
Important Context
Shared GPU Environment: This machine may be used by multiple users simultaneously. Before running GPU-intensive training:
- Check GPU memory availability with nvidia-smi (a pre-flight sketch follows this list)
- Wait for sufficient VRAM and low GPU utilization if needed
- Consider stopping other processes if you have permission
- If CUDA OOM errors occur, wait and retry when GPU is less busy
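The check can be scripted. Below is a minimal pre-flight sketch, assuming a single GPU at index 0 and rough thresholds of 8 GB free VRAM and 50% utilization; tune both for the example you are verifying.
#!/usr/bin/env bash
# Pre-flight GPU check (sketch): skip training if the shared GPU looks busy.
# Assumes GPU index 0; the 8000 MiB / 50% thresholds are illustrative.
FREE_MB=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits -i 0)
UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -i 0)
if [ "$FREE_MB" -lt 8000 ] || [ "$UTIL" -gt 50 ]; then
  echo "GPU busy: ${FREE_MB} MiB free, ${UTIL}% utilization -- wait before training."
  exit 1
fi
echo "GPU available: ${FREE_MB} MiB free, ${UTIL}% utilization."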
Verification Checklist
1. Structure Verification
Check that the example directory contains all required files:
# Navigate to example directory
cd examples/<capability>/<task>/<framework>/
# Expected structure:
# .
# ├── example.yaml # Required: CVL metadata
# ├── Dockerfile # Required: Container definition
# ├── build.sh # Required: Build script
# ├── train.sh # Required: Training script
# ├── train.py # Required: Training code
# ├── README.md # Recommended: Documentation
# ├── requirements.txt # Optional: Python dependencies
# ├── data/ # Optional: Data directory
# └── outputs/ # Created at runtime
Key files to check:
- example.yaml - Must have: name, capability, stability, presets (build, train); a minimal sketch follows this list
- Dockerfile - Should copy necessary files and install dependencies
- build.sh - Must set SCRIPT_DIR and call docker build
- train.sh - Must mount volumes correctly and pass environment variables
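For orientation, a minimal example.yaml might look like the sketch below; the field names come from the checklist above, while the values and preset commands are illustrative assumptions (check an existing example for the exact schema).
# Minimal example.yaml sketch -- values are illustrative, not canonical
name: my-example
capability: perception/object_detection
stability: experimental
presets:
  build: ./build.sh
  train: ./train.sh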
2. Build Verification
# Option 1: Build using script directly
./build.sh
# Option 2: Build using CVL CLI (recommended)
cvl run <example-name> build
# Verify image was created
docker images | grep <example-name>
# Expected: Image appears with recent timestamp
What to check:
- Build completes without errors (both methods)
- All dependencies install successfully
- Image size is reasonable (check for unnecessary files)
- cvl info <example-name> shows correct metadata
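To spot-check image size and freshness, docker's --format flag gives a one-line summary (replace the placeholder name with your example's image name):
# One-line image summary: name, tag, size, age
docker images "<example-name>" --format "{{.Repository}}:{{.Tag}}  {{.Size}}  {{.CreatedSince}}"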
3. Training Verification
Start training and monitor for proper initialization:
# Option 1: Run training using script directly
./train.sh
# Option 2: Run training using CVL CLI (recommended)
cvl run <example-name> train
# With custom parameters (if supported)
BATCH_SIZE=2 NUM_EPOCHS=1 ./train.sh
Immediate checks (first 30-60 seconds):
- Container starts without errors
- Dataset loads successfully
- Model initializes (check GPU memory with nvidia-smi)
- Training loop begins (first batch processes)
- Logs are being written
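One way to run these first-minute checks in a single pass is a short smoke script like the sketch below; the log file name is an assumption, and some examples only write to stdout or lightning_logs/.
# Smoke-test the first ~60 seconds of training, then stop the run (sketch)
./train.sh > train_startup.log 2>&1 &
TRAIN_PID=$!
sleep 60
echo "--- first training output ---"
head -n 100 train_startup.log
echo "--- GPU state ---"
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv,noheader
# Stop the smoke test; rerun normally once the checks above look healthy
kill "$TRAIN_PID" 2>/dev/null || true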
4. Metrics Verification
Monitor metrics appropriate to the task type:
Generative Tasks (LLM, Text Generation, Image Generation)
- Primary metric: train/loss (should decrease over time)
- Target: Loss consistently decreasing, not NaN/Inf
- Typical range: Depends on task (LLM: 2-5 initial, <1 after convergence)
- Check for: Gradient explosions, NaN losses
# For LLM/generative models
tail -f logs/train.log | grep -i "loss\|iter\|step"
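To catch the NaN/Inf failures mentioned above as soon as they appear, a pattern match on the log is usually enough (the log path is an assumption and varies by example):
# Flag NaN/Inf loss values in the training log
grep -inE "loss[^a-z]*(nan|inf)" logs/train.log && echo "WARNING: NaN/Inf loss detected"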
Classification Tasks (Image, Text, Document)
- Primary metrics: train/loss, train/accuracy, val/accuracy
- Target: Accuracy increasing, loss decreasing
- Typical range: Accuracy 0-100%, converges based on task difficulty
- Check for: Overfitting (train acc >> val acc)
# Watch accuracy metrics
tail -f lightning_logs/version_0/metrics.csv
# or for WandB
tail -f logs/train.log | grep -i "accuracy\|acc"
Object Detection Tasks
- Primary metrics: train/loss, val/map (mean Average Precision), val/map_50
- Target: mAP increasing, loss decreasing
- Typical range: mAP 0-100, good models achieve 30-90% depending on dataset
- Components: loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg
# Monitor detection metrics
tail -f logs/train.log | grep -i "map\|loss_classifier\|loss_box"
Segmentation Tasks (Semantic, Instance, Panoptic)
- Primary metrics: train/loss, val/iou (Intersection over Union), val/pixel_accuracy
- Target: IoU increasing (>0.5 is decent, >0.7 is good), loss decreasing
- Typical range: IoU 0-1, pixel accuracy 0-100%
- Variants: mIoU (mean IoU across classes)
# Monitor segmentation metrics
tail -f lightning_logs/version_0/metrics.csv | grep -i "iou\|pixel"
Fine-tuning / Transfer Learning
- Primary metrics: train/loss, eval/loss, task-specific metrics
- Target: Both losses decreasing, eval loss not diverging from train loss
- Check for: Catastrophic forgetting, adapter convergence
- Special: For LoRA/DoRA, verify adapters are saved
# Check if adapters are being saved
ls -la outputs/*/lora_adapters/
# Should contain: adapter_config.json, adapter_model.safetensors
5. Runtime Checks
GPU VRAM Usage Monitoring (REQUIRED):
Before, during, and after training, actively monitor GPU VRAM usage:
# In another terminal, watch GPU memory in real-time
watch -n 1 nvidia-smi
# Or get detailed memory breakdown
nvidia-smi --query-gpu=index,name,memory.used,memory.total,memory.free,utilization.gpu --format=csv,noheader,nounits
# Record peak VRAM usage during training
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk '{print $1 " MB"}'
Expected metrics:
- GPU memory usage: 60-95% of available VRAM (adjust batch size if 100% or <30%)
- GPU utilization: 70-100% during training steps
- Temperature: Stable (<85°C)
- Memory behavior: Should stabilize after model loading, spike during forward/backward passes
What to record for verification metadata:
- Peak VRAM usage in GB (e.g., "7.4GB VRAM" or "3.2GB VRAM")
- Percentage of total VRAM (e.g., "32%" for 7.4GB on 24GB GPU)
- GPU utilization percentage (e.g., "100% GPU utilization")
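A small helper can turn the nvidia-smi output into exactly these numbers; run it while training is at steady state to capture usage near its peak (a sketch, single GPU at index 0 assumed):
# Report current VRAM usage in GB and as a percentage of total capacity (GPU 0 assumed)
read -r USED_MB TOTAL_MB < <(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits -i 0 | tr -d ',')
awk -v u="$USED_MB" -v t="$TOTAL_MB" 'BEGIN { printf "%.1fGB VRAM (%.0f%% of %.0fGB total)\n", u/1024, 100*u/t, t/1024 }'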
Troubleshooting:
- CUDA OOM: Reduce BATCH_SIZE, MAX_SEQ_LEN, or model size
- Low GPU utilization (<50%): Check data loading bottlenecks, increase batch size
- Memory keeps growing: Possible memory leak, check gradient accumulation
Docker Container Health:
# List running containers
docker ps
# Check logs for errors
docker logs <container-name-or-id>
# Verify mounts
docker inspect <container-id> | grep -A 10 Mounts
# Should see: workspace, cvlization_repo, huggingface cache
Output Directory:
# Check outputs are being written
ls -la outputs/ logs/ lightning_logs/
# Expected: Checkpoints, logs, or saved models appearing
# For WandB integration
ls -la wandb/
# Expected: run-<timestamp>-<id> directories
6. Lazy Downloading & Caching Verification
Verify that datasets and pretrained weights are cached properly:
# Check CVlization dataset cache
ls -la ~/.cache/cvlization/data/
# Expected: Dataset archives and extracted folders
# Examples: coco_panoptic_tiny/, stanford_background/, etc.
# Check framework-specific caches
ls -la ~/.cache/torch/hub/checkpoints/ # PyTorch pretrained weights
ls -la ~/.cache/huggingface/ # HuggingFace models
# Verify no repeated downloads on second run
# First run: Should see "Downloading..." messages
./train.sh 2>&1 | tee first_run.log
# Clean workspace data (but keep cache)
rm -rf ./data/
# Second run: Should NOT download again, uses cache
./train.sh 2>&1 | tee second_run.log
# Verify no download messages in second run
grep -i "download" second_run.log
# Expected: Minimal or no download activity (weights already cached)
What to verify:
- Training data downloads to ~/.cache/cvlization/data/ (not ./data/)
- Pretrained weights cached by framework (PyTorch: ~/.cache/torch/, HuggingFace: ~/.cache/huggingface/)
- Second run reuses cached files without re-downloading
- Check train.py for the data_dir parameter passed to dataset builders (a quick check sketch follows this list)
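A quick grep plus a cache listing covers the data_dir check above (a sketch; the exact parameter name can differ between examples):
# Confirm train.py routes dataset downloads through the shared cache rather than ./data/
grep -n "data_dir\|cvlization/data" train.py
# Compare workspace vs. shared cache after a run -- new files should land in the cache
du -sh ./data/ 2>/dev/null || echo "no local ./data/ (good)"
du -sh ~/.cache/cvlization/data/* 2>/dev/null | head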
7. Quick Validation Test
For fast verification (useful during development):
# Run 1 epoch with limited data
MAX_TRAIN_SAMPLES=10 NUM_EPOCHS=1 ./train.sh
# Expected runtime: 1-5 minutes
# Verify: Completes without errors, metrics logged
8. Update Verification Metadata
After successful verification, update the example.yaml with verification metadata:
First, check GPU info:
# Get GPU model and VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
Format:
verification:
  last_verified: 2025-10-25
  last_verification_note: "Verified build, training initialization, lazy downloading, and metrics logging on [GPU_MODEL] ([VRAM]GB VRAM)"
What to include in the note:
- What was verified: build, training, metrics
- Key aspects: lazy downloading, caching, GPU utilization
- GPU info: Dynamically determine GPU model and VRAM using nvidia-smi (e.g., "A10 GPU (24GB VRAM)", "RTX 4090 (24GB)")
- If no GPU: Use "CPU-only"
- VRAM usage: Peak VRAM used during training (e.g., "GPU usage: 7.4GB VRAM (32%), 100% GPU utilization")
- Get with: nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
- Convert to GB and calculate percentage of total VRAM
- Training extent: e.g., "1 epoch quick test" or "Full 10 epoch training"
- Any limitations: e.g., "CUDA OOM on full batch size"
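The GPU descriptor in the note can be assembled directly from nvidia-smi output, for example (a sketch, single GPU at index 0 assumed):
# Build a GPU descriptor such as "NVIDIA A10 (24GB VRAM)" for the verification note
GPU_LINE=$(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits -i 0)
GPU_NAME=$(echo "$GPU_LINE" | cut -d',' -f1 | xargs)
TOTAL_MB=$(echo "$GPU_LINE" | cut -d',' -f2 | xargs)
echo "${GPU_NAME} ($((TOTAL_MB / 1024))GB VRAM)"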
Example complete entry:
name: pose-estimation-mmpose
docker: mmpose
capability: perception/pose_estimation
# ... other fields ...
verification:
  last_verified: 2025-10-25
  last_verification_note: "Verified build, CVL CLI integration, and lazy downloading to ~/.cache/cvlization/data/. Training not fully verified due to GPU memory constraints (CUDA OOM on shared GPU)."
When to update:
- After completing full verification checklist (steps 1-7)
- Partial verification is acceptable - note what was verified
- When re-verifying after CVlization updates or fixes
Common Issues and Fixes
Build Failures
# Issue: Dockerfile can't find files
# Fix: Check COPY paths are relative to Dockerfile location
# Issue: Dependency conflicts
# Fix: Check requirements.txt versions, update base image
# Issue: Large build context
# Fix: Add .dockerignore file
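For the large-build-context case, a starter .dockerignore like the one below usually helps; the entries are common candidates based on the directories this guide expects, so adjust per example.
# Write a starter .dockerignore next to the Dockerfile (entries are typical candidates)
cat > .dockerignore <<'EOF'
data/
outputs/
lightning_logs/
wandb/
*.log
__pycache__/
EOF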
Training Failures
# Issue: CUDA out of memory
# Fix: Reduce BATCH_SIZE, MAX_SEQ_LEN, or image size
# Issue: Dataset not found
# Fix: Check data/ directory exists, run data preparation script
# Issue: Permission denied on outputs
# Fix: Ensure output directories are created before docker run
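For the permission issue, pre-creating the mounted output directories on the host before starting the container is usually enough (directory names follow the ones used elsewhere in this guide):
# Create host-side output directories before docker run so the container can write to them
mkdir -p outputs/ logs/ lightning_logs/
./train.sh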
Metric Issues
# Issue: Loss is NaN
# Fix: Reduce learning rate, check data normalization, verify labels
# Issue: No metrics logged
# Fix: Check training script has logging configured (wandb/tensorboard)
# Issue: Loss not decreasing
# Fix: Verify learning rate, check data quality, increase epochs
Example Commands
Perception - Object Detection
cd examples/perception/object_detection/torchvision
./build.sh
./train.sh
# Monitor: train/loss, val/map, val/map_50
# Success: mAP > 0.3 after a few epochs
Perception - Semantic Segmentation
cd examples/perception/segmentation/semantic_torchvision
./build.sh
./train.sh
# Monitor: train/loss, val/iou, val/pixel_accuracy
# Success: IoU > 0.5, pixel_accuracy > 80%
Generative - LLM Training
cd examples/generative/llm/nanogpt
./build.sh
./train.sh
# Monitor: train/loss, val/loss, iter time
# Success: Loss decreasing from ~4.0 to <2.0
Document AI - Fine-tuning
cd examples/perception/doc_ai/granite_docling_finetune
./build.sh
MAX_TRAIN_SAMPLES=20 NUM_EPOCHS=1 ./train.sh
# Monitor: train/loss, eval/loss
# Success: Both losses decrease, adapters saved to outputs/
CVL Integration
These examples integrate with the CVL command system:
# List all available examples
cvl list
# Get example info
cvl info granite_docling_finetune
# Run example directly (uses example.yaml presets)
cvl run granite_docling_finetune build
cvl run granite_docling_finetune train
Success Criteria
A training pipeline passes verification when:
- ✅ Structure: All required files present, example.yaml valid
- ✅ Build: Docker image builds without errors (both ./build.sh and cvl run <name> build)
- ✅ Start: Training starts, dataset loads, model initializes (both ./train.sh and cvl run <name> train)
- ✅ Metrics Improve: Training loss decreases OR model accuracy/mAP/IoU improves over epochs
- ✅ Central Caching: Training data cached to ~/.cache/cvlization/data/ (NOT to local ./data/), pretrained weights cached to framework-specific locations (~/.cache/torch/, ~/.cache/huggingface/)
- ✅ Lazy Downloading: Datasets and pretrained weights download only when needed, avoiding repeated downloads on subsequent runs
- ✅ Outputs: Checkpoints/adapters/logs saved to outputs/
- ✅ CVL CLI: cvl info <name> shows correct metadata, build and train presets work
- ✅ Documentation: README explains how to use the example
- ✅ Verification Metadata: example.yaml updated with a verification field containing a last_verified date and last_verification_note
Related Files
Check these files for debugging:
- train.py - Core training logic
- Dockerfile - Environment setup
- requirements.txt - Python dependencies
- example.yaml - CVL metadata and presets
- README.md - Usage instructions
Tips
- Use MAX_TRAIN_SAMPLES=<small_number> for fast validation
- Monitor GPU memory with nvidia-smi in a separate terminal
- Check docker logs <container> if training hangs
- For WandB integration, set the WANDB_API_KEY environment variable
- Most examples support environment variable overrides (check train.sh)
GitHub Repository
https://github.com/kungfuai/CVlization
Related Skills
sglang
SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
evaluating-llms-harness
This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.
llamaguard
LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.
langchain
LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
