pytorch-lightning
About
PyTorch Lightning is a high-level framework that organizes PyTorch code to minimize boilerplate and provides a powerful Trainer class with automatic distributed training (DDP/FSDP/DeepSpeed). It features a callbacks system and built-in best practices, enabling code to scale from a laptop to a supercomputer without changes. Use it when you want clean, production-ready training loops with minimal setup.
Quick Install
Claude Code
Recommended:
/plugin add https://github.com/davila7/claude-code-templates
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/pytorch-lightning
Copy and paste this command in Claude Code to install this skill.
Documentation
PyTorch Lightning - High-Level Training Framework
Quick start
PyTorch Lightning organizes PyTorch code to eliminate boilerplate while maintaining flexibility.
Installation:
pip install lightning
Convert PyTorch to Lightning (3 steps):
import lightning as L
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

# Step 1: Define LightningModule (organize your PyTorch code)
class LitModel(L.LightningModule):
    def __init__(self, hidden_size=128):
        super().__init__()
        self.model = nn.Sequential(
            nn.Flatten(),                      # flatten 28x28 images to vectors
            nn.Linear(28 * 28, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 10)
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)  # Auto-logged to TensorBoard
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Step 2: Create data (train_dataset is any torch Dataset, e.g. MNIST)
train_loader = DataLoader(train_dataset, batch_size=32)

# Step 3: Train with Trainer (handles everything else!)
trainer = L.Trainer(max_epochs=10, accelerator='gpu', devices=2)
model = LitModel()
trainer.fit(model, train_loader)
That's it! Trainer handles:
- GPU/TPU/CPU switching
- Distributed training (DDP, FSDP, DeepSpeed)
- Mixed precision (FP16, BF16)
- Gradient accumulation
- Checkpointing
- Logging
- Progress bars
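Many of these features are single Trainer arguments. A minimal sketch (the values here are illustrative choices, not defaults):

import lightning as L

trainer = L.Trainer(
    accelerator='auto',            # CPU/GPU/TPU/MPS picked automatically
    devices='auto',
    precision='bf16-mixed',        # mixed precision ('16-mixed' on older GPUs)
    accumulate_grad_batches=4,     # gradient accumulation
    max_epochs=10,
    default_root_dir='lightning_logs'  # where checkpoints and logs go
)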
Common workflows
Workflow 1: From PyTorch to Lightning
Original PyTorch code:
model = MyModel()
optimizer = torch.optim.Adam(model.parameters())
model.to('cuda')

for epoch in range(max_epochs):
    for batch in train_loader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()
Lightning version:
class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = MyModel()

    def training_step(self, batch, batch_idx):
        loss = self.model(batch)  # No .to('cuda') needed!
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

# Train
trainer = L.Trainer(max_epochs=10, accelerator='gpu')
trainer.fit(LitModel(), train_loader)
Benefits: roughly 40 lines of loop code shrink to about 15, with no manual device management, and distributed training comes for free.
Workflow 2: Validation and testing
class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = MyModel()

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        val_loss = nn.functional.cross_entropy(y_hat, y)
        acc = (y_hat.argmax(dim=1) == y).float().mean()
        self.log('val_loss', val_loss)
        self.log('val_acc', acc)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        test_loss = nn.functional.cross_entropy(y_hat, y)
        self.log('test_loss', test_loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Train with validation
model = LitModel()
trainer = L.Trainer(max_epochs=10)
trainer.fit(model, train_loader, val_loader)

# Test
trainer.test(model, test_loader)
Automatic features:
- Validation runs at the end of every training epoch by default
- Logged metrics go to TensorBoard
- Checkpointing is on by default; add a ModelCheckpoint that monitors val_loss to keep the best models (see the sketch below)
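To save the best model by a validation metric, change how often validation runs, or test against the best checkpoint, these are standard Trainer and callback settings. A sketch (argument values are illustrative):

from lightning.pytorch.callbacks import ModelCheckpoint

# Keep the single best model by validation loss, validate twice per epoch
checkpoint = ModelCheckpoint(monitor='val_loss', mode='min', save_top_k=1)
trainer = L.Trainer(
    max_epochs=10,
    val_check_interval=0.5,   # run validation every half epoch
    callbacks=[checkpoint]
)
trainer.fit(model, train_loader, val_loader)

# Evaluate the best checkpoint rather than the last weights
trainer.test(model, test_loader, ckpt_path='best')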
Workflow 3: Distributed training (DDP)
# Same code as single GPU!
model = LitModel()

# 8 GPUs with DDP (automatic!)
trainer = L.Trainer(
    accelerator='gpu',
    devices=8,
    strategy='ddp'  # Or 'fsdp', 'deepspeed'
)
trainer.fit(model, train_loader)
Launch:
# Single command, Lightning handles the rest
python train.py
No changes needed:
- Automatic data distribution
- Gradient synchronization
- Multi-node support (just set num_nodes=2; see the sketch below)
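For multi-node training the Trainer code only gains num_nodes; how the job is launched depends on your cluster (Lightning detects SLURM and torchrun environments). A sketch assuming 2 nodes with 8 GPUs each:

trainer = L.Trainer(
    accelerator='gpu',
    devices=8,        # GPUs per node
    num_nodes=2,      # total GPUs = devices * num_nodes = 16
    strategy='ddp'
)
trainer.fit(model, train_loader)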
Workflow 4: Callbacks for monitoring
from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor

# Create callbacks
checkpoint = ModelCheckpoint(
    monitor='val_loss',
    mode='min',
    save_top_k=3,
    filename='model-{epoch:02d}-{val_loss:.2f}'
)
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,
    mode='min'
)
lr_monitor = LearningRateMonitor(logging_interval='epoch')

# Add to Trainer
trainer = L.Trainer(
    max_epochs=100,
    callbacks=[checkpoint, early_stop, lr_monitor]
)
trainer.fit(model, train_loader, val_loader)
Result:
- Auto-saves best 3 models
- Stops early if no improvement for 5 epochs
- Logs learning rate to TensorBoard
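You can also write your own callback by subclassing Callback and overriding its hooks. A minimal sketch that prints the logged training loss at the end of each epoch (the 'train_loss' key assumes you logged it as in the examples above):

from lightning.pytorch.callbacks import Callback

class PrintLossCallback(Callback):
    def on_train_epoch_end(self, trainer, pl_module):
        # callback_metrics holds everything logged via self.log(...)
        loss = trainer.callback_metrics.get('train_loss')
        if loss is not None:
            print(f"epoch {trainer.current_epoch}: train_loss={loss:.4f}")

trainer = L.Trainer(max_epochs=10, callbacks=[PrintLossCallback()])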
Workflow 5: Learning rate scheduling
class LitModel(L.LightningModule):
    # ... (training_step, etc.)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        # Cosine annealing
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=100,
            eta_min=1e-5
        )
        return {
            'optimizer': optimizer,
            'lr_scheduler': {
                'scheduler': scheduler,
                'interval': 'epoch',  # Update per epoch
                'frequency': 1
            }
        }

# Learning rate is logged automatically once a LearningRateMonitor callback is added
trainer = L.Trainer(max_epochs=100)
trainer.fit(model, train_loader)
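Schedulers that step every batch (for example OneCycleLR) use the same dictionary with interval='step'. A sketch; total_steps is an assumed value you would compute from your dataloader length and epoch count:

class LitModel(L.LightningModule):
    # ... (training_step, etc.)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=1e-3, total_steps=10_000  # total_steps is illustrative
        )
        return {
            'optimizer': optimizer,
            'lr_scheduler': {'scheduler': scheduler, 'interval': 'step'}
        }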
When to use vs alternatives
Use PyTorch Lightning when:
- Want clean, organized code
- Need production-ready training loops
- Switching between single GPU, multi-GPU, TPU
- Want built-in callbacks and logging
- Team collaboration (standardized structure)
Key advantages:
- Organized: Separates research code from engineering
- Automatic: DDP, FSDP, DeepSpeed with 1 line
- Callbacks: Modular training extensions
- Reproducible: Less boilerplate = fewer bugs
- Battle-tested: 1M+ downloads per month
Use alternatives instead:
- Accelerate: Minimal changes to existing code, more flexibility
- Ray Train: Multi-node orchestration, hyperparameter tuning
- Raw PyTorch: Maximum control, learning purposes
- Keras: TensorFlow ecosystem
Common issues
Issue: Loss not decreasing
Check data and model setup:
# Add to training_step to sanity-check the first batch
def training_step(self, batch, batch_idx):
    if batch_idx == 0:
        print(f"Batch shape: {batch[0].shape}")
        print(f"Labels: {batch[1]}")
    loss = ...
    return loss
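Lightning's built-in debugging flags also help here: fast_dev_run wires a single batch through train and validation, and overfit_batches checks whether the model can drive the loss down on a tiny subset. A sketch:

# Quick sanity check: one train + one val batch, no checkpoints or loggers
trainer = L.Trainer(fast_dev_run=True)
trainer.fit(model, train_loader, val_loader)

# Can the model overfit 10 batches? If not, suspect the data or the loss.
trainer = L.Trainer(overfit_batches=10, max_epochs=50)
trainer.fit(model, train_loader)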
Issue: Out of memory
Reduce batch size or use gradient accumulation:
trainer = L.Trainer(
    accumulate_grad_batches=4,   # Effective batch = batch_size × 4
    precision='bf16-mixed'       # Or '16-mixed'; roughly halves activation memory
)
Issue: Validation not running
Ensure you pass val_loader:
# WRONG
trainer.fit(model, train_loader)
# CORRECT
trainer.fit(model, train_loader, val_loader)
Issue: DDP spawns multiple processes unexpectedly
Lightning auto-detects GPUs. Explicitly set devices:
# Test on CPU first
trainer = L.Trainer(accelerator='cpu', devices=1)
# Then GPU
trainer = L.Trainer(accelerator='gpu', devices=1)
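With strategies that spawn worker processes (for example ddp_spawn, or on platforms where Python uses the spawn start method), module-level code is re-executed in each worker, so keep the entry point under a main guard:

if __name__ == "__main__":
    # Build data and model inside the guard so workers don't re-run it at import time
    train_loader = DataLoader(train_dataset, batch_size=32)
    model = LitModel()
    trainer = L.Trainer(accelerator='gpu', devices=2, strategy='ddp')
    trainer.fit(model, train_loader)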
Advanced topics
Callbacks: See references/callbacks.md for EarlyStopping, ModelCheckpoint, custom callbacks, and callback hooks.
Distributed strategies: See references/distributed.md for DDP, FSDP, DeepSpeed ZeRO integration, multi-node setup.
Hyperparameter tuning: See references/hyperparameter-tuning.md for integration with Optuna, Ray Tune, and WandB sweeps.
Hardware requirements
- CPU: Works (good for debugging)
- Single GPU: Works
- Multi-GPU: DDP (default), FSDP, or DeepSpeed
- Multi-node: DDP, FSDP, DeepSpeed
- TPU: Supported (8 cores)
- Apple MPS: Supported
Precision options:
- FP32 (default)
- FP16 (V100, older GPUs)
- BF16 (A100/H100, recommended)
- FP8 (H100)
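In Lightning 2.x these are selected with the Trainer's precision argument; common string values are shown below (FP8 additionally requires NVIDIA Transformer Engine support):

trainer = L.Trainer(precision='32-true')     # FP32 (default)
trainer = L.Trainer(precision='16-mixed')    # FP16 mixed precision (V100-era GPUs)
trainer = L.Trainer(precision='bf16-mixed')  # BF16 mixed precision (A100/H100)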
Resources
- Docs: https://lightning.ai/docs/pytorch/stable/
- GitHub: https://github.com/Lightning-AI/pytorch-lightning ⭐ 29,000+
- Version: 2.5.5+
- Examples: https://github.com/Lightning-AI/pytorch-lightning/tree/master/examples
- Discord: https://discord.gg/lightning-ai
- Used by: Kaggle winners, research labs, production teams
Related Skills
axolotl
Design: This skill provides expert guidance for fine-tuning LLMs using the Axolotl framework, helping developers configure YAML files and implement advanced techniques like LoRA/QLoRA and DPO/KTO. Use it when working with Axolotl features, debugging code, or learning best practices for fine-tuning across 100+ models. It offers comprehensive assistance including multimodal support and performance optimization.
training-llms-megatron
Design: This skill trains massive language models (2B-462B parameters) using NVIDIA's Megatron-Core framework for maximum GPU efficiency. Use it when training models over 1B parameters, requiring advanced parallelism strategies like tensor or pipeline, or needing production-ready performance. It's a proven framework used for models like Nemotron and LLaMA.
huggingface-accelerate
Development: HuggingFace Accelerate provides a unified API for adding distributed training support to PyTorch scripts with just 4 lines of code. It seamlessly integrates with DeepSpeed, FSDP, Megatron, and DDP while handling automatic device placement and mixed precision. Use this skill when you need to scale PyTorch training across multiple GPUs or nodes with minimal code changes.
pytorch-fsdp
Design: This skill provides expert guidance for implementing Fully Sharded Data Parallel (FSDP) training in PyTorch. Use it when working with parameter sharding, mixed precision, CPU offloading, or FSDP2 features during distributed training development. It helps with implementation, debugging, and best practices for large-scale model training.
