pytorch-lightning
정보
이 Claude Skill은 PyTorch Lightning을 통해 확장 가능한 딥러닝을 구현하며, 코드를 LightningModule로 구성하고 GPU/TPU 분산 학습을 위한 Trainer를 설정합니다. 데이터 파이프라인, 콜백, 로깅 통합(W&B, TensorBoard, MLflow)을 처리하면서 DDP, FSDP, DeepSpeed 같은 고급 기법을 지원합니다. 최소한의 보일러플레이트 코드로 신경망 학습 워크플로우를 간소화하고 확장하는 데 사용하세요.
빠른 설치
Claude Code
추천npx skills add K-Dense-AI/claude-scientific-skills -a claude-code/plugin add https://github.com/K-Dense-AI/claude-scientific-skillsgit clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/pytorch-lightningClaude Code에서 이 명령을 복사하여 붙여넣어 스킬을 설치하세요
문서
PyTorch Lightning
Overview
PyTorch Lightning is a deep learning framework that organizes PyTorch code to eliminate boilerplate while maintaining full flexibility. Automate training workflows, multi-device orchestration, and implement best practices for neural network training and scaling across multiple GPUs/TPUs.
Current upstream: lightning 2.6.4 (PyPI, May 2026). Docs: lightning.ai/docs/pytorch/stable. Use import lightning as L (the pytorch-lightning package name still installs the same library).
Installation
uv pip install lightning
Optional extras:
uv pip install lightning[extra] # loggers, strategies, etc.
uv pip install wandb mlflow # specific loggers as needed
When to Use This Skill
This skill should be used when:
- Building, training, or deploying neural networks using PyTorch Lightning
- Organizing PyTorch code into LightningModules
- Configuring Trainers for multi-GPU/TPU training
- Implementing data pipelines with LightningDataModules
- Working with callbacks, logging, and distributed training strategies (DDP, FSDP, DeepSpeed)
- Structuring deep learning projects professionally
Core Capabilities
1. LightningModule - Model Definition
Organize PyTorch models into six logical sections:
- Initialization -
__init__()andsetup() - Training Loop -
training_step(batch, batch_idx) - Validation Loop -
validation_step(batch, batch_idx) - Test Loop -
test_step(batch, batch_idx) - Prediction -
predict_step(batch, batch_idx) - Optimizer Configuration -
configure_optimizers()
Quick template reference: See scripts/template_lightning_module.py for a complete boilerplate.
Detailed documentation: Read references/lightning_module.md for comprehensive method documentation, hooks, properties, and best practices.
2. Trainer - Training Automation
The Trainer automates the training loop, device management, gradient operations, and callbacks. Key features:
- Multi-GPU/TPU support with strategy selection (DDP, FSDP, DeepSpeed)
- Automatic mixed precision training
- Gradient accumulation and clipping
- Checkpointing and early stopping
- Progress bars and logging
Quick setup reference: See scripts/quick_trainer_setup.py for common Trainer configurations.
Detailed documentation: Read references/trainer.md for all parameters, methods, and configuration options.
3. LightningDataModule - Data Pipeline Organization
Encapsulate all data processing steps in a reusable class:
prepare_data()- Download and process data (single-process)setup()- Create datasets and apply transforms (per-GPU)train_dataloader()- Return training DataLoaderval_dataloader()- Return validation DataLoadertest_dataloader()- Return test DataLoader
Quick template reference: See scripts/template_datamodule.py for a complete boilerplate.
Detailed documentation: Read references/data_module.md for method details and usage patterns.
4. Callbacks - Extensible Training Logic
Add custom functionality at specific training hooks without modifying your LightningModule. Built-in callbacks include:
- ModelCheckpoint - Save best/latest models
- EarlyStopping - Stop when metrics plateau
- LearningRateMonitor - Track LR scheduler changes
- BatchSizeFinder - Auto-determine optimal batch size
Detailed documentation: Read references/callbacks.md for built-in callbacks and custom callback creation.
5. Logging - Experiment Tracking
Integrate with multiple logging platforms:
- TensorBoard (default)
- Weights & Biases (WandbLogger)
- MLflow (MLFlowLogger)
- Comet (CometLogger)
- CSV (CSVLogger)
Note: NeptuneLogger was removed in lightning 2.6.4. Use W&B, MLflow, or TensorBoard instead.
Log metrics using self.log("metric_name", value) in any LightningModule method.
Detailed documentation: Read references/logging.md for logger setup and configuration.
6. Distributed Training - Scale to Multiple Devices
Choose the right strategy based on model size:
- DDP - For models <500M parameters (ResNet, smaller transformers)
- FSDP - For models 500M+ parameters (large transformers, recommended for Lightning users)
- DeepSpeed - For cutting-edge features and fine-grained control
Configure with: Trainer(strategy="ddp", accelerator="gpu", devices=4)
Detailed documentation: Read references/distributed_training.md for strategy comparison and configuration.
7. Best Practices
- Device agnostic code - Use
self.deviceinstead of.cuda() - Hyperparameter saving - Use
self.save_hyperparameters()in__init__() - Metric logging - Use
self.log()for automatic aggregation across devices - Reproducibility - Use
seed_everything()andTrainer(deterministic=True) - Debugging - Use
Trainer(fast_dev_run=True)to test with 1 batch
Detailed documentation: Read references/best_practices.md for common patterns and pitfalls.
Quick Workflow
-
Define model:
class MyModel(L.LightningModule): def __init__(self): super().__init__() self.save_hyperparameters() self.model = YourNetwork() def training_step(self, batch, batch_idx): x, y = batch loss = F.cross_entropy(self.model(x), y) self.log("train_loss", loss) return loss def configure_optimizers(self): return torch.optim.Adam(self.parameters()) -
Prepare data:
# Option 1: Direct DataLoaders train_loader = DataLoader(train_dataset, batch_size=32) # Option 2: LightningDataModule (recommended for reusability) dm = MyDataModule(batch_size=32) -
Train:
trainer = L.Trainer(max_epochs=10, accelerator="gpu", devices=2) trainer.fit(model, train_loader) # or trainer.fit(model, datamodule=dm)
Resources
scripts/
Executable Python templates for common PyTorch Lightning patterns:
template_lightning_module.py- Complete LightningModule boilerplatetemplate_datamodule.py- Complete LightningDataModule boilerplatequick_trainer_setup.py- Common Trainer configuration examples
references/
Detailed documentation for each PyTorch Lightning component:
lightning_module.md- Comprehensive LightningModule guide (methods, hooks, properties)trainer.md- Trainer configuration and parametersdata_module.md- LightningDataModule patterns and methodscallbacks.md- Built-in and custom callbackslogging.md- Logger integrations and usagedistributed_training.md- DDP, FSDP, DeepSpeed comparison and setupbest_practices.md- Common patterns, tips, and pitfalls
GitHub 저장소
연관 스킬
qmd
개발qmd는 BM25, 벡터 임베딩, 재순위화를 결합한 하이브리드 검색을 통해 로컬 파일을 색인화하고 검색할 수 있는 로컬 검색 및 색인화 CLI 도구입니다. 명령줄 사용과 Claude 통합을 위한 MCP(Model Context Protocol) 모드를 모두 지원합니다. 이 도구는 임베딩에 Ollama를 사용하고 색인을 로컬에 저장하여 터미널에서 직접 문서나 코드베이스를 검색하는 데 이상적입니다.
subagent-driven-development
개발이 스킬은 각 독립적인 작업마다 새로운 하위 에이전트를 배치하고 작업 사이에 코드 리뷰를 진행하여 구현 계획을 실행합니다. 이 리뷰 프로세스를 통해 품질 게이트를 유지하면서 빠른 반복 작업을 가능하게 합니다. 동일한 세션 내에서 대부분 독립적인 작업을 진행할 때 내장된 품질 검증과 함께 지속적인 진행을 보장하기 위해 사용하세요.
mcporter
개발mcporter 스킬은 개발자가 Claude에서 직접 Model Context Protocol(MCP) 서버를 관리하고 호출할 수 있도록 합니다. 이 스킬은 사용 가능한 서버를 나열하고, 인수를 사용해 해당 서버의 도구를 호출하며, 인증 및 데몬 생명주기를 처리하는 명령어를 제공합니다. 개발 워크플로우에서 MCP 서버 기능을 통합하고 테스트할 때 이 스킬을 사용하세요.
adk-deployment-specialist
개발이 스킬은 A2A 프로토콜을 사용하여 Vertex AI ADK 에이전트를 배포하고 오케스트레이션하며, AgentCard 검색, 작업 제출, 코드 실행 샌드박스 및 메모리 뱅크와 같은 지원 도구를 관리합니다. Python, Java 또는 Go 언어로 순차, 병렬 또는 루프 오케스트레이션 패턴을 갖춘 다중 에이전트 시스템 구축을 가능하게 합니다. Google Cloud에서 ADK 에이전트 배포 또는 에이전트 워크플로우 오케스트레이션을 요청받았을 때 사용하세요.
