
implementing-llms-litgpt

davila7
Tags: Other, Model Architecture, LitGPT, Lightning AI, LLM Implementation, LoRA, QLoRA, Fine-Tuning, Llama, Gemma, Phi, Mistral, Educational

About

This Claude Skill implements and trains LLMs using Lightning AI's LitGPT framework, which provides 20+ clean, single-file model implementations such as Llama and Mistral. It's ideal for developers who want to study model architectures or run production fine-tuning with LoRA/QLoRA. The skill keeps abstraction layers minimal for straightforward model loading, training, and text generation workflows.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/davila7/claude-code-templates

Git Clone (Alternative)
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/implementing-llms-litgpt

Copy and paste this command in Claude Code to install this skill

Documentation

LitGPT - Clean LLM Implementations

Quick start

LitGPT provides 20+ pretrained LLM implementations with clean, readable code and production-ready training workflows.

Installation:

pip install 'litgpt[extra]'

Load and use any model:

from litgpt import LLM

# Load pretrained model
llm = LLM.load("microsoft/phi-2")

# Generate text
result = llm.generate(
    "What is the capital of France?",
    max_new_tokens=50,
    temperature=0.7
)
print(result)

List available models:

litgpt download list

Common workflows

Workflow 1: Fine-tune on custom dataset

Copy this checklist:

Fine-Tuning Setup:
- [ ] Step 1: Download pretrained model
- [ ] Step 2: Prepare dataset
- [ ] Step 3: Configure training
- [ ] Step 4: Run fine-tuning

Step 1: Download pretrained model

# Download Llama 3 8B
litgpt download meta-llama/Meta-Llama-3-8B

# Download Phi-2 (smaller, faster)
litgpt download microsoft/phi-2

# Download Gemma 2B
litgpt download google/gemma-2b

Models are saved to the checkpoints/ directory.

Step 2: Prepare dataset

LitGPT supports multiple formats:

Alpaca format (instruction-response):

[
  {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris."
  },
  {
    "instruction": "Translate to Spanish: Hello, how are you?",
    "input": "",
    "output": "Hola, ¿cómo estás?"
  }
]

Save as data/my_dataset.json.
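
If you assemble the dataset in Python, here is a minimal sketch for writing it in the expected format (the example records are placeholders):

import json
from pathlib import Path

# Placeholder examples in Alpaca format (instruction / input / output)
examples = [
    {
        "instruction": "What is the capital of France?",
        "input": "",
        "output": "The capital of France is Paris.",
    },
]

Path("data").mkdir(exist_ok=True)
with open("data/my_dataset.json", "w", encoding="utf-8") as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)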

Step 3: Configure training

# Full fine-tuning (requires 40GB+ GPU for 7-8B models)
litgpt finetune \
  meta-llama/Meta-Llama-3-8B \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --train.max_steps 1000 \
  --train.learning_rate 2e-5 \
  --train.micro_batch_size 1 \
  --train.global_batch_size 16

# LoRA fine-tuning (efficient, 16GB GPU)
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --train.max_steps 1000 \
  --train.learning_rate 1e-4

Step 4: Run fine-tuning

Training saves checkpoints to out/finetune/ automatically.

Monitor training:

# View logs
tail -f out/finetune/logs.txt

# TensorBoard (if using --train.logger_name tensorboard)
tensorboard --logdir out/finetune/lightning_logs

Workflow 2: LoRA fine-tuning on single GPU

Most memory-efficient option.

LoRA Training:
- [ ] Step 1: Choose base model
- [ ] Step 2: Configure LoRA parameters
- [ ] Step 3: Train with LoRA
- [ ] Step 4: Merge LoRA weights (optional)

Step 1: Choose base model

For limited GPU memory (12-16GB), good options include (a quick VRAM check follows this list):

  • Phi-2 (2.7B) - Best quality/size tradeoff
  • Llama 3.2 1B - Smallest, fastest
  • Gemma 2B - Good reasoning
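
Before committing to a model, a quick VRAM check helps; this is a minimal sketch assuming PyTorch and a CUDA GPU, with thresholds taken from the rough guidance on this page rather than hard limits:

import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU memory: {total_gb:.0f} GB")
    if total_gb < 16:
        print("Prefer Phi-2 (2.7B) or a 1-2B model for LoRA fine-tuning.")
    else:
        print("LoRA fine-tuning of 7-8B models should fit.")
else:
    print("No CUDA GPU detected; experiment with a small model on CPU or MPS.")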

Step 2: Configure LoRA parameters

litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_query true \
  --lora_key false \
  --lora_value true \
  --lora_projection true \
  --lora_mlp false \
  --lora_head false

Parameter notes:

  • lora_r: LoRA rank (8-64; higher = more capacity)
  • lora_alpha: LoRA scaling factor, typically 2×r
  • lora_dropout: helps prevent overfitting
  • lora_query / lora_value / lora_projection: apply LoRA to the query, value, and output projections
  • lora_key / lora_mlp / lora_head: usually not needed

LoRA rank guide (an adapter-size sketch follows this list):

  • r=8: Lightweight, 2-4MB adapters
  • r=16: Standard, good quality
  • r=32: High capacity, use for complex tasks
  • r=64: Maximum quality, 4× larger adapters
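
To see why adapter size scales linearly with rank, here is a back-of-the-envelope sketch; the Phi-2-like dimensions (32 layers, hidden size 2560) and the choice of two adapted projections per layer are illustrative assumptions:

def lora_params(r: int, d_model: int, n_layers: int, adapted_per_layer: int = 2) -> int:
    """Each adapted d_model x d_model linear adds A (r x d_model) and B (d_model x r)."""
    return n_layers * adapted_per_layer * 2 * r * d_model

d_model, n_layers = 2560, 32  # assumed Phi-2-like dimensions
for r in (8, 16, 32, 64):
    n = lora_params(r, d_model, n_layers)
    print(f"r={r:>2}: {n / 1e6:4.1f}M params ≈ {n * 2 / 1e6:3.0f} MB in bf16")

Actual adapter files vary with which projections you adapt and the checkpoint precision, but the 4× jump from r=16 to r=64 holds.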

Step 3: Train with LoRA

litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --train.epochs 3 \
  --train.learning_rate 1e-4 \
  --train.micro_batch_size 4 \
  --train.global_batch_size 32 \
  --out_dir out/phi2-lora

# Memory usage: ~8-12GB for Phi-2 with LoRA

Step 4: Merge LoRA weights (optional)

Merge LoRA adapters into base model for deployment:

litgpt merge_lora \
  out/phi2-lora/final \
  --out_dir out/phi2-merged

Now use the merged model:

from litgpt import LLM
llm = LLM.load("out/phi2-merged")

Workflow 3: Pretrain from scratch

Train a new model on your domain data.

Pretraining:
- [ ] Step 1: Prepare pretraining dataset
- [ ] Step 2: Configure model architecture
- [ ] Step 3: Set up multi-GPU training
- [ ] Step 4: Launch pretraining

Step 1: Prepare pretraining dataset

LitGPT expects tokenized data. Use prepare_dataset.py:

python scripts/prepare_dataset.py \
  --source_path data/my_corpus.txt \
  --checkpoint_dir checkpoints/tokenizer \
  --destination_path data/pretrain \
  --split train,val

Step 2: Configure model architecture

Edit a config file or use an existing one:

# config/pythia-160m.yaml
model_name: pythia-160m
block_size: 2048
vocab_size: 50304
n_layer: 12
n_head: 12
n_embd: 768
rotary_percentage: 0.25
parallel_residual: true
bias: true
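
As a sanity check, the parameter count implied by this config can be estimated directly; the sketch below ignores biases, layer norms, and rotary details, and assumes an untied output head as in Pythia:

n_layer, n_embd, vocab_size = 12, 768, 50304

transformer = 12 * n_layer * n_embd ** 2   # per block: ~4*d^2 attention + ~8*d^2 MLP
embeddings = vocab_size * n_embd           # token embedding table
lm_head = vocab_size * n_embd              # untied output projection

print(f"~{(transformer + embeddings + lm_head) / 1e6:.0f}M parameters")  # ~162M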

Step 3: Set up multi-GPU training

# Single GPU
litgpt pretrain \
  --config config/pythia-160m.yaml \
  --data.data_dir data/pretrain \
  --train.max_tokens 10_000_000_000

# Multi-GPU with FSDP
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir data/pretrain \
  --devices 8 \
  --train.max_tokens 100_000_000_000

Step 4: Launch pretraining

For large-scale pretraining on a cluster:

# Using SLURM
sbatch --nodes=8 --gpus-per-node=8 \
  pretrain_script.sh

# pretrain_script.sh content:
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir /shared/data/pretrain \
  --devices 8 \
  --num_nodes 8 \
  --train.global_batch_size 512 \
  --train.max_tokens 300_000_000_000

Workflow 4: Convert and deploy model

Export LitGPT models for production.

Model Deployment:
- [ ] Step 1: Test inference locally
- [ ] Step 2: Quantize model (optional)
- [ ] Step 3: Convert to GGUF (for llama.cpp)
- [ ] Step 4: Deploy with API

Step 1: Test inference locally

from litgpt import LLM

llm = LLM.load("out/phi2-lora/final")

# Single generation
print(llm.generate("What is machine learning?"))

# Streaming
for token in llm.generate("Explain quantum computing", stream=True):
    print(token, end="", flush=True)

# Batch inference
prompts = ["Hello", "Goodbye", "Thank you"]
results = [llm.generate(p) for p in prompts]

Step 2: Quantize model (optional)

Reduce model size with minimal quality loss:

# 8-bit quantization (~50% size reduction)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --dtype bfloat16 \
  --quantize bnb.int8

# 4-bit quantization (~75% size reduction)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --quantize bnb.nf4-dq  # NF4 with double quantization
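
The size-reduction figures follow from simple per-weight arithmetic; a quick illustrative sketch (weights only, ignoring quantization metadata and any layers kept in higher precision):

# Approximate weight storage for a 2.7B-parameter model such as Phi-2
params = 2.7e9

for name, bytes_per_param in [("fp16/bf16", 2.0), ("int8", 1.0), ("nf4 (4-bit)", 0.5)]:
    print(f"{name:12s}: {params * bytes_per_param / 1e9:.1f} GB")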

Step 3: Convert to GGUF (for llama.cpp)

The conversion goes through Hugging Face format first; llama.cpp then produces the GGUF file (exact script names and flags depend on your LitGPT and llama.cpp versions):

# 1) Export the LitGPT checkpoint to Hugging Face-compatible weights
litgpt convert_from_litgpt out/phi2-lora/final out/phi2-hf

# 2) Build the GGUF file with llama.cpp's converter
python convert_hf_to_gguf.py out/phi2-hf --outfile models/phi2.gguf

Step 4: Deploy with API

from fastapi import FastAPI
from litgpt import LLM

app = FastAPI()
llm = LLM.load("out/phi2-lora/final")

@app.post("/generate")
def generate(prompt: str, max_tokens: int = 100):
    result = llm.generate(
        prompt,
        max_new_tokens=max_tokens,
        temperature=0.7
    )
    return {"response": result}

# Run: uvicorn api:app --host 0.0.0.0 --port 8000
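
A hypothetical client call for the endpoint above (the server is assumed to be running locally on port 8000; FastAPI exposes the simple-typed handler arguments as query parameters):

import requests

resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "What is machine learning?", "max_tokens": 64},
)
resp.raise_for_status()
print(resp.json()["response"])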

When to use vs alternatives

Use LitGPT when:

  • You want to understand LLM architectures (clean, readable code)
  • You need production-ready training recipes
  • You're working on educational or research projects
  • You're prototyping new model ideas
  • You're already in the Lightning ecosystem

Use alternatives instead:

  • Axolotl/TRL: More fine-tuning features, YAML configs
  • Megatron-Core: Maximum performance for >70B models
  • HuggingFace Transformers: Broadest model support
  • vLLM: Inference-only (no training)

Common issues

Issue: Out of memory during fine-tuning

Use LoRA instead of full fine-tuning:

# Instead of litgpt finetune (requires 40GB+)
litgpt finetune_lora  # Only needs 12-16GB

Or lower the micro-batch size; LitGPT accumulates gradients to reach the global batch size:

litgpt finetune_lora \
  ... \
  --train.micro_batch_size 1 \
  --train.global_batch_size 16

Issue: Training too slow

Enable Flash Attention (built-in, automatic on compatible hardware):

# Already enabled by default on Ampere+ GPUs (A100, RTX 30/40 series)
# No configuration needed

Use a smaller micro-batch and let LitGPT accumulate gradients (the accumulation steps are derived from the global and micro batch sizes):

--train.micro_batch_size 1 \
--train.global_batch_size 32
# Effective batch = 32, accumulated over 32 micro-steps per device

Issue: Model not loading

Check model name:

# List all available models
litgpt download list

# Download if not exists
litgpt download meta-llama/Meta-Llama-3-8B

Verify checkpoints directory:

ls checkpoints/
# Should see: meta-llama/Meta-Llama-3-8B/

Issue: LoRA adapters too large

Reduce LoRA rank:

--lora_r 8  # Instead of 16 or 32

Apply LoRA to fewer layers:

# Keep LoRA on query/value only; disable the projection and MLP adapters
--lora_query true \
--lora_value true \
--lora_projection false \
--lora_mlp false

Advanced topics

Supported architectures: See references/supported-models.md for complete list of 20+ model families with sizes and capabilities.

Training recipes: See references/training-recipes.md for proven hyperparameter configurations for pretraining and fine-tuning.

FSDP configuration: See references/distributed-training.md for multi-GPU training with Fully Sharded Data Parallel.

Custom architectures: See references/custom-models.md for implementing new model architectures in LitGPT style.

Hardware requirements

  • GPU: NVIDIA (CUDA 11.8+), AMD (ROCm), Apple Silicon (MPS)
  • Memory (a rough estimator follows this list):
    • Inference (Phi-2): 6GB
    • LoRA fine-tuning (7B): 16GB
    • Full fine-tuning (7B): 40GB+
    • Pretraining (1B): 24GB
  • Storage: 5-50GB per model (depending on size)
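
A weights-only back-of-the-envelope calculation (parameters × bytes per parameter) explains most of the figures above; activations, optimizer state, and the KV cache add on top. A minimal sketch, assuming bf16 = 2 bytes and nf4 ≈ 0.5 bytes per parameter:

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

print(f"Phi-2 (2.7B) in bf16: {weights_gb(2.7, 2):.1f} GB")   # inference floor
print(f"Llama 3 8B in bf16:   {weights_gb(8, 2):.1f} GB")     # why LoRA on a 16GB card is tight
print(f"Llama 3 8B in nf4:    {weights_gb(8, 0.5):.1f} GB")   # QLoRA leaves room for training state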

Resources

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/model-architecture-litgpt
Tags: anthropic, anthropic-claude, claude, claude-code

Related Skills

quantizing-models-bitsandbytes

Other

This skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, reducing memory usage by 50-75% with minimal accuracy loss for GPU-constrained environments. It supports multiple formats (INT8, NF4, FP4) and enables QLoRA training and 8-bit optimizers. Use it with HuggingFace Transformers when you need to fit larger models into limited memory or accelerate inference.

View skill

axolotl

Design

This skill provides expert guidance for fine-tuning LLMs using the Axolotl framework, helping developers configure YAML files and implement advanced techniques like LoRA/QLoRA and DPO/KTO. Use it when working with Axolotl features, debugging code, or learning best practices for fine-tuning across 100+ models. It offers comprehensive assistance including multimodal support and performance optimization.

View skill

unsloth

Design

This skill provides expert guidance for fast fine-tuning with Unsloth, offering 2-5x faster training and 50-80% memory reduction. It helps developers implement and debug LoRA/QLoRA optimizations for models like Llama and Mistral. Use it when working with Unsloth's APIs, features, or best practices for efficient model training.

View skill

peft-fine-tuning

Other

This skill enables parameter-efficient fine-tuning of large language models using LoRA, QLoRA, and other adapter methods, drastically reducing GPU memory requirements. It's ideal for fine-tuning 7B-70B models on consumer hardware by training less than 1% of parameters while maintaining accuracy. The integration with Hugging Face's ecosystem supports multi-adapter serving and rapid iteration with task-specific adapters.

View skill