
implementing-llms-litgpt

zechenzhangAGI
Updated Yesterday

About

This Claude Skill implements and trains LLMs using Lightning AI's LitGPT framework, featuring 20+ pretrained architectures like Llama and Mistral. It provides clean, single-file implementations without abstraction layers, ideal for educational understanding or production fine-tuning with LoRA/QLoRA. Use it when you need readable model code or efficient fine-tuning workflows.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/zechenzhangAGI/AI-research-SKILLs

Git Clone (Alternative)
git clone https://github.com/zechenzhangAGI/AI-research-SKILLs.git ~/.claude/skills/implementing-llms-litgpt

Copy and paste one of these commands into Claude Code to install this skill.

Documentation

LitGPT - Clean LLM Implementations

Quick start

LitGPT provides 20+ pretrained LLM implementations with clean, readable code and production-ready training workflows.

Installation:

pip install 'litgpt[extra]'

Load and use any model:

from litgpt import LLM

# Load pretrained model
llm = LLM.load("microsoft/phi-2")

# Generate text
result = llm.generate(
    "What is the capital of France?",
    max_new_tokens=50,
    temperature=0.7
)
print(result)

List available models:

litgpt download list

Common workflows

Workflow 1: Fine-tune on custom dataset

Copy this checklist:

Fine-Tuning Setup:
- [ ] Step 1: Download pretrained model
- [ ] Step 2: Prepare dataset
- [ ] Step 3: Configure training
- [ ] Step 4: Run fine-tuning

Step 1: Download pretrained model

# Download Llama 3 8B
litgpt download meta-llama/Meta-Llama-3-8B

# Download Phi-2 (smaller, faster)
litgpt download microsoft/phi-2

# Download Gemma 2B
litgpt download google/gemma-2b

Models are saved to the checkpoints/ directory.

Step 2: Prepare dataset

LitGPT supports multiple formats:

Alpaca format (instruction-response):

[
  {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris."
  },
  {
    "instruction": "Translate to Spanish: Hello, how are you?",
    "input": "",
    "output": "Hola, ¿cómo estás?"
  }
]

Save as data/my_dataset.json.
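
For example, the dataset above can be written out with a few lines of Python (a minimal sketch; the path matches the fine-tuning commands below):

import json
from pathlib import Path

# Minimal Alpaca-format dataset matching the example above.
examples = [
    {
        "instruction": "What is the capital of France?",
        "input": "",
        "output": "The capital of France is Paris.",
    },
    {
        "instruction": "Translate to Spanish: Hello, how are you?",
        "input": "",
        "output": "Hola, ¿cómo estás?",
    },
]

Path("data").mkdir(exist_ok=True)
with open("data/my_dataset.json", "w", encoding="utf-8") as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)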

Step 3: Configure training

# Full fine-tuning (requires 40GB+ GPU for 7B models)
litgpt finetune \
  meta-llama/Meta-Llama-3-8B \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --train.max_steps 1000 \
  --train.learning_rate 2e-5 \
  --train.micro_batch_size 1 \
  --train.global_batch_size 16

# LoRA fine-tuning (efficient, 16GB GPU)
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --train.max_steps 1000 \
  --train.learning_rate 1e-4

Step 4: Run fine-tuning

Training saves checkpoints to out/finetune/ automatically.

Monitor training:

# View logs
tail -f out/finetune/logs.txt

# TensorBoard (if using --train.logger_name tensorboard)
tensorboard --logdir out/finetune/lightning_logs

Workflow 2: LoRA fine-tuning on single GPU

This is the most memory-efficient fine-tuning option.

LoRA Training:
- [ ] Step 1: Choose base model
- [ ] Step 2: Configure LoRA parameters
- [ ] Step 3: Train with LoRA
- [ ] Step 4: Merge LoRA weights (optional)

Step 1: Choose base model

For limited GPU memory (12-16GB):

  • Phi-2 (2.7B) - Best quality/size tradeoff
  • Llama 3.2 1B - Smallest, fastest
  • Gemma 2B - Good reasoning

Step 2: Configure LoRA parameters

litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_query true \
  --lora_key false \
  --lora_value true \
  --lora_projection true \
  --lora_mlp false \
  --lora_head false

Parameter notes: --lora_alpha is the scaling factor (typically 2×r) and --lora_dropout guards against overfitting. LoRA is applied here to the query, value, and output projections; the key projection, MLP, and head usually don't need it. Keep each backslash as the last character on its line; trailing comments break the shell continuation. The rank (--lora_r) is covered in the guide below.

LoRA rank guide:

  • r=8: Lightweight, 2-4MB adapters
  • r=16: Standard, good quality
  • r=32: High capacity, use for complex tasks
  • r=64: Maximum quality, 4× larger adapters
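
Adapter size grows linearly with the rank. For a feel for the numbers, here is a rough back-of-envelope sketch (illustrative only; it assumes square projections with LoRA on query and value only, so exact sizes vary by model and configuration):

# Rough LoRA adapter size estimate (illustrative; not a LitGPT API).
def lora_adapter_megabytes(n_layers, d_model, rank, n_proj=2, bytes_per_param=2):
    # Each adapted square projection adds A (rank x d_model) and B (d_model x rank).
    params = n_layers * n_proj * 2 * rank * d_model
    return params * bytes_per_param / 1e6

# Example with Phi-2-like dimensions: 32 layers, hidden size 2560, query+value adapted.
for r in (8, 16, 32, 64):
    print(f"r={r}: ~{lora_adapter_megabytes(32, 2560, r):.0f} MB of adapter weights (fp16)")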

Step 3: Train with LoRA

litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --train.epochs 3 \
  --train.learning_rate 1e-4 \
  --train.micro_batch_size 4 \
  --train.global_batch_size 32 \
  --out_dir out/phi2-lora

# Memory usage: ~8-12GB for Phi-2 with LoRA

Step 4: Merge LoRA weights (optional)

Merge LoRA adapters into base model for deployment:

litgpt merge_lora \
  out/phi2-lora/final \
  --out_dir out/phi2-merged

Now use the merged model:

from litgpt import LLM
llm = LLM.load("out/phi2-merged")

Workflow 3: Pretrain from scratch

Train a new model on your domain data.

Pretraining:
- [ ] Step 1: Prepare pretraining dataset
- [ ] Step 2: Configure model architecture
- [ ] Step 3: Set up multi-GPU training
- [ ] Step 4: Launch pretraining

Step 1: Prepare pretraining dataset

LitGPT expects tokenized data. Use prepare_dataset.py:

python scripts/prepare_dataset.py \
  --source_path data/my_corpus.txt \
  --checkpoint_dir checkpoints/tokenizer \
  --destination_path data/pretrain \
  --split train,val
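
To sanity-check tokenization before processing the full corpus, you can load the tokenizer that ships with a downloaded checkpoint (a sketch; it assumes LitGPT's Tokenizer wrapper and a checkpoint downloaded as in Workflow 1):

from pathlib import Path

from litgpt.tokenizer import Tokenizer

# Tokenizer files live alongside the downloaded checkpoint
# (assumes `litgpt download microsoft/phi-2` was run earlier).
tokenizer = Tokenizer(Path("checkpoints/microsoft/phi-2"))

ids = tokenizer.encode("A sample line from my pretraining corpus.", eos=True)
print(ids.shape)              # token count, as a 1-D tensor
print(tokenizer.decode(ids))  # round-trip back to text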

Step 2: Configure model architecture

Edit a config file or start from an existing one:

# config/pythia-160m.yaml
model_name: pythia-160m
block_size: 2048
vocab_size: 50304
n_layer: 12
n_head: 12
n_embd: 768
rotary_percentage: 0.25
parallel_residual: true
bias: true
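
The same hyperparameters map onto LitGPT's Config and GPT classes, which is handy for inspecting an architecture from Python before committing to a run (a sketch; the parameter count is a rough sanity check, not part of the pretraining workflow):

from litgpt.config import Config
from litgpt.model import GPT

# Instantiate the architecture described by the YAML above from its registered name.
config = Config.from_name("pythia-160m")
print(config.n_layer, config.n_head, config.n_embd, config.block_size)

model = GPT(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"~{n_params / 1e6:.0f}M parameters")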

Step 3: Set up multi-GPU training

# Single GPU
litgpt pretrain \
  --config config/pythia-160m.yaml \
  --data.data_dir data/pretrain \
  --train.max_tokens 10_000_000_000

# Multi-GPU with FSDP
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir data/pretrain \
  --devices 8 \
  --train.max_tokens 100_000_000_000

Step 4: Launch pretraining

For large-scale pretraining on cluster:

# Using SLURM
sbatch --nodes=8 --gpus-per-node=8 \
  pretrain_script.sh

# pretrain_script.sh content:
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir /shared/data/pretrain \
  --devices 8 \
  --num_nodes 8 \
  --train.global_batch_size 512 \
  --train.max_tokens 300_000_000_000

Workflow 4: Convert and deploy model

Export LitGPT models for production.

Model Deployment:
- [ ] Step 1: Test inference locally
- [ ] Step 2: Quantize model (optional)
- [ ] Step 3: Convert to GGUF (for llama.cpp)
- [ ] Step 4: Deploy with API

Step 1: Test inference locally

from litgpt import LLM

llm = LLM.load("out/phi2-lora/final")

# Single generation
print(llm.generate("What is machine learning?"))

# Streaming
for token in llm.generate("Explain quantum computing", stream=True):
    print(token, end="", flush=True)

# Batch inference
prompts = ["Hello", "Goodbye", "Thank you"]
results = [llm.generate(p) for p in prompts]

Step 2: Quantize model (optional)

Reduce model size with minimal quality loss:

# 4-bit NF4 quantization (~75% size reduction)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --dtype bfloat16 \
  --quantize bnb.nf4

# 4-bit NF4 with double quantization (slightly smaller still)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --quantize bnb.nf4-dq

Step 3: Convert to GGUF (for llama.cpp)

python scripts/convert_lit_checkpoint.py \
  --checkpoint_path out/phi2-lora/final \
  --output_path models/phi2.gguf \
  --model_name microsoft/phi-2

Step 4: Deploy with API

from fastapi import FastAPI
from litgpt import LLM

app = FastAPI()
llm = LLM.load("out/phi2-lora/final")

@app.post("/generate")
def generate(prompt: str, max_tokens: int = 100):
    result = llm.generate(
        prompt,
        max_new_tokens=max_tokens,
        temperature=0.7
    )
    return {"response": result}

# Run: uvicorn api:app --host 0.0.0.0 --port 8000
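
With the server running, the endpoint can be exercised from any HTTP client; for example with requests (as written above, prompt and max_tokens are passed as query parameters):

import requests

# Assumes the FastAPI app above is running locally on port 8000.
resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "What is machine learning?", "max_tokens": 50},
)
resp.raise_for_status()
print(resp.json()["response"])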

When to use vs alternatives

Use LitGPT when:

  • You want to understand LLM architectures (clean, readable code)
  • You need production-ready training recipes
  • Educational or research use
  • Prototyping new model ideas
  • You already use the Lightning ecosystem

Use alternatives instead:

  • Axolotl/TRL: More fine-tuning features, YAML configs
  • Megatron-Core: Maximum performance for >70B models
  • HuggingFace Transformers: Broadest model support
  • vLLM: Inference-only (no training)

Common issues

Issue: Out of memory during fine-tuning

Use LoRA instead of full fine-tuning:

# Instead of litgpt finetune (requires 40GB+)
litgpt finetune_lora  # Only needs 12-16GB

Or lower the micro-batch size and let gradient accumulation keep the effective batch size:

litgpt finetune_lora \
  ... \
  --train.micro_batch_size 1 \
  --train.global_batch_size 16

Issue: Training too slow

Enable Flash Attention (built-in, automatic on compatible hardware):

# Already enabled by default on Ampere+ GPUs (A100, RTX 30/40 series)
# No configuration needed

Use smaller micro-batch and accumulate:

--train.micro_batch_size 1 \
--train.global_batch_size 32
# Gradient accumulation is derived automatically: global / (micro × devices)
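
Spelling out the bookkeeping (plain arithmetic, assuming a single GPU; the variable names are illustrative, not LitGPT flags):

# Effective batch = micro_batch_size * devices * accumulation_iters,
# so the accumulation count follows from the batch sizes you pass.
micro_batch_size = 1
global_batch_size = 32
devices = 1

accumulation_iters = global_batch_size // (micro_batch_size * devices)
print(accumulation_iters)  # 32 micro-batches are accumulated per optimizer step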

Issue: Model not loading

Check model name:

# List all available models
litgpt download list

# Download if not exists
litgpt download meta-llama/Meta-Llama-3-8B

Verify checkpoints directory:

ls checkpoints/
# Should see: meta-llama/Meta-Llama-3-8B/

Issue: LoRA adapters too large

Reduce LoRA rank:

--lora_r 8  # Instead of 16 or 32

Apply LoRA to fewer layers:

# Keep LoRA on the query and value projections only
--lora_query true \
--lora_value true \
--lora_projection false \
--lora_mlp false

Advanced topics

Supported architectures: See references/supported-models.md for the complete list of 20+ model families with sizes and capabilities.

Training recipes: See references/training-recipes.md for proven hyperparameter configurations for pretraining and fine-tuning.

FSDP configuration: See references/distributed-training.md for multi-GPU training with Fully Sharded Data Parallel.

Custom architectures: See references/custom-models.md for implementing new model architectures in LitGPT style.

Hardware requirements

  • GPU: NVIDIA (CUDA 11.8+), AMD (ROCm), Apple Silicon (MPS)
  • Memory:
    • Inference (Phi-2): 6GB
    • LoRA fine-tuning (7B): 16GB
    • Full fine-tuning (7B): 40GB+
    • Pretraining (1B): 24GB
  • Storage: 5-50GB per model (depending on size)
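
These figures follow from a simple rule of thumb: weights take roughly 2 bytes per parameter in bf16, LoRA adds only a small adapter plus activations, and full fine-tuning must also hold gradients and optimizer state for every parameter, which is why it needs several times the inference footprint. A rough sketch of the weights-only part (illustrative; it ignores activations and KV cache):

# Weights-only memory in GB, assuming bf16 (2 bytes per parameter).
def weights_gb(n_params_billion, bytes_per_param=2):
    return n_params_billion * bytes_per_param

print(f"Phi-2 (2.7B): ~{weights_gb(2.7):.1f} GB")  # close to the ~6 GB inference figure once overhead is added
print(f"7B model:     ~{weights_gb(7):.0f} GB")    # before gradients and optimizer state are added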

Resources

GitHub Repository

zechenzhangAGI/AI-research-SKILLs
Path: 01-model-architecture/litgpt
Tags: ai, ai-research, claude, claude-code, claude-skills, codex

Related Skills

sglang

Meta

SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.

View skill

evaluating-llms-harness

Testing

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.

View skill

llamaguard

Other

LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.

View skill

langchain

Meta

LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.

View skill