simpo-training
About
SimPO is a reference-free LLM alignment method that serves as a simpler, more efficient alternative to DPO. It eliminates the need for a reference model, and the SimPO paper reports gains over DPO of up to 6.4 points on AlpacaEval 2.0. Use it when you need faster, more straightforward preference alignment training than DPO or PPO.
Quick Install
Claude Code
Recommended:
/plugin add https://github.com/davila7/claude-code-templates
Or clone the repository directly:
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/simpo-training
Documentation
SimPO - Simple Preference Optimization
Quick start
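SimPO is a reference-free preference optimization method. It uses the policy's own length-normalized (average per-token) log-probability as the implicit reward, so no reference model is needed, and it adds a target margin between the chosen and rejected responses. The SimPO paper reports that it outperforms DPO.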
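As a minimal sketch of the objective for a single preference pair, assuming the summed token log-probabilities of each response are already available: the function name and arguments below are illustrative, not the API of scripts/run_simpo.py. The parameters beta, gamma_beta_ratio, and loss_type mirror the config keys used later in this document.

```python
import math


def simpo_loss(chosen_logp_sum, chosen_len, rejected_logp_sum, rejected_len,
               beta=2.0, gamma_beta_ratio=0.5, loss_type="sigmoid"):
    """SimPO loss for one preference pair (illustrative sketch).

    *_logp_sum: summed token log-probabilities of the response under the policy.
    *_len: number of response tokens, used for length normalization.
    """
    # The length-normalized (average) log-probability is the implicit reward.
    chosen_avg = chosen_logp_sum / chosen_len
    rejected_avg = rejected_logp_sum / rejected_len

    # beta * (difference - gamma/beta) equals beta * difference - gamma.
    logits = beta * ((chosen_avg - rejected_avg) - gamma_beta_ratio)

    if loss_type == "sigmoid":
        # -log(sigmoid(x)) computed as softplus(-x) in a numerically stable form.
        return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
    if loss_type == "hinge":
        return max(0.0, 1.0 - logits)
    raise ValueError(f"unknown loss_type: {loss_type}")


# The chosen response averages -1.0 per token, the rejected one -1.5.
# The gap of 0.5 exactly meets the margin ratio, so the loss is log(2).
loss = simpo_loss(-20.0, 20, -45.0, 30)
print(f"{loss:.4f}")  # 0.6931
```

No reference-model log-probabilities appear anywhere in the computation, which is where the memory and compute savings over DPO come from.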
Installation:
# Create environment
conda create -n simpo python=3.10 && conda activate simpo
# Install PyTorch 2.2.2
# Visit: https://pytorch.org/get-started/locally/
# Install alignment-handbook
git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook
python -m pip install .
# Install Flash Attention 2
python -m pip install flash-attn --no-build-isolation
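Training (Mistral 7B), run from the root of a clone of the SimPO repository (https://github.com/princeton-nlp/SimPO), which provides scripts/run_simpo.py and the training configs: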
ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py \
training_configs/mistral-7b-base-simpo.yaml
Common workflows
Workflow 1: Train from base model (Mistral 7B)
Config (mistral-7b-base-simpo.yaml):
# Model
model_name_or_path: mistralai/Mistral-7B-v0.1
torch_dtype: bfloat16
# Dataset
dataset_mixer:
HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
- train_prefs
- test_prefs
# SimPO hyperparameters
beta: 2.0 # Reward scaling (2.0-10.0)
gamma_beta_ratio: 0.5 # Ratio gamma/beta (0-1); target margin gamma = beta * ratio
loss_type: sigmoid # sigmoid or hinge
sft_weight: 0.0 # Optional SFT regularization
# Training
learning_rate: 5e-7 # Most sensitive setting; keep within 3e-7 to 1e-6
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
# Output
output_dir: ./outputs/mistral-7b-simpo
Launch training:
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml
Workflow 2: Fine-tune instruct model (Llama 3 8B)
Config (llama3-8b-instruct-simpo.yaml):
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
dataset_mixer:
argilla/ultrafeedback-binarized-preferences-cleaned: 1.0
beta: 2.5
gamma_beta_ratio: 0.5
learning_rate: 5e-7
sft_weight: 0.1 # Add SFT loss to preserve capabilities
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
output_dir: ./outputs/llama3-8b-simpo
Launch:
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py training_configs/llama3-8b-instruct-simpo.yaml
Workflow 3: Reasoning-intensive tasks (lower LR)
For math/code tasks:
model_name_or_path: deepseek-ai/deepseek-math-7b-base
dataset_mixer:
argilla/distilabel-math-preference-dpo: 1.0
beta: 5.0 # Higher for stronger signal
gamma_beta_ratio: 0.7 # Larger margin
learning_rate: 3e-7 # Lower LR for reasoning
sft_weight: 0.0
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
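Because the configs specify the margin as a ratio rather than an absolute value, it can help to compute the target margin gamma each workflow implies. The sketch below assumes gamma = beta * gamma_beta_ratio; the dictionary keys are illustrative labels, not config names.

```python
def target_margin(beta, gamma_beta_ratio):
    """Absolute target margin gamma implied by beta and gamma_beta_ratio."""
    return beta * gamma_beta_ratio


# (beta, gamma_beta_ratio) pairs taken from the three workflow configs above.
workflows = {
    "mistral-7b-base": (2.0, 0.5),
    "llama3-8b-instruct": (2.5, 0.5),
    "deepseek-math-7b": (5.0, 0.7),
}

for name, (beta, ratio) in workflows.items():
    print(f"{name}: gamma = {target_margin(beta, ratio):.2f}")
```

Raising beta at a fixed ratio therefore also raises the absolute margin the model is asked to achieve.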
When to use vs alternatives
Use SimPO when:
- You want simpler training than DPO (no reference model)
- You have preference data (chosen/rejected pairs)
- You want the gains over DPO reported in the SimPO paper
- Compute or memory is limited (no reference model to load or run)
- Single-node training is sufficient
Algorithm selection:
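- SimPO: Simplest setup, no reference model; reported to outperform DPO in the SimPO paper
- DPO: Requires a reference model; more conservative, since rewards are measured relative to that reference
- PPO: Maximum control; needs a reward model and a more complex setup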
- GRPO: Memory-efficient RL, no critic
Use alternatives instead:
- OpenRLHF: Multi-node distributed training, PPO/GRPO
- TRL: Need multiple methods in one framework
- DPO: Established baseline comparison
Common issues
Issue: Loss divergence
Reduce learning rate:
learning_rate: 3e-7 # Reduce from 5e-7
Reduce beta:
beta: 1.0 # Reduce from 2.0
Issue: Model forgets capabilities
Add SFT regularization:
sft_weight: 0.1 # Add SFT loss component
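One way to picture what sft_weight does, as a rough sketch rather than the trainer's actual code: a supervised term (the per-token negative log-likelihood of the chosen response) is scaled by sft_weight and added to the preference loss. The function and argument names are hypothetical.

```python
def regularized_loss(preference_loss, chosen_logp_sum, chosen_len, sft_weight=0.1):
    """Add a weighted SFT term (NLL of the chosen response) to a preference loss."""
    # Average negative log-likelihood per token of the chosen response.
    sft_loss = -chosen_logp_sum / chosen_len
    return preference_loss + sft_weight * sft_loss


# A preference loss of 0.6931 plus 0.1 times an NLL of 1.0 per token.
total = regularized_loss(0.6931, chosen_logp_sum=-20.0, chosen_len=20, sft_weight=0.1)
print(f"{total:.4f}")  # 0.7931
```

With sft_weight set to 0.0 the term vanishes and the result is the plain preference loss.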
Issue: Poor preference separation
Increase beta and margin:
beta: 5.0 # Increase from 2.0
gamma_beta_ratio: 0.8 # Increase from 0.5
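Before changing beta or the margin, it can be useful to quantify how well the model currently separates the pairs. The sketch below assumes you have already collected the average per-token log-probability of each chosen and rejected response on an evaluation set; the helper name and input format are hypothetical.

```python
def preference_stats(pairs):
    """Return (accuracy, mean_margin) for (chosen_avg_logp, rejected_avg_logp) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    # Positive margin means the chosen response is preferred by the policy.
    margins = [chosen - rejected for chosen, rejected in pairs]
    accuracy = sum(m > 0 for m in margins) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin


eval_pairs = [(-1.0, -1.5), (-2.0, -1.0), (-0.5, -1.5), (-1.0, -2.0)]
accuracy, mean_margin = preference_stats(eval_pairs)
print(accuracy, mean_margin)  # 0.75 0.375
```

An accuracy near 0.5 or a mean margin near zero after training suggests the separation problem described above.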
Issue: OOM during training
Reduce batch size:
per_device_train_batch_size: 1
gradient_accumulation_steps: 16 # Maintain effective batch
Enable gradient checkpointing:
gradient_checkpointing: true
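To check that a memory-saving change keeps the effective batch size unchanged, multiply the per-device batch size, the accumulation steps, and the number of GPUs. The GPU count of 4 below is an assumed example value, not something this document specifies.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_gpus):
    """Number of preference pairs contributing to each optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


num_gpus = 4  # assumption for illustration
before = effective_batch_size(2, 8, num_gpus)
after = effective_batch_size(1, 16, num_gpus)
print(before, after)  # 64 64
```

Halving the per-device batch while doubling the accumulation steps leaves the optimizer seeing the same number of pairs per update, at the cost of more forward and backward passes per step.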
Advanced topics
Loss functions: See references/loss-functions.md for sigmoid vs hinge loss, mathematical formulations, and when to use each.
Hyperparameter tuning: See references/hyperparameters.md for beta, gamma, learning rate selection guide, and model-size-specific recommendations.
Dataset preparation: See references/datasets.md for preference data formats, quality filtering, and custom dataset creation.
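As a small illustration of the chosen/rejected pair format, the sketch below checks one record for the three fields a preference dataset needs. It assumes plain-string fields named prompt, chosen, and rejected; real datasets may store responses as lists of chat messages instead, so adapt the checks accordingly. The function name is hypothetical.

```python
def validate_preference_record(record):
    """Return a list of problems found in one chosen/rejected record."""
    problems = []
    for key in ("prompt", "chosen", "rejected"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # Identical responses carry no preference signal.
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


record = {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"}
print(validate_preference_record(record))  # []
```

Running a check like this over a custom dataset before training catches empty or degenerate pairs early.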
Hardware requirements
- GPU: NVIDIA A100/H100 recommended
- VRAM (approximate; tighter setups may need CPU offload of optimizer states):
- 7B model: 1× A100 40GB (DeepSpeed ZeRO-3)
- 8B model: 2× A100 40GB
- 70B model: 8× A100 80GB
- Single-node: DeepSpeed ZeRO-3 sufficient
- Mixed precision: BF16 recommended
Memory optimization:
- DeepSpeed ZeRO-3 (default config)
- Gradient checkpointing
- Flash Attention 2
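For a rough sense of how these options interact with model size, the sketch below estimates per-GPU memory for model states under ZeRO-3 sharding. It assumes the commonly cited 16 bytes per parameter for mixed-precision Adam (2 for BF16 weights, 2 for gradients, 12 for FP32 master weights and the two Adam moments) and ignores activations, buffers, and any CPU offload. The function name is hypothetical.

```python
def model_state_gb_per_gpu(num_params_billion, num_gpus, bytes_per_param=16):
    """Rough per-GPU memory (GB) for weights, gradients, and Adam states under ZeRO-3."""
    total_bytes = num_params_billion * 1e9 * bytes_per_param
    # ZeRO-3 shards all three kinds of model state evenly across GPUs.
    return total_bytes / num_gpus / 1e9


for params, gpus in [(7, 1), (8, 2), (70, 8)]:
    print(f"{params}B on {gpus} GPU(s): ~{model_state_gb_per_gpu(params, gpus):.0f} GB each")
```

Where the estimate exceeds a card's capacity, CPU offload of optimizer states or additional GPUs would likely be needed on top of the optimizations listed above.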
Resources
- Paper: https://arxiv.org/abs/2405.14734 (NeurIPS 2024)
- GitHub: https://github.com/princeton-nlp/SimPO
- Models: https://huggingface.co/princeton-nlp
- Alignment Handbook: https://github.com/huggingface/alignment-handbook
