simpo-training

davila7

Updated Today

5 views

15,516

1,344

15,516

View on GitHub

OtherPost-TrainingSimPOPreference OptimizationAlignmentDPO AlternativeReference-FreeLLM AlignmentEfficient Training

About

SimPO is a reference-free LLM alignment method that serves as a simpler, more efficient alternative to DPO. It eliminates the need for a reference model while achieving better performance, such as a +6.4 point improvement on AlpacaEval 2.0. Use it when you need faster, more straightforward preference alignment training compared to DPO or PPO.

Quick Install

Claude Code

Recommended

Plugin CommandRecommended

/plugin add https://github.com/davila7/claude-code-templates

Git CloneAlternative

git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/simpo-training

Copy and paste this command in Claude Code to install this skill

Documentation

SimPO - Simple Preference Optimization

Quick start

SimPO is a reference-free preference optimization method that outperforms DPO without needing a reference model.

Installation:

# Create environment
conda create -n simpo python=3.10 && conda activate simpo

# Install PyTorch 2.2.2
# Visit: https://pytorch.org/get-started/locally/

# Install alignment-handbook
git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook
python -m pip install .

# Install Flash Attention 2
python -m pip install flash-attn --no-build-isolation

Training (Mistral 7B):

ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py \
  training_configs/mistral-7b-base-simpo.yaml

Common workflows

Workflow 1: Train from base model (Mistral 7B)

Config (mistral-7b-base-simpo.yaml):

# Model
model_name_or_path: mistralai/Mistral-7B-v0.1
torch_dtype: bfloat16

# Dataset
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
  - train_prefs
  - test_prefs

# SimPO hyperparameters
beta: 2.0                  # Reward scaling (2.0-10.0)
gamma_beta_ratio: 0.5       # Target margin (0-1)
loss_type: sigmoid          # sigmoid or hinge
sft_weight: 0.0             # Optional SFT regularization

# Training
learning_rate: 5e-7         # Critical: 3e-7 to 1e-6
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8

# Output
output_dir: ./outputs/mistral-7b-simpo

Launch training:

accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml

Workflow 2: Fine-tune instruct model (Llama 3 8B)

Config (llama3-8b-instruct-simpo.yaml):

model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

dataset_mixer:
  argilla/ultrafeedback-binarized-preferences-cleaned: 1.0

beta: 2.5
gamma_beta_ratio: 0.5
learning_rate: 5e-7
sft_weight: 0.1             # Add SFT loss to preserve capabilities

num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
output_dir: ./outputs/llama3-8b-simpo

Launch:

accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py training_configs/llama3-8b-instruct-simpo.yaml

Workflow 3: Reasoning-intensive tasks (lower LR)

For math/code tasks:

model_name_or_path: deepseek-ai/deepseek-math-7b-base

dataset_mixer:
  argilla/distilabel-math-preference-dpo: 1.0

beta: 5.0                   # Higher for stronger signal
gamma_beta_ratio: 0.7       # Larger margin
learning_rate: 3e-7         # Lower LR for reasoning
sft_weight: 0.0

num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16

When to use vs alternatives

Use SimPO when:

Want simpler training than DPO (no reference model)
Have preference data (chosen/rejected pairs)
Need better performance than DPO
Limited compute resources
Single-node training sufficient

Algorithm selection:

SimPO: Simplest, best performance, no reference model
DPO: Need reference model baseline, more conservative
PPO: Maximum control, need reward model, complex setup
GRPO: Memory-efficient RL, no critic

Use alternatives instead:

OpenRLHF: Multi-node distributed training, PPO/GRPO
TRL: Need multiple methods in one framework
DPO: Established baseline comparison

Common issues

Issue: Loss divergence

Reduce learning rate:

learning_rate: 3e-7  # Reduce from 5e-7

Reduce beta:

beta: 1.0  # Reduce from 2.0

Issue: Model forgets capabilities

Add SFT regularization:

sft_weight: 0.1  # Add SFT loss component

Issue: Poor preference separation

Increase beta and margin:

beta: 5.0            # Increase from 2.0
gamma_beta_ratio: 0.8  # Increase from 0.5

Issue: OOM during training

Reduce batch size:

per_device_train_batch_size: 1
gradient_accumulation_steps: 16  # Maintain effective batch

Enable gradient checkpointing:

gradient_checkpointing: true

Advanced topics

Loss functions: See references/loss-functions.md for sigmoid vs hinge loss, mathematical formulations, and when to use each.

Hyperparameter tuning: See references/hyperparameters.md for beta, gamma, learning rate selection guide, and model-size-specific recommendations.

Dataset preparation: See references/datasets.md for preference data formats, quality filtering, and custom dataset creation.

Hardware requirements

GPU: NVIDIA A100/H100 recommended
VRAM:
- 7B model: 1× A100 40GB (DeepSpeed ZeRO-3)
- 8B model: 2× A100 40GB
- 70B model: 8× A100 80GB
Single-node: DeepSpeed ZeRO-3 sufficient
Mixed precision: BF16 recommended

Memory optimization:

DeepSpeed ZeRO-3 (default config)
Gradient checkpointing
Flash Attention 2

Resources

Paper: https://arxiv.org/abs/2405.14734 (NeurIPS 2024)
GitHub: https://github.com/princeton-nlp/SimPO
Models: https://huggingface.co/princeton-nlp
Alignment Handbook: https://github.com/huggingface/alignment-handbook

GitHub Repository

davila7/claude-code-templates

Path: cli-tool/components/skills/ai-research/post-training-simpo

anthropicanthropic-claudeclaudeclaude-code

Related Skills

moe-training

Other

This skill enables training Mixture of Experts (MoE) models using DeepSpeed or HuggingFace, offering a 5x cost reduction versus dense models for large-scale training. It's ideal for implementing sparse architectures like Mixtral 8x7B and scaling model capacity without a proportional compute increase. The skill covers core MoE components including architectures, routing, load balancing, expert parallelism, and inference optimization.

View skill

fine-tuning-with-trl

Other

This skill enables fine-tuning LLMs with TRL's reinforcement learning methods including SFT, DPO, and PPO for RLHF and preference alignment. Use it when you need to align models with human feedback or optimize for specific rewards using HuggingFace Transformers. It provides a complete toolkit for instruction tuning, preference alignment, and reward model training.

View skill

grpo-rl-training

Design

This skill provides expert guidance for implementing GRPO (Group Relative Policy Optimization) reinforcement learning fine-tuning using the TRL library. It's designed for training models on tasks requiring structured outputs, verifiable reasoning, or specific formats like JSON/XML. Use it when you need to fine-tune language models with custom reward functions for objective, task-specific improvements.

View skill

openrlhf-training

Design

OpenRLHF is a high-performance RLHF training framework for fine-tuning large language models (7B-70B+ parameters) using methods like PPO, DPO, and GRPO. It leverages Ray for distributed computing and vLLM for accelerated inference, achieving speeds twice as fast as DeepSpeedChat. Use this skill when you need efficient, distributed RLHF training with optimized GPU resource sharing and ZeRO-3 support.

View skill