huggingface-accelerate

davila7
Tags: Development, Distributed Training, HuggingFace, Accelerate, DeepSpeed, FSDP, Mixed Precision, PyTorch, DDP, Unified API, Simple

About

HuggingFace Accelerate provides a unified API for distributed PyTorch training with just 4 lines of code changes. It automatically handles device placement and mixed precision, and supports backends like DDP, DeepSpeed, and FSDP. Use it to add distributed training to any PyTorch script, with first-class support across the HuggingFace ecosystem.

Quick Install

Claude Code

Primary (recommended):

npx skills add davila7/claude-code-templates

Plugin command (alternative):

/plugin add https://github.com/davila7/claude-code-templates

Git clone (alternative):

git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/huggingface-accelerate

Copy and paste one of these commands into Claude Code to install this skill.

Documentation

HuggingFace Accelerate - Unified Distributed Training

Quick start

Accelerate simplifies distributed training to 4 lines of code.

Installation:

pip install accelerate

Convert PyTorch script (4 lines):

import torch
+ from accelerate import Accelerator

+ accelerator = Accelerator()

  model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataloader = torch.utils.data.DataLoader(dataset)  # `dataset`: any torch Dataset defined elsewhere

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  for batch in dataloader:
      optimizer.zero_grad()
      loss = model(batch)
-     loss.backward()
+     accelerator.backward(loss)
      optimizer.step()

Run (single command):

accelerate launch train.py

Common workflows

Workflow 1: From single GPU to multi-GPU

Original script:

# train.py
import torch

model = torch.nn.Linear(10, 2).to('cuda')
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for epoch in range(10):
    for batch in dataloader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()

With Accelerate (4 lines added):

# train.py
import torch
from accelerate import Accelerator  # +1

accelerator = Accelerator()  # +2

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3

for epoch in range(10):
    for batch in dataloader:
        # No .to('cuda') needed - automatic!
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)  # +4
        optimizer.step()

Configure (interactive):

accelerate config

Questions:

  • Which machine? (single/multi GPU/TPU/CPU)
  • How many machines? (1)
  • Mixed precision? (no/fp16/bf16/fp8)
  • DeepSpeed? (no/yes)
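
The answers are saved to a YAML file (by default under ~/.cache/huggingface/accelerate/default_config.yaml). As a sketch, a single-machine, 8-GPU, bf16 setup might look like the following; exact field names can vary slightly by Accelerate version:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8
machine_rank: 0
main_training_function: main
use_cpu: false
```

Subsequent `accelerate launch` calls pick this file up automatically, so the training script itself never has to change.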

Launch (works on any setup):

# Single GPU
accelerate launch train.py

# Multi-GPU (8 GPUs)
accelerate launch --multi_gpu --num_processes 8 train.py

# Multi-node
accelerate launch --multi_gpu --num_processes 16 \
  --num_machines 2 --machine_rank 0 \
  --main_process_ip $MASTER_ADDR \
  train.py

Workflow 2: Mixed precision training

Enable FP16/BF16:

from accelerate import Accelerator

# FP16 (with gradient scaling)
accelerator = Accelerator(mixed_precision='fp16')

# BF16 (no scaling, more stable)
accelerator = Accelerator(mixed_precision='bf16')

# FP8 (H100+)
accelerator = Accelerator(mixed_precision='fp8')

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Forward passes through the prepared model are autocast automatically
for batch in dataloader:
    with accelerator.autocast():  # only needed for computation outside the model's forward
        loss = model(batch)
    accelerator.backward(loss)
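
Why FP16 needs gradient scaling while BF16 does not can be shown with plain arithmetic. This is a toy illustration of the loss-scaling idea (what a gradient scaler does under the hood), not Accelerate API:

```python
# FP16 loss scaling, illustrated: tiny gradients underflow to zero in
# float16, so the loss is multiplied by a large scale factor before
# backward, and gradients are divided by the same factor before the step.
FP16_MIN_NORMAL = 2 ** -14          # smallest normal float16 magnitude (~6.1e-5)

true_grad = 1e-7                    # would flush to zero in float16
scale = 1024.0                      # power of two, so scaling is exact

scaled_grad = true_grad * scale     # what backward() sees on the scaled loss
assert scaled_grad > FP16_MIN_NORMAL  # now representable in float16

unscaled = scaled_grad / scale      # recovered before optimizer.step()
assert unscaled == true_grad
```

BF16 keeps FP32's exponent range, so gradients this small do not underflow and no scaler is needed, which is why the `bf16` path is described as more stable.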

Workflow 3: DeepSpeed ZeRO integration

Enable DeepSpeed ZeRO-2:

from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,  # ZeRO-2
    offload_optimizer_device="none",
    gradient_accumulation_steps=4
)

accelerator = Accelerator(
    mixed_precision='bf16',
    deepspeed_plugin=deepspeed_plugin
)

# Same code as before!
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config
# Select: DeepSpeed → ZeRO-2

deepspeed_config.json:

{
    "fp16": {"enabled": false},
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8
    }
}

Launch. Note that --config_file expects an Accelerate config file, not the raw DeepSpeed JSON; reference the JSON from that config's deepspeed_config_file field, or pass it in code via DeepSpeedPlugin(hf_ds_config="deepspeed_config.json"). With a hypothetical config file name:

accelerate launch --config_file accelerate_config.yaml train.py

Workflow 4: FSDP (Fully Sharded Data Parallel)

Enable FSDP:

from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",  # ZeRO-3 equivalent
    auto_wrap_policy="transformer_based_wrap",
    cpu_offload=False
)

accelerator = Accelerator(
    mixed_precision='bf16',
    fsdp_plugin=fsdp_plugin
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config
# Select: FSDP → Full Shard → No CPU Offload

Workflow 5: Gradient accumulation

Accumulate gradients:

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # Handles accumulation
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()

Effective batch size: batch_size * num_gpus * gradient_accumulation_steps
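
As a concrete check in plain Python (not Accelerate API): with per-device batch 32, 8 GPUs, and 4 accumulation steps, each optimizer update sees gradients from 1024 samples, and the step fires only on every 4th micro-batch:

```python
# Effective batch size and step schedule under gradient accumulation.
batch_size = 32
num_gpus = 8
accumulation_steps = 4

effective = batch_size * num_gpus * accumulation_steps
print(effective)  # 1024

# Micro-batch indices on which the optimizer actually steps; on the other
# iterations accelerator.accumulate() skips gradient sync and the step.
step_indices = [i for i in range(1, 13) if i % accumulation_steps == 0]
print(step_indices)  # [4, 8, 12]
```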

When to use vs alternatives

Use Accelerate when:

  • Want simplest distributed training
  • Need single script for any hardware
  • Use HuggingFace ecosystem
  • Want flexibility (DDP/DeepSpeed/FSDP/Megatron)
  • Need quick prototyping

Key advantages:

  • 4 lines: Minimal code changes
  • Unified API: Same code for DDP, DeepSpeed, FSDP, Megatron
  • Automatic: Device placement, mixed precision, sharding
  • Interactive config: No manual launcher setup
  • Single launch: Works everywhere

Use an alternative instead:

  • PyTorch Lightning: callbacks and high-level training abstractions
  • Ray Train: multi-node orchestration and hyperparameter tuning
  • DeepSpeed (direct): full API control and advanced features
  • Raw DDP: maximum control, minimal abstraction

Common issues

Issue: Wrong device placement

Don't manually move to device:

# WRONG
batch = batch.to('cuda')

# CORRECT
# Accelerate handles it automatically after prepare()

Issue: Gradient accumulation not working

Use context manager:

# CORRECT
with accelerator.accumulate(model):
    optimizer.zero_grad()
    accelerator.backward(loss)
    optimizer.step()

Issue: Checkpointing in distributed

Use accelerator methods:

# Save: call save_state on ALL processes (it coordinates internally, and
# sharded backends like FSDP/DeepSpeed need every rank to participate)
accelerator.save_state('checkpoint/')

# Load on all processes
accelerator.load_state('checkpoint/')

# Guard only custom, unmanaged artifacts to the main process
if accelerator.is_main_process:
    print("checkpoint saved")

Issue: Different results with FSDP

Ensure same random seed:

from accelerate.utils import set_seed
set_seed(42)

Advanced topics

Megatron integration: See references/megatron-integration.md for tensor parallelism, pipeline parallelism, and sequence parallelism setup.

Custom plugins: See references/custom-plugins.md for creating custom distributed plugins and advanced configuration.

Performance tuning: See references/performance.md for profiling, memory optimization, and best practices.

Hardware requirements

  • CPU: Works (slow)
  • Single GPU: Works
  • Multi-GPU: DDP (default), DeepSpeed, or FSDP
  • Multi-node: DDP, DeepSpeed, FSDP, Megatron
  • TPU: Supported
  • Apple MPS: Supported

Launcher requirements:

  • DDP: torch.distributed.run (built-in)
  • DeepSpeed: deepspeed (pip install deepspeed)
  • FSDP: PyTorch 1.12+ (built-in)
  • Megatron: Custom setup

Resources

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/distributed-training-accelerate
Tags: anthropic, anthropic-claude, claude, claude-code

Related Skills

quantizing-models-bitsandbytes

This skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, achieving 50-75% memory reduction with minimal accuracy loss. It's ideal for running larger models on limited GPU memory or accelerating inference, supporting formats like INT8, NF4, and FP4. The skill integrates with HuggingFace Transformers and enables QLoRA training and 8-bit optimizers.

openrlhf-training

OpenRLHF is a high-performance RLHF training framework for fine-tuning large language models (7B-70B+ parameters) using methods like PPO, DPO, and GRPO. It leverages Ray for distributed architecture and vLLM for accelerated inference, achieving speeds 2x faster than alternatives like DeepSpeedChat. Use this skill when you need efficient, distributed RLHF training with optimized GPU resource sharing and ZeRO-3 support.

weights-and-biases

This skill integrates Weights & Biases for comprehensive ML experiment tracking and MLOps. It automatically logs metrics, visualizes training in real-time, and manages hyperparameter sweeps and model versions. Use it to compare runs, optimize models, and collaborate within team workspaces directly from your development environment.

huggingface-tokenizers

This skill provides high-performance tokenization using HuggingFace's Rust-based library, processing 1GB of text in under 20 seconds. It supports BPE, WordPiece, and Unigram algorithms while enabling custom tokenizer training and alignment tracking. Use it when you need production-fast tokenization or to build custom tokenizers integrated with the transformers ecosystem.