
training-llms-megatron


About

This Claude Skill trains large language models (2B-462B parameters) using NVIDIA's Megatron-Core framework with advanced parallelism strategies. Use it when training models over 1B parameters, needing maximum GPU efficiency (47% MFU on H100), or requiring tensor/pipeline/sequence parallelism. It's a production-ready framework proven on models like Nemotron, LLaMA, and DeepSeek.

Skill documentation

Megatron-Core - Large-Scale LLM Training

Quick start

Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization on H100 GPUs through advanced parallelism strategies.

Installation:

# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

# Or pip
pip install megatron-core

Simple distributed training:

# Train with 2 GPUs using data parallelism
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py

# Or LLaMA-3 8B training
./examples/llama/train_llama3_8b_fp8.sh

Common workflows

Workflow 1: Train LLaMA-style model with 3D parallelism

Copy this checklist:

LLaMA Training Setup:
- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics

Step 1: Choose parallelism configuration

Model size determines parallelism strategy:

Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel
-----------|------|-----------------|-------------------|---------------|-----------------
7B         | 8    | 1               | 1                 | 8             | 1
13B        | 8    | 2               | 1                 | 4             | 1
70B        | 64   | 4               | 4                 | 4             | 1
405B       | 128  | 8               | 8                 | 1             | 2
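
Each row must satisfy GPUs = TP × PP × DP × CP. A minimal sketch of that check for the 70B row (the variable names are illustrative, not Megatron flags):

# Sanity check for the 70B row: TP x PP x CP x DP must equal the GPU count
TP=4; PP=4; CP=1; NUM_GPUS=64

if (( NUM_GPUS % (TP * PP * CP) != 0 )); then
  echo "GPU count not divisible by TP*PP*CP" >&2
  exit 1
fi
DP=$(( NUM_GPUS / (TP * PP * CP) ))
echo "Data-parallel size: $DP"   # prints 4, matching the table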

Step 2: Configure training hyperparameters

#!/bin/bash
# train_llama_70b.sh

GPUS_PER_NODE=8
NNODES=8  # 64 GPUs total
TP=4      # Tensor parallel
PP=4      # Pipeline parallel
CP=1      # Context parallel

# LLaMA 70B configuration
MODEL_SIZE=70  # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096

# Training hyperparameters
MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4

torchrun \
  --nproc_per_node=$GPUS_PER_NODE \
  --nnodes=$NNODES \
  pretrain_gpt.py \
  --tensor-model-parallel-size $TP \
  --pipeline-model-parallel-size $PP \
  --context-parallel-size $CP \
  --sequence-parallel \
  --num-layers $NUM_LAYERS \
  --hidden-size $HIDDEN_SIZE \
  --num-attention-heads $NUM_HEADS \
  --seq-length $SEQ_LENGTH \
  --max-position-embeddings $SEQ_LENGTH \
  --micro-batch-size $MICRO_BATCH \
  --global-batch-size $GLOBAL_BATCH \
  --lr $LR \
  --train-iters 100000 \
  --lr-decay-style cosine \
  --lr-warmup-iters 2000 \
  --weight-decay 0.1 \
  --clip-grad 1.0 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt

Step 3: Launch distributed training

# Single node (8 GPUs)
bash train_llama_70b.sh

# Multi-node with SLURM
sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh
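
As written, train_llama_70b.sh does not set up multi-node rendezvous, so the sbatch line above assumes extra launcher plumbing. One way to provide it is shown below; this is a minimal sketch, and the port, the c10d rendezvous choice, and the reuse of the 70B flag list are assumptions:

#!/bin/bash
#SBATCH --nodes=8
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1

# First host in the allocation serves as the rendezvous endpoint
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# One torchrun launcher per node; ranks are assigned via c10d rendezvous
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:29500" \
  --rdzv_id="$SLURM_JOB_ID" \
  pretrain_gpt.py $TRAINING_ARGS   # TRAINING_ARGS = the flag list from train_llama_70b.sh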

Step 4: Monitor performance metrics

Key metrics to track:

Model FLOP Utilization (MFU): Target >40% on H100
Throughput: Tokens/sec/GPU
Memory usage: <80GB per GPU for 70B model
Loss: Should decrease steadily
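
A back-of-envelope way to check the MFU target above (a sketch; the throughput and peak numbers are illustrative assumptions, so substitute your own measurements):

# Training FLOPs per token ~ 6 * params (forward + backward, ignoring attention)
PARAMS=70e9                  # model parameters
TOKENS_PER_SEC_PER_GPU=1000  # measured training throughput (example value)
PEAK_FLOPS=989e12            # approx. H100 SXM dense BF16 peak

python3 -c "print(f'MFU ~ {6 * $PARAMS * $TOKENS_PER_SEC_PER_GPU / $PEAK_FLOPS:.1%}')"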

Workflow 2: Configure Mixture of Experts (MoE) training

For sparse MoE models like Mixtral.

MoE Training:
- [ ] Step 1: Configure expert parallelism
- [ ] Step 2: Set MoE hyperparameters
- [ ] Step 3: Launch training with EP

Step 1: Configure expert parallelism

# Mixtral 8x7B example
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=1
EXPERT_PARALLEL=4  # Split 8 experts across 4 GPUs
DATA_PARALLEL=4

TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))
# = 2 * 1 * 4 * 4 = 32 GPUs

Step 2: Set MoE hyperparameters

torchrun \
  --nproc_per_node=8 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 1 \
  --expert-model-parallel-size 4 \
  --num-experts 8 \
  --moe-router-topk 2 \
  --moe-router-load-balancing-type aux_loss \
  --moe-aux-loss-coeff 0.01 \
  --hidden-size 4096 \
  --num-layers 32 \
  --num-attention-heads 32 \
  --seq-length 4096 \
  --max-position-embeddings 4096 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt

Step 3: Launch training with EP

Expert parallelism distributes different experts across GPUs, reducing memory while maintaining capacity.

Weights-only illustration (BF16, treating each expert as ~7B parameters):
Memory without EP: 8 experts × 7B params × 2 bytes ≈ 112 GB per GPU
Memory with EP=4: 2 experts × 7B params × 2 bytes ≈ 28 GB per GPU
Savings: 75% reduction in expert weight memory
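
The same weights-only arithmetic as a reusable sketch (the per-expert parameter count and byte width are illustrative assumptions; real usage also includes optimizer state, gradients, and activations):

# Weights-only estimate of expert memory per GPU
NUM_EXPERTS=8
EXPERT_PARALLEL=4
EXPERT_PARAMS=7e9       # parameters per expert (illustrative)
BYTES_PER_PARAM=2       # BF16 weights

python3 -c "print(f'{$NUM_EXPERTS / $EXPERT_PARALLEL * $EXPERT_PARAMS * $BYTES_PER_PARAM / 1e9:.0f} GB of expert weights per GPU')"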

Workflow 3: Optimize for maximum throughput

Achieve 47% MFU on H100.

Performance Optimization:
- [ ] Step 1: Enable Flash Attention
- [ ] Step 2: Use FP8 precision (H100)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees

Step 1: Enable Flash Attention and core optimizations

--use-mcore-models  # Use Megatron Core models
--transformer-impl transformer_engine  # Transformer Engine (uses fused/flash attention kernels)
--sequence-parallel  # Reduce activation memory (use with TP)

Step 2: Use FP8 precision (H100 only)

--fp8-hybrid  # FP8 mixed precision training
# Transformer Engine handles FP8 automatically

Result: 1.5-2x speedup on H100 vs BF16.

Step 3: Optimize micro-batch size

Find largest micro-batch that fits in memory:

# Start with 1, increase until OOM
for MBS in 1 2 4 8; do
  echo "Testing micro-batch-size=$MBS"
  torchrun ... --micro-batch-size $MBS
done
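
A variant of the sweep that keeps each probe short and stops at the first failing run, which is usually an OOM (the iteration count and flag bundle here are assumptions):

# Short probe runs; stop the sweep at the first failure (typically OOM)
COMMON_ARGS="--train-iters 20"   # plus the rest of your training flags
for MBS in 1 2 4 8; do
  echo "Testing micro-batch-size=$MBS"
  torchrun --nproc_per_node=8 pretrain_gpt.py --micro-batch-size "$MBS" $COMMON_ARGS \
    || { echo "micro-batch-size=$MBS failed; use the last value that succeeded"; break; }
done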

Typical values:

  • 7B model: 4-8
  • 70B model: 1-2
  • 405B model: 1

Step 4: Tune parallelism degrees

Rules of thumb:

Tensor Parallel: Use ≤8 (limited by NVLink within node)
Pipeline Parallel: Use for >70B models
Context Parallel: Use for sequences >8K tokens
Data Parallel: Fill remaining GPUs
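
A rough encoding of these rules of thumb as a helper function (a sketch only; the thresholds are assumptions drawn from the table in Workflow 1):

# Suggest parallelism sizes from model size (billions), GPU count, and sequence length
suggest_parallelism() {
  local PARAMS_B=$1 GPUS=$2 SEQ_LEN=$3
  local TP PP=1 CP=1
  if   (( PARAMS_B < 10 ));  then TP=1
  elif (( PARAMS_B < 30 ));  then TP=2
  elif (( PARAMS_B < 200 )); then TP=4; PP=4
  else TP=8; PP=8
  fi
  if (( SEQ_LEN > 8192 )); then CP=2; fi
  local DP=$(( GPUS / (TP * PP * CP) ))
  echo "TP=$TP PP=$PP CP=$CP DP=$DP (uses $(( TP * PP * CP * DP )) of $GPUS GPUs)"
}

suggest_parallelism 70 64 4096    # -> TP=4 PP=4 CP=1 DP=4
suggest_parallelism 405 128 8192  # -> TP=8 PP=8 CP=1 DP=2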

Example 405B on 128 H100s:

TP=8 (1 node)
PP=8 (across nodes)
CP=2 (long sequences)
DP=1
Total = 8 × 8 × 2 × 1 = 128 GPUs

When to use vs alternatives

Use Megatron-Core when:

  • Training models >10B parameters
  • Need maximum efficiency (target >40% MFU)
  • Using NVIDIA GPUs (A100, H100)
  • Production training at scale
  • Want fine-grained parallelism control

Use alternatives instead:

  • PyTorch FSDP: Models <70B, simpler API, PyTorch native
  • DeepSpeed: Easier setup, good for <100B models
  • HuggingFace Accelerate: Prototyping, simpler workflows
  • LitGPT: Educational, single-file implementations

Common issues

Issue: Low GPU utilization (<30% MFU)

Causes:

  1. Micro-batch too small
  2. Too much parallelism overhead
  3. Not using Flash Attention

Fixes:

# Increase micro-batch
--micro-batch-size 4  # Was 1

# Enable optimizations
--use-flash-attn
--sequence-parallel

# Reduce TP if >8
--tensor-model-parallel-size 4  # Was 16

Issue: Out of memory

Reduce memory with:

--tensor-model-parallel-size 2  # Split model across GPUs
--recompute-granularity full  # Gradient checkpointing
--recompute-method block  # Checkpoint transformer blocks
--recompute-num-layers 1  # Checkpoint every layer

Or use CPU/NVMe offloading:

--cpu-optimizer  # Offload optimizer to CPU
--cpu-optimizer-type ADAM  # CPU Adam variant

Issue: Training slower than expected

Check:

  1. Network bottleneck: Ensure InfiniBand/NVLink is enabled (quick diagnostics after this list)
  2. Pipeline bubbles: Use interleaved pipeline schedule
    --num-layers-per-virtual-pipeline-stage 2
    
  3. Data loading: Use fast data loader
    --dataloader-type cyclic
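
For the network-bottleneck check in item 1, a few quick diagnostics (a sketch; ibstat is only present on InfiniBand hosts with infiniband-diags installed):

# GPU-to-GPU links: NV# entries mean NVLink, PIX/PHB/SYS mean PCIe paths
nvidia-smi topo -m

# InfiniBand port state and rate
ibstat | grep -E "State|Rate"

# Make NCCL log which transport (IB, NET/Socket, NVLink/P2P) it actually picked
export NCCL_DEBUG=INFO   # then re-launch training and inspect the logs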
    

Issue: Diverging loss

Stabilize training:

--lr-warmup-iters 2000  # Longer warmup
--clip-grad 1.0  # Gradient clipping
--init-method-std 0.006  # Smaller init
--attention-dropout 0.0  # No dropout in attention
--hidden-dropout 0.0  # No dropout in FFN

Advanced topics

Parallelism strategies: See references/parallelism-guide.md for detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.

Performance benchmarks: See references/benchmarks.md for MFU numbers across different model sizes and GPU configurations.

Production configurations: See references/production-examples.md for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.

Training recipes: See references/training-recipes.md for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.

Hardware requirements

  • GPU: NVIDIA Ampere+ (A100, H100, B200)
    • Turing works but slower
    • FP8 requires Hopper/Ada/Blackwell
  • Network: InfiniBand or 400Gb+ Ethernet for multi-node
  • Memory per GPU:
    • 7B model: 40GB+
    • 70B model: 80GB (with TP=4)
    • 405B model: 80GB (with TP=8, PP=8)
  • Storage: Fast NVMe for checkpoints (1TB+ for 70B+ models)


Quick install

/plugin add https://github.com/zechenzhangAGI/AI-research-SKILLs/tree/main/megatron-core

Copy and paste this command into Claude Code to install the skill.

GitHub repository

zechenzhangAGI/AI-research-SKILLs
Path: 08-distributed-training/megatron-core
Tags: ai, ai-research, claude, claude-code, claude-skills, codex
