
peft-fine-tuning

davila7
Category: Other. Tags: Fine-Tuning, PEFT, LoRA, QLoRA, Parameter-Efficient, Adapters, Low-Rank, Memory Optimization, Multi-Adapter

About

This skill enables parameter-efficient fine-tuning of large language models using LoRA, QLoRA, and other adapter methods, drastically reducing GPU memory requirements. It is well suited to fine-tuning 7B-70B models on a single GPU by training less than 1% of parameters while maintaining accuracy. Its integration with the Hugging Face ecosystem supports multi-adapter serving and rapid iteration with task-specific adapters.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/davila7/claude-code-templates
Git Clone (Alternative)
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/peft-fine-tuning

Copy and paste this command in Claude Code to install this skill

Documentation

PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

When to use PEFT

Use PEFT/LoRA when:

  • Fine-tuning 7B-70B models on a single GPU (e.g., RTX 4090 or A100)
  • Need to train <1% of parameters (adapter checkpoints of tens of MB vs a multi-GB full model; see the parameter-count sketch after these lists)
  • Want fast iteration with multiple task-specific adapters
  • Deploying multiple fine-tuned variants from one base model

Use QLoRA (PEFT + quantization) when:

  • Fine-tuning 65B-70B models on a single 48 GB GPU (roughly 30B-class models on 24 GB)
  • Memory is the primary constraint
  • Can accept a small quality trade-off (typically a few points) vs full fine-tuning

Use full fine-tuning instead when:

  • Training small models (<1B parameters)
  • Need maximum quality and have compute budget
  • Significant domain shift requires updating all weights
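
The "<1% of parameters" figure follows directly from LoRA's low-rank factorization: a weight matrix of shape (d_out, d_in) is adapted by two small matrices A (r x d_in) and B (d_out x r), adding only r * (d_in + d_out) trainable parameters per targeted module. A rough back-of-the-envelope sketch (the dimensions and layer counts below are illustrative round numbers, not exact Llama 3.1 values):

# Rough LoRA parameter-count estimate (illustrative dimensions, not exact model values)
hidden = 4096          # model hidden size
r = 16                 # LoRA rank
n_layers = 32          # transformer layers
n_target = 4           # adapted projections per layer (q, k, v, o)

params_per_module = r * (hidden + hidden)      # A: r x hidden, B: hidden x r
lora_params = n_layers * n_target * params_per_module
print(f"{lora_params / 1e6:.1f}M trainable LoRA params")        # ~16.8M

base_params = 8e9
print(f"{100 * lora_params / base_params:.2f}% of an 8B model")  # ~0.21%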

Quick start

Installation

# Basic installation
pip install peft

# With quantization support (recommended)
pip install peft bitsandbytes

# Full stack
pip install peft transformers accelerate bitsandbytes datasets

LoRA fine-tuning (standard)

from transformers import (AutoModelForCausalLM, AutoTokenizer, TrainingArguments,
                          Trainer, DataCollatorForLanguageModeling)
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank (8-64, higher = more capacity)
    lora_alpha=32,                 # Scaling factor (typically 2*r)
    lora_dropout=0.05,             # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none"                    # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

# Collator that batches padded examples and sets labels = input_ids (pad tokens masked)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator
)

trainer.train()

# Save the adapter weights only (tens of MB vs ~16 GB for the full model)
model.save_pretrained("./lora-llama-adapter")

QLoRA fine-tuning (memory-efficient)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",   # Compute in bf16
    bnb_4bit_use_double_quant=True       # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,                              # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
# The 70B base loads as ~35 GB of 4-bit weights, so fine-tuning fits on a single 48 GB GPU
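
As a rough sanity check on GPU requirements (an estimate, not a measured figure), NF4 stores about half a byte per base weight, so the quantized weights alone cost roughly n_params / 2 bytes; the LoRA weights, their optimizer state, and activations come on top:

# Back-of-the-envelope estimate of 4-bit weight memory (weights only;
# activations and LoRA optimizer state add several more GB on top)
def quantized_weight_gb(n_params, bits=4):
    return n_params * bits / 8 / 1e9

print(f"{quantized_weight_gb(8e9):.0f} GB")    # ~4 GB  -> 8B trains comfortably on 24 GB
print(f"{quantized_weight_gb(70e9):.0f} GB")   # ~35 GB -> 70B needs a ~48 GB card or multi-GPU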

LoRA parameter selection

Rank (r) - capacity vs efficiency

| Rank | Trainable Params | Memory  | Quality | Use Case                       |
|------|------------------|---------|---------|--------------------------------|
| 4    | ~3M              | Minimal | Lower   | Simple tasks, prototyping      |
| 8    | ~7M              | Low     | Good    | Recommended starting point     |
| 16   | ~14M             | Medium  | Better  | General fine-tuning            |
| 32   | ~27M             | Higher  | High    | Complex tasks                  |
| 64   | ~54M             | High    | Highest | Domain adaptation, 70B models  |

Alpha (lora_alpha) - scaling factor

# Rule of thumb: alpha = 2 * rank
LoraConfig(r=16, lora_alpha=32)  # Standard
LoraConfig(r=16, lora_alpha=16)  # Conservative (lower learning rate effect)
LoraConfig(r=16, lora_alpha=64)  # Aggressive (higher learning rate effect)
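
Under the hood, lora_alpha only matters through the ratio alpha / r: the adapted layer computes roughly y = W x + (alpha / r) * B(A x), so doubling alpha at a fixed rank doubles the size of the learned update, much like raising the learning rate. A minimal sketch of that forward pass in plain PyTorch (toy dimensions, not PEFT internals):

import torch

# Toy LoRA forward pass: y = W x + (alpha / r) * B (A x)
d_in, d_out, r, alpha = 64, 64, 16, 32
W = torch.randn(d_out, d_in)          # frozen base weight
A = torch.randn(r, d_in) * 0.01       # trainable down-projection
B = torch.zeros(d_out, r)             # trainable up-projection (initialized to zero)

x = torch.randn(d_in)
scaling = alpha / r                   # this ratio is what lora_alpha actually controls
y = W @ x + scaling * (B @ (A @ x))   # identical to the base output until B is trained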

Target modules by architecture

# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2 / GPT-Neo
target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers
target_modules = "all-linear"  # PEFT 0.6.0+

Loading and merging adapters

Load trained adapter

from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-llama-adapter",
    device_map="auto"
)
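
After loading, a quick generation call is a cheap sanity check that the adapter was found and applied; a minimal sketch (the prompt mirrors the training template above, and the generation settings are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
inputs = tokenizer("### Instruction:\nSummarize LoRA in one sentence.\n\n### Response:\n",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))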

Merge adapter into base model

# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")

# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")

Multi-adapter serving

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach the first adapter under an explicit name
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto")
model = PeftModel.from_pretrained(base_model, "./adapter-task1", adapter_name="task1")

# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime
model.set_adapter("task1")  # Use task1 adapter
output1 = model.generate(**inputs)

model.set_adapter("task2")  # Switch to task2
output2 = model.generate(**inputs)

# Disable adapters (use base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)

PEFT methods comparison

| Method        | Trainable % | Memory   | Speed   | Best For                 |
|---------------|-------------|----------|---------|--------------------------|
| LoRA          | 0.1-1%      | Low      | Fast    | General fine-tuning      |
| QLoRA         | 0.1-1%      | Very Low | Medium  | Memory-constrained       |
| AdaLoRA       | 0.1-1%      | Low      | Medium  | Automatic rank selection |
| IA3           | 0.01%       | Minimal  | Fastest | Few-shot adaptation      |
| Prefix Tuning | 0.1%        | Low      | Medium  | Generation control       |
| Prompt Tuning | 0.001%      | Minimal  | Fast    | Simple task adaptation   |
| P-Tuning v2   | 0.1%        | Low      | Medium  | NLU tasks                |

IA3 (minimal parameters)

from peft import IA3Config

ia3_config = IA3Config(
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
    feedforward_modules=["down_proj"]
)
model = get_peft_model(model, ia3_config)
# Trains only 0.01% of parameters!

Prefix Tuning

from peft import PrefixTuningConfig

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,      # Prepended tokens
    prefix_projection=True       # Use MLP projection
)
model = get_peft_model(model, prefix_config)
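
Prompt Tuning (listed in the comparison table above) is even lighter than Prefix Tuning: it learns only the embeddings of a few virtual tokens. A minimal config sketch, assuming the same get_peft_model flow as the other examples; the init text and token count are illustrative:

from peft import PromptTuningConfig, PromptTuningInit, get_peft_model

prompt_config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=8,                          # learned soft-prompt embeddings
    prompt_tuning_init=PromptTuningInit.TEXT,      # initialize from real token embeddings
    prompt_tuning_init_text="Answer the question concisely:",
    tokenizer_name_or_path="meta-llama/Llama-3.1-8B",
)
model = get_peft_model(model, prompt_config)
model.print_trainable_parameters()                 # typically ~0.001% of parameters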

Integration patterns

With TRL (SFTTrainer)

from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=512),
    train_dataset=dataset,
    peft_config=lora_config,  # Pass LoRA config directly
)
trainer.train()

With Axolotl (YAML config)

# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_target_linear: true  # Target all linear layers

With vLLM (inference)

from vllm import LLM
from vllm.lora.request import LoRARequest

# Load base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

# Serve with adapter
outputs = llm.generate(
    prompts,
    lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
)

Performance benchmarks

Memory usage (Llama 3.1 8B)

| Method           | GPU Memory | Trainable Params |
|------------------|------------|------------------|
| Full fine-tuning | 60+ GB     | 8B (100%)        |
| LoRA r=16        | 18 GB      | 14M (0.17%)      |
| QLoRA r=16       | 6 GB       | 14M (0.17%)      |
| IA3              | 16 GB      | 800K (0.01%)     |

Training speed (A100 80GB)

| Method  | Tokens/sec | vs Full FT |
|---------|------------|------------|
| Full FT | 2,500      | 1x         |
| LoRA    | 3,200      | 1.3x       |
| QLoRA   | 2,100      | 0.84x      |

Quality (MMLU benchmark)

| Model       | Full FT | LoRA | QLoRA |
|-------------|---------|------|-------|
| Llama 2-7B  | 45.3    | 44.8 | 44.1  |
| Llama 2-13B | 54.8    | 54.2 | 53.5  |
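
These numbers depend heavily on hardware, sequence length, batch size, and library versions, so they are best re-measured on your own setup; a minimal sketch using PyTorch's peak-memory counters around the training run from earlier:

import torch

torch.cuda.reset_peak_memory_stats()

trainer.train()   # or just a few steps, with max_steps set in TrainingArguments

peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak GPU memory: {peak_gb:.1f} GB")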

Common issues

CUDA OOM during training

# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size + increase accumulation
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16
)

# Solution 3: Use QLoRA
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

Adapter not applying

# Verify adapter is active
print(model.active_adapters)  # Should show adapter name

# Check trainable parameters
model.print_trainable_parameters()

# Ensure model in training mode
model.train()

Quality degradation

# Increase rank
LoraConfig(r=32, lora_alpha=64)

# Target more modules
target_modules = "all-linear"

# Use more training data and epochs
TrainingArguments(num_train_epochs=5)

# Lower learning rate
TrainingArguments(learning_rate=1e-4)

Best practices

  1. Start with r=8-16, increase if quality insufficient
  2. Use alpha = 2 * rank as starting point
  3. Target attention + MLP layers for best quality/efficiency
  4. Enable gradient checkpointing for memory savings
  5. Save adapters frequently (small files, easy rollback)
  6. Evaluate on held-out data before merging (see the perplexity sketch after this list)
  7. Use QLoRA when a model won't otherwise fit in GPU memory (e.g., 65B-70B on a single 48 GB GPU)
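
One lightweight way to follow point 6 is to check held-out loss and perplexity before calling merge_and_unload; a minimal sketch, assuming a tokenized evaluation split eval_tokenized prepared the same way as the training data and the data_collator from the LoRA example above:

import math

eval_trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=eval_tokenized,     # held-out split, tokenized like the training data
    data_collator=data_collator
)
metrics = eval_trainer.evaluate()
print(f"eval loss: {metrics['eval_loss']:.3f}  "
      f"perplexity: {math.exp(metrics['eval_loss']):.1f}")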

Resources

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/fine-tuning-peft
Tags: anthropic, anthropic-claude, claude, claude-code

Related Skills

quantizing-models-bitsandbytes

Other

This skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, reducing memory usage by 50-75% with minimal accuracy loss for GPU-constrained environments. It supports multiple formats (INT8, NF4, FP4) and enables QLoRA training and 8-bit optimizers. Use it with HuggingFace Transformers when you need to fit larger models into limited memory or accelerate inference.


axolotl

Design

This skill provides expert guidance for fine-tuning LLMs using the Axolotl framework, helping developers configure YAML files and implement advanced techniques like LoRA/QLoRA and DPO/KTO. Use it when working with Axolotl features, debugging code, or learning best practices for fine-tuning across 100+ models. It offers comprehensive assistance including multimodal support and performance optimization.


unsloth

Design

This skill provides expert guidance for fast fine-tuning with Unsloth, offering 2-5x faster training and 50-80% memory reduction. It helps developers implement and debug LoRA/QLoRA optimizations for models like Llama and Mistral. Use it when working with Unsloth's APIs, features, or best practices for efficient model training.


llama-factory

Design

This skill provides expert guidance for fine-tuning LLMs using the LLaMA-Factory framework, including its no-code WebUI and support for 100+ models with QLoRA quantization. Use it when implementing, debugging, or learning best practices for LLaMA-Factory solutions in your development workflow.
