model-merging
About
This skill enables merging multiple fine-tuned models using mergekit to combine capabilities without retraining. It's ideal for creating specialized models by blending expertise (like math, coding, and chat) and for rapid experimentation with variants. The skill covers key techniques including SLERP, TIES-Merging, DARE, and Task Arithmetic.
Quick Install
Claude Code
Recommended: /plugin add https://github.com/davila7/claude-code-templates
Alternative: git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/model-merging
Copy and paste one of these commands in Claude Code to install this skill.
Documentation
Model Merging: Combining Pre-trained Models
When to Use This Skill
Use Model Merging when you need to:
- Combine capabilities from multiple fine-tuned models without retraining
- Create specialized models by blending domain-specific expertise (math + coding + chat)
- Improve performance beyond single models (often +5-10% on benchmarks)
- Reduce training costs - no GPUs needed, merges run on CPU
- Experiment rapidly - create new model variants in minutes, not days
- Preserve multiple skills - merge without catastrophic forgetting
Success stories: Marcoro14-7B-slerp (top of the Open LLM Leaderboard, 02/2024); many leading models on Hugging Face are produced by merging.
Tools: mergekit (Arcee AI), LazyMergekit, Model Soup
Installation
# Install mergekit
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e .
# Or via pip
pip install mergekit
# Optional: Transformers and PyTorch (for loading and testing merged models)
pip install transformers torch
Quick Start
Simple Linear Merge
# config.yml - Merge two models with equal weights
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.5
dtype: bfloat16
# Run merge
mergekit-yaml config.yml ./merged-model --cuda
# Use merged model (it loads like any Hugging Face model)
python -c "from transformers import pipeline; print(pipeline('text-generation', model='./merged-model')('Hello, world'))"
SLERP Merge (Best for 2 Models)
# config.yml - Spherical interpolation
merge_method: slerp
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
    parameters:
      t: 0.5 # Interpolation factor (0=model1, 1=model2)
dtype: bfloat16
Core Concepts
1. Merge Methods
Linear (Model Soup)
- Simple weighted average of parameters
- Fast, works well for similar models
- Can merge 2+ models
merged_weights = w1 * model1_weights + w2 * model2_weights + w3 * model3_weights
# where w1 + w2 + w3 = 1
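To make the arithmetic concrete, here is a minimal sketch of a linear merge over full state dicts with transformers and torch (the helper function and model handling are illustrative; mergekit performs this per tensor with more careful memory management):
# Sketch: weighted average of parameters across models with identical architectures
import torch
from transformers import AutoModelForCausalLM

def linear_merge(model_ids, weights):
    # weights are assumed to sum to 1.0
    models = [AutoModelForCausalLM.from_pretrained(m, torch_dtype=torch.bfloat16) for m in model_ids]
    merged_state = {k: torch.zeros_like(v) for k, v in models[0].state_dict().items()}
    for w, model in zip(weights, models):
        for k, v in model.state_dict().items():
            merged_state[k] += w * v
    models[0].load_state_dict(merged_state)
    return models[0]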
SLERP (Spherical Linear Interpolation)
- Interpolates along sphere in weight space
- Preserves magnitude of weight vectors
- Best for merging 2 models
- Smoother than linear
# SLERP formula
merged = (sin((1-t)*θ) / sin(θ)) * model1 + (sin(t*θ) / sin(θ)) * model2
# where θ = arccos(dot(model1, model2))
# t ∈ [0, 1]
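As a reference point, the same interpolation can be sketched per tensor in a few lines of torch (an illustration of the formula above, not mergekit's exact implementation, which also handles edge cases such as nearly parallel vectors):
# Sketch: spherical linear interpolation between two flattened weight tensors
import torch

def slerp(t, v1, v2, eps=1e-8):
    a = v1 / (v1.norm() + eps)            # normalize copies to compute the angle
    b = v2 / (v2.norm() + eps)
    dot = torch.clamp((a * b).sum(), -1.0, 1.0)
    theta = torch.acos(dot)
    if theta.abs() < 1e-4:                # nearly parallel: fall back to linear interpolation
        return (1 - t) * v1 + t * v2
    s = torch.sin(theta)
    return (torch.sin((1 - t) * theta) / s) * v1 + (torch.sin(t * theta) / s) * v2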
Task Arithmetic
- Extract "task vectors" (fine-tuned - base)
- Combine task vectors, add to base
- Good for merging multiple specialized models
# Task vector
task_vector = finetuned_model - base_model
# Merge multiple task vectors
merged = base_model + α₁*task_vector₁ + α₂*task_vector₂
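A rough sketch of the same idea on state dicts (the model IDs, scaling factors, and helper function are illustrative; mergekit's task_arithmetic method does the equivalent per tensor):
# Sketch: add scaled task vectors (finetuned - base) back onto the base model
import torch
from transformers import AutoModelForCausalLM

def task_arithmetic_merge(base_id, finetuned_ids, alphas):
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    base_state = base.state_dict()
    merged_state = {k: v.clone() for k, v in base_state.items()}
    for alpha, ft_id in zip(alphas, finetuned_ids):
        ft_state = AutoModelForCausalLM.from_pretrained(ft_id, torch_dtype=torch.bfloat16).state_dict()
        for k in merged_state:
            merged_state[k] += alpha * (ft_state[k] - base_state[k])  # scaled task vector
    base.load_state_dict(merged_state)
    return base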
TIES-Merging
- Task arithmetic + sparsification
- Resolves sign conflicts in parameters
- Best for merging many task-specific models
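The core steps (trim each task vector to its largest-magnitude entries, elect a per-parameter sign, then average only the deltas that agree with the elected sign) can be sketched per tensor as follows; this is a simplified illustration, not mergekit's exact code:
# Sketch: TIES-style merge of several delta tensors (finetuned - base) for one parameter
import torch

def ties_merge(deltas, density=0.5):
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.numel()))                        # keep top-k entries by magnitude
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))
    stacked = torch.stack(trimmed)
    elected_sign = torch.sign(stacked.sum(dim=0))                    # sign election by total mass
    agrees = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    summed = (stacked * agrees).sum(dim=0)
    counts = agrees.sum(dim=0).clamp(min=1)
    return summed / counts                                           # disjoint mean of agreeing deltas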
DARE (Drop And REscale)
- Randomly drops fine-tuned parameters
- Rescales remaining parameters
- Reduces redundancy, maintains performance
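A sketch of the drop-and-rescale step on a single delta tensor (note that mergekit's density parameter is the fraction kept, i.e. density = 1 - drop rate; dare_ties additionally applies TIES-style sign election after this step):
# Sketch: randomly drop a fraction p of the delta and rescale survivors by 1 / (1 - p)
import torch

def dare(delta, p=0.5):
    keep_mask = (torch.rand_like(delta.float()) > p).to(delta.dtype)
    return delta * keep_mask / (1.0 - p)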
2. Configuration Structure
# Basic structure
merge_method: <method> # linear, slerp, ties, dare_ties, task_arithmetic
base_model: <path> # Optional: base model for task arithmetic
models:
  - model: <path/to/model1>
    parameters:
      weight: <float> # Merge weight
      density: <float> # For TIES/DARE
  - model: <path/to/model2>
    parameters:
      weight: <float>
parameters:
  # Method-specific parameters
dtype: <dtype> # bfloat16, float16, float32
# Optional
slices: # Layer-wise merging
tokenizer: # Tokenizer configuration
Merge Methods Guide
Linear Merge
Best for: Simple model combinations, equal weighting
merge_method: linear
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      weight: 0.4
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.3
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      weight: 0.3
dtype: bfloat16
SLERP Merge
Best for: Two models, smooth interpolation
merge_method: slerp
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
    parameters:
      t: 0.5 # 0.0 = first model, 1.0 = second model
dtype: bfloat16
Layer-specific SLERP:
merge_method: slerp
slices:
  - sources:
      - model: model_a
        layer_range: [0, 32]
      - model: model_b
        layer_range: [0, 32]
    parameters:
      t:
        - filter: self_attn # Attention layers
          value: 0.3
        - filter: mlp # MLP layers
          value: 0.7
        - value: 0.5 # Default for other layers
dtype: bfloat16
Task Arithmetic
Best for: Combining specialized skills
merge_method: task_arithmetic
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1 # Math
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B # Chat
    parameters:
      weight: 0.3
  - model: ajibawa-2023/Code-Mistral-7B # Code
    parameters:
      weight: 0.2
dtype: bfloat16
TIES-Merging
Best for: Many models, resolving conflicts
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5 # Keep top 50% of parameters
      weight: 1.0
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 1.0
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      density: 0.5
      weight: 1.0
parameters:
  normalize: true
dtype: bfloat16
DARE Merge
Best for: Reducing redundancy
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5 # Keep 50% of the deltas (drop the rest)
      weight: 0.6
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 0.4
parameters:
  int8_mask: true # Use int8 for masks (saves memory)
dtype: bfloat16
Advanced Patterns
Layer-wise Merging
# Different models for different layers
merge_method: passthrough
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 16] # First half
  - sources:
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [16, 32] # Second half
dtype: bfloat16
MoE from Merged Models
# Create a Mixture of Experts (this config format is consumed by the mergekit-moe command rather than mergekit-yaml)
base_model: mistralai/Mistral-7B-v0.1
experts:
  - source_model: WizardLM/WizardMath-7B-V1.1
    positive_prompts:
      - "math"
      - "calculate"
  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts:
      - "chat"
      - "conversation"
  - source_model: ajibawa-2023/Code-Mistral-7B
    positive_prompts:
      - "code"
      - "python"
dtype: bfloat16
Tokenizer Merging
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
  - model: custom/specialized-model
tokenizer:
  source: "union" # Combine vocabularies from both models
  tokens:
    <|special_token|>:
      source: "custom/specialized-model"
Best Practices
1. Model Compatibility
# ✅ Good: Same architecture
models = [
    "mistralai/Mistral-7B-v0.1",
    "teknium/OpenHermes-2.5-Mistral-7B",  # Both Mistral 7B
]
# ❌ Bad: Different architectures
models = [
    "meta-llama/Llama-2-7b-hf",   # Llama
    "mistralai/Mistral-7B-v0.1",  # Mistral (incompatible!)
]
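A quick pre-merge check can be done from the model configs alone; here is a sketch using transformers' AutoConfig (the helper name and the set of fields compared are our choice, not part of mergekit):
# Sketch: verify candidate models share architecture and shapes before merging
from transformers import AutoConfig

def check_compatible(model_ids):
    configs = [AutoConfig.from_pretrained(m) for m in model_ids]
    fields = ("model_type", "hidden_size", "num_hidden_layers", "num_attention_heads")
    reference = {f: getattr(configs[0], f, None) for f in fields}
    for model_id, cfg in zip(model_ids, configs):
        for f in fields:
            if getattr(cfg, f, None) != reference[f]:
                raise ValueError(f"{model_id} differs on {f}: {getattr(cfg, f, None)} vs {reference[f]}")
    return True

# Example: check_compatible(["mistralai/Mistral-7B-v0.1", "teknium/OpenHermes-2.5-Mistral-7B"])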
2. Weight Selection
# ✅ Good: Weights sum to 1.0
models:
  - model: model_a
    parameters:
      weight: 0.6
  - model: model_b
    parameters:
      weight: 0.4 # 0.6 + 0.4 = 1.0
# ⚠️ Acceptable: Weights don't sum to 1 (common in task arithmetic)
models:
  - model: model_a
    parameters:
      weight: 0.8
  - model: model_b
    parameters:
      weight: 0.8 # May boost performance
3. Method Selection
# Choose merge method based on use case:
# 2 models, smooth blend → SLERP
merge_method = "slerp"
# 3+ models, simple average → Linear
merge_method = "linear"
# Multiple task-specific models → Task Arithmetic or TIES
merge_method = "ties"
# Want to reduce redundancy → DARE
merge_method = "dare_ties"
4. Density Tuning (TIES/DARE)
# Start conservative (keep more parameters)
parameters:
  density: 0.8 # Keep 80%
# If performance holds, increase sparsity
parameters:
  density: 0.5 # Keep 50%
# If performance degrades, reduce sparsity
parameters:
  density: 0.9 # Keep 90%
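In practice the best density is found empirically, for example with a small sweep that writes one config per density value and invokes mergekit on each; a sketch (the base config path and output directories are placeholders):
# Sketch: sweep TIES/DARE density by generating configs and running mergekit-yaml on each
import copy
import subprocess
import yaml

with open("ties-config.yml") as f:
    base_config = yaml.safe_load(f)

for density in (0.8, 0.6, 0.5):
    config = copy.deepcopy(base_config)
    for entry in config["models"]:
        entry.setdefault("parameters", {})["density"] = density
    config_path = f"ties-density-{density}.yml"
    with open(config_path, "w") as f:
        yaml.safe_dump(config, f)
    subprocess.run(["mergekit-yaml", config_path, f"./merged-density-{density}"], check=True)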
5. Layer-specific Merging
# Preserve base model's beginning and end
merge_method: passthrough
slices:
  - sources:
      - model: base_model
        layer_range: [0, 2] # Keep first layers
  - sources:
      - model: merged_middle # Merged middle layers
        layer_range: [2, 30]
  - sources:
      - model: base_model
        layer_range: [30, 32] # Keep last layers
Evaluation & Testing
Benchmark Merged Models
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load merged model
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")
# Test on various tasks
test_prompts = {
    "math": "Calculate: 25 * 17 =",
    "code": "Write a Python function to reverse a string:",
    "chat": "What is the capital of France?",
}
for task, prompt in test_prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=100)
    print(f"{task}: {tokenizer.decode(outputs[0])}")
Common Benchmarks
- Open LLM Leaderboard: General capabilities
- MT-Bench: Multi-turn conversation
- MMLU: Multitask accuracy
- HumanEval: Code generation
- GSM8K: Math reasoning
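Beyond spot-checking prompts, a quick sanity metric before running full benchmarks is perplexity on a few held-out texts; a minimal sketch with transformers (the texts and model path are placeholders):
# Sketch: compute perplexity of the merged model on a small held-out sample
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")
model.eval()

texts = ["The capital of France is Paris.", "def reverse(s): return s[::-1]"]
losses = []
for text in texts:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    losses.append(out.loss.item())
print("perplexity:", math.exp(sum(losses) / len(losses)))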
Production Deployment
Save and Upload
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load merged model
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")
# Upload to HuggingFace Hub
model.push_to_hub("username/my-merged-model")
tokenizer.push_to_hub("username/my-merged-model")
Quantize Merged Model
# Convert to GGUF with llama.cpp's conversion script (quantize further with llama.cpp if needed)
python convert.py ./merged-model --outtype f16 --outfile merged-model.gguf
# Quantize with GPTQ (quantize_gptq.py stands in for your AutoGPTQ-based script)
python quantize_gptq.py ./merged-model --bits 4 --group_size 128
Common Pitfalls
❌ Pitfall 1: Merging Incompatible Models
# Wrong: Different architectures
models:
  - model: meta-llama/Llama-2-7b # Llama architecture
  - model: mistralai/Mistral-7B # Mistral architecture
Fix: Only merge models with same architecture
❌ Pitfall 2: Over-weighting One Model
# Suboptimal: One model dominates
models:
  - model: model_a
    parameters:
      weight: 0.95 # Too high
  - model: model_b
    parameters:
      weight: 0.05 # Too low
Fix: Use more balanced weights (0.3-0.7 range)
❌ Pitfall 3: Not Evaluating
# Wrong: Merge and deploy without testing
mergekit-yaml config.yml ./merged-model
# Deploy immediately (risky!)
Fix: Always benchmark before deploying
Resources
- mergekit GitHub: https://github.com/arcee-ai/mergekit
- HuggingFace Tutorial: https://huggingface.co/blog/mlabonne/merge-models
- LazyMergekit: Automated merging notebook
- TIES Paper: https://arxiv.org/abs/2306.01708
- DARE Paper: https://arxiv.org/abs/2311.03099
See Also
- references/methods.md - Deep dive into merge algorithms
- references/examples.md - Real-world merge configurations
- references/evaluation.md - Benchmarking and testing strategies
Related Skills
speculative-decoding
This Claude Skill accelerates LLM inference using speculative decoding techniques like Medusa and lookahead decoding, achieving 1.5-3.6× speedups. It's designed for developers optimizing latency in real-time applications or deploying models on limited compute. The implementation covers draft models, tree-based attention, and parallel token generation strategies.
knowledge-distillation
This skill enables model compression via knowledge distillation, transferring capabilities from larger teacher models to smaller student models. It's ideal for deploying cost-efficient models while retaining performance or transferring GPT-4-level abilities to open-source alternatives. Key techniques include temperature scaling, soft targets, and logit distillation.
model-pruning
This skill compresses large language models using pruning techniques like Wanda and SparseGPT to reduce model size by 40-60% and accelerate inference 2-4× with minimal accuracy loss. It enables one-shot compression without retraining and supports various sparsity patterns including N:M for hardware acceleration. Use it when deploying models on constrained hardware or needing faster inference with maintained performance.
long-context
This skill enables extending transformer model context windows beyond their original limits using techniques like RoPE, YaRN, and position interpolation. Use it when processing long documents (32k-128k+ tokens) or adapting pre-trained models for longer sequences. It provides implementations for rotary embeddings, attention biases, and interpolation strategies to handle extended positional encodings efficiently.
