gguf-quantization


About

This skill enables GGUF format quantization for efficient model deployment using llama.cpp, allowing flexible 2-8 bit quantization without GPU requirements. It's ideal for running models on consumer hardware, Apple Silicon, or CPU-only environments. Key capabilities include Metal acceleration support and compatibility with local AI tools like LM Studio and Ollama.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/davila7/claude-code-templates

Git Clone (Alternative)
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/gguf-quantization

Copy and paste this command into Claude Code to install this skill.

Documentation

GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
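
To inspect what a GGUF file actually contains (architecture, tokenizer, quantization type, tensor list), the gguf Python package maintained in the llama.cpp repository provides a dump utility. A minimal sketch, assuming the gguf-dump entry point is available in your installed version; the file name is a placeholder:

# Inspect GGUF header metadata and tensor info
pip install gguf
gguf-dump model-q4_k_m.gguf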

When to use GGUF

Use GGUF when:

  • Deploying on consumer hardware (laptops, desktops)
  • Running on Apple Silicon (M1/M2/M3) with Metal acceleration
  • Running CPU-only inference with no GPU available
  • Choosing from flexible quantization levels (Q2_K to Q8_0)
  • Using local AI tools (LM Studio, Ollama, text-generation-webui)

Key advantages:

  • Universal hardware: CPU, Apple Silicon, NVIDIA, AMD support
  • No Python runtime: Pure C/C++ inference
  • Flexible quantization: 2-8 bit with various methods (K-quants)
  • Ecosystem support: LM Studio, Ollama, koboldcpp, and more
  • imatrix: Importance matrix for better low-bit quality

Consider these alternatives instead:

  • AWQ/GPTQ: Maximum accuracy with calibration on NVIDIA GPUs
  • HQQ: Fast calibration-free quantization for HuggingFace
  • bitsandbytes: Simple integration with transformers library
  • TensorRT-LLM: Production NVIDIA deployment with maximum speed

Quick start

Installation

# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
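
Note that recent llama.cpp versions build with CMake and may deprecate or remove the Makefile targets above. A rough equivalent under CMake (option names as of recent checkouts; adjust to your version):

# CPU build
cmake -B build
cmake --build build --config Release

# Build with CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Build with Metal (enabled by default on Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

# Binaries (llama-cli, llama-quantize, ...) end up under build/bin/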

Convert model to GGUF

# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16
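
If disk space for the intermediate FP16 file is tight, the converter can emit a quantized file directly; q8_0 is a commonly supported --outtype value, though the exact set of types depends on your llama.cpp version:

# Convert straight to 8-bit, skipping the separate FP16 step
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-q8_0.gguf \
    --outtype q8_0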

Quantize model

# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M

Run inference

# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"

Quantization types

K-quant methods (recommended)

Type     Bits  Size (7B)  Quality    Use Case
Q2_K     2.5   ~2.8 GB    Low        Extreme compression
Q3_K_S   3.0   ~3.0 GB    Low-Med    Memory constrained
Q3_K_M   3.3   ~3.3 GB    Medium     Balance
Q4_K_S   4.0   ~3.8 GB    Med-High   Good balance
Q4_K_M   4.5   ~4.1 GB    High       Recommended default
Q5_K_S   5.0   ~4.6 GB    High       Quality focused
Q5_K_M   5.5   ~4.8 GB    Very High  High quality
Q6_K     6.0   ~5.5 GB    Excellent  Near-original
Q8_0     8.0   ~7.2 GB    Best       Maximum quality

Legacy methods

Type  Description
Q4_0  4-bit, basic
Q4_1  4-bit with delta
Q5_0  5-bit, basic
Q5_1  5-bit with delta

Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
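
To see what a given quantization costs in practice, llama.cpp includes a perplexity tool (lower is better). A quick comparison sketch; the evaluation file path is a placeholder for any representative plain-text corpus such as WikiText-2:

# Compare perplexity of two quantizations on the same evaluation text
./llama-perplexity -m model-q8_0.gguf -f wiki.test.raw -ngl 35
./llama-perplexity -m model-q4_k_m.gguf -f wiki.test.raw -ngl 35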

Conversion workflows

Workflow 1: HuggingFace to GGUF

# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50

Workflow 2: With importance matrix (better quality)

# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35  # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M

Workflow 3: Multiple quantizations

#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done

Python usage

llama-cpp-python

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,          # Context window
    n_gpu_layers=35,     # GPU offload (0 for CPU only)
    n_threads=8          # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])

Chat completion

from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])

Streaming

from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)

Server mode

Start OpenAI-compatible server

# Start server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
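
Before wiring up a client library, the server can be smoke-tested from the shell; the route below is the OpenAI-compatible chat endpoint exposed by llama-server on the port configured above:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "local-model",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
    }'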

Use with OpenAI client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)

Hardware optimization

Apple Silicon (Metal)

# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"

# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,     # Offload all layers
    n_threads=1          # Metal handles parallelism
)

NVIDIA CUDA

# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
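
If VRAM is the constraint, watch usage while adjusting -ngl, and split layers across multiple GPUs where available (the split ratios below are illustrative):

# Watch VRAM usage while tuning -ngl
nvidia-smi

# Split tensors across two GPUs
./llama-cli -m model.gguf -ngl 35 --tensor-split 0.5,0.5 -p "Hello"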

CPU optimization

# Build with AVX2/AVX512
make clean && make

# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"

# Python CPU config
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,      # CPU only
    n_threads=8,         # Match physical cores
    n_batch=512          # Batch size for prompt processing
)

Integration with tools

Ollama

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"

LM Studio

  1. Place the GGUF file in ~/.cache/lm-studio/models/ (see the sketch after this list)
  2. Open LM Studio and select the model
  3. Configure context length and GPU offload
  4. Start inference
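
A rough shell equivalent of step 1; the directory layout varies by LM Studio version, but a publisher/model subfolder under the models directory is a common convention:

# Copy the quantized model where LM Studio can discover it
mkdir -p ~/.cache/lm-studio/models/local/my-model
cp model-q4_k_m.gguf ~/.cache/lm-studio/models/local/my-model/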

text-generation-webui

# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35

Best practices

  1. Use K-quants: Q4_K_M offers best quality/size balance
  2. Use imatrix: Always use importance matrix for Q4 and below
  3. GPU offload: Offload as many layers as VRAM allows
  4. Context length: Start with 4096, increase if needed
  5. Thread count: Match physical CPU cores, not logical
  6. Batch size: Increase n_batch for faster prompt processing (combined in the sketch below)
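
The runtime settings from items 3-6 combine into a single launch command; the core-count detection and flag values below are illustrative and system-dependent:

# Physical core count: macOS value first, fall back to logical cores (nproc)
THREADS=$(sysctl -n hw.physicalcpu 2>/dev/null || nproc)

# -ngl: GPU layers to offload, -c: context length, -t: CPU threads,
# -b: batch size for prompt processing
./llama-cli -m model-q4_k_m.gguf -ngl 35 -c 4096 -t "$THREADS" -b 512 -p "Hello!"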

Common issues

Model loads slowly:

# Memory-mapped loading (mmap) is enabled by default; lock the model in RAM
# so it is not paged out between runs
./llama-cli -m model.gguf --mlock

Out of memory:

# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduce from 35

# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M

Poor quality at low bits:

# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M

Resources

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/optimization-gguf

Related Skills

quantizing-models-bitsandbytes

This skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, reducing memory usage by 50-75% with minimal accuracy loss for GPU-constrained environments. It supports multiple formats (INT8, NF4, FP4) and enables QLoRA training and 8-bit optimizers. Use it with HuggingFace Transformers when you need to fit larger models into limited memory or accelerate inference.

knowledge-distillation

This skill enables model compression via knowledge distillation, transferring capabilities from larger teacher models to smaller student models. It's ideal for deploying cost-efficient models while retaining performance or transferring GPT-4-level abilities to open-source alternatives. Key techniques include temperature scaling, soft targets, and logit distillation.

model-pruning

This skill compresses large language models using pruning techniques like Wanda and SparseGPT to reduce model size by 40-60% and accelerate inference 2-4× with minimal accuracy loss. It enables one-shot compression without retraining and supports various sparsity patterns including N:M for hardware acceleration. Use it when deploying models on constrained hardware or needing faster inference with maintained performance.

unsloth

This skill provides expert guidance for fast fine-tuning with Unsloth, offering 2-5x faster training and 50-80% memory reduction. It helps developers implement and debug LoRA/QLoRA optimizations for models like Llama and Mistral. Use it when working with Unsloth's APIs, features, or best practices for efficient model training.
