
gguf-quantization

davila7
Updated 16 days ago
529 views
Design · GGUF · Quantization · llama.cpp · CPU Inference · Apple Silicon · Model Compression · Optimization

About

This skill enables GGUF quantization for efficient model deployment on consumer hardware such as CPUs and Apple Silicon. It offers flexible 2-8 bit quantization options without requiring GPU acceleration. Use it when optimizing models for local inference tools or resource-constrained environments.
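
For context, GGUF quantization typically runs in two steps with llama.cpp's own tools: convert a HuggingFace checkpoint to a full-precision GGUF file, then quantize it to the target bit width. A minimal sketch, assuming a local llama.cpp checkout with its convert script and a built llama-quantize binary; all model paths here are placeholders:

# Sketch: HF checkpoint -> f16 GGUF -> 4-bit GGUF (Q4_K_M).
# Assumes llama.cpp is cloned and built locally; paths are placeholders.
import subprocess

# Step 1: convert the HuggingFace model directory to a full-precision GGUF file.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", "models/my-model",
     "--outfile", "my-model-f16.gguf", "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the f16 GGUF down to Q4_K_M, a common quality/size tradeoff.
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize",
     "my-model-f16.gguf", "my-model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)

Q4_K_M and Q5_K_M are common defaults for CPU and Apple Silicon inference; lower-bit types such as Q2_K shrink the file further at a larger accuracy cost.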

Quick Installation

Claude Code

Recommended
Primary
npx skills add davila7/claude-code-templates -a claude-code
Plugin Command (Alternative)
/plugin add https://github.com/davila7/claude-code-templates
Git Clone (Alternative)
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/gguf-quantization

Copy this command and paste it into Claude Code to install this skill

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/optimization-gguf
anthropic · anthropic-claude · claude · claude-code

Related Skills

quantizing-models-bitsandbytes

Other

This skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, achieving 50-75% memory reduction with minimal accuracy loss. It's ideal for running larger models on limited GPU memory or accelerating inference, supporting formats like INT8, NF4, and FP4. The skill integrates with HuggingFace Transformers and enables QLoRA training and 8-bit optimizers.

View Skill
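
For reference, loading a model in 4-bit NF4 through the HuggingFace Transformers integration mentioned above looks roughly like this (the model id is a placeholder):

# Sketch: load a model in 4-bit NF4 via bitsandbytes (model id is a placeholder).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16 for speed/stability
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)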

awq-quantization

Other

AWQ is a 4-bit weight quantization technique that uses activation patterns to preserve critical weights, enabling 3x faster inference with minimal accuracy loss. It's ideal for deploying large models (7B-70B) on limited GPU memory and is particularly effective for instruction-tuned and multimodal models. This skill integrates with vLLM and Marlin kernels for optimized deployment.

View Skill
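
A rough quantization sketch with the autoawq package, one common implementation of the AWQ method described above; the model id and config values are illustrative defaults, not this skill's confirmed settings:

# Sketch: 4-bit AWQ quantization with autoawq (ids/paths are placeholders).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs activation-aware calibration
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")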

hqq-quantization

Other

HQQ enables fast, calibration-free quantization of LLMs down to 4/3/2-bit precision without needing a dataset. It's ideal for rapid quantization workflows and deployment with vLLM or HuggingFace Transformers. Key advantages include significantly faster quantization than methods like GPTQ and support for fine-tuning quantized models.

View Skill
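
Because HQQ is calibration-free, quantization can happen directly at load time. A minimal sketch using the HqqConfig integration in HuggingFace Transformers (the model id is a placeholder):

# Sketch: calibration-free 4-bit HQQ quantization at load time (id is a placeholder).
import torch
from transformers import AutoModelForCausalLM, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)  # nbits can go as low as 2
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)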

llama-cpp

Other

The llama-cpp skill enables efficient LLM inference on CPU, Apple Silicon, and non-NVIDIA GPUs, making it ideal for edge deployment or when CUDA is unavailable. It supports GGUF quantization for reduced memory usage and offers significant speedups over PyTorch on CPU. Use this for Macs, AMD/Intel systems, or embedded devices, but choose TensorRT-LLM for NVIDIA hardware requiring maximum throughput.

View Skill
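
To close the loop, a minimal inference sketch with the llama-cpp-python bindings, running a quantized GGUF file such as the one produced above (the model path is a placeholder):

# Sketch: run a quantized GGUF model on CPU/Apple Silicon via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="my-model-Q4_K_M.gguf",  # quantized GGUF file (placeholder path)
    n_ctx=4096,                         # context window size
    n_threads=8,                        # CPU threads; tune to your machine
)
out = llm("Q: What does GGUF quantization trade off? A:", max_tokens=64)
print(out["choices"][0]["text"])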