hqq-quantization
About
HQQ enables fast, calibration-free quantization of LLMs down to 4/3/2-bit precision without requiring a dataset. It is well suited for fast quantization workflows and for deployment with vLLM or HuggingFace Transformers. Key advantages include significantly faster quantization than methods such as GPTQ, and support for fine-tuning quantized models.
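As a minimal sketch of such a workflow, HQQ quantization can be driven through the `HqqConfig` integration in HuggingFace Transformers; the model id and the exact `nbits`/`group_size` values below are illustrative assumptions, not prescribed by this skill:

```python
# Sketch: calibration-free 4-bit HQQ quantization via HuggingFace Transformers.
# Assumes the `transformers` and `hqq` packages are installed and a GPU is available;
# the model id is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative

# No calibration dataset is needed: HQQ derives its quantization
# parameters directly from the weights.
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

Because no calibration pass is involved, quantization completes in roughly the time it takes to load the weights, which is what makes HQQ attractive for fast iteration compared with dataset-dependent methods like GPTQ.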
Quick Installation
Claude Code
Recommended:
npx skills add davila7/claude-code-templates -a claude-code

Alternatives:
/plugin add https://github.com/davila7/claude-code-templates
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/hqq-quantization

Copy one of these commands and paste it into Claude Code to install this skill.
GitHub Repository
Related Skills
quantizing-models-bitsandbytes
Other: This skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, achieving 50-75% memory reduction with minimal accuracy loss. It's ideal for running larger models on limited GPU memory or accelerating inference, supporting formats like INT8, NF4, and FP4. The skill integrates with HuggingFace Transformers and enables QLoRA training and 8-bit optimizers.
gguf-quantization
Design: This skill enables GGUF quantization for efficient model deployment on consumer hardware like CPUs and Apple Silicon. It provides flexible 2-8 bit quantization options without requiring GPU acceleration. Use it when optimizing models for local inference tools or resource-constrained environments.
awq-quantization
Other: AWQ is a 4-bit weight quantization technique that uses activation patterns to preserve critical weights, enabling 3x faster inference with minimal accuracy loss. It's ideal for deploying large models (7B-70B) on limited GPU memory and is particularly effective for instruction-tuned and multimodal models. This skill integrates with vLLM and Marlin kernels for optimized deployment.
lambda-labs-gpu-cloud
Other: This Claude Skill provisions dedicated GPU cloud instances from Lambda Labs for ML training and inference. It's ideal for developers needing full SSH access, persistent storage, or large multi-node clusters with pre-installed stacks like PyTorch. Use it for long-running jobs where simple pricing and high-performance GPUs are required.
