gptq
About
GPTQ is a 4-bit post-training quantization technique for LLMs that delivers a 4x memory reduction and 3-4x faster inference with minimal accuracy loss. It is ideal for deploying large models on consumer GPUs and integrates with Transformers and PEFT for QLoRA fine-tuning. Use it when you need to run models with 70B+ parameters on limited hardware while preserving performance.
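A minimal sketch of GPTQ quantization through the HuggingFace Transformers integration; the model name and the "c4" calibration dataset are illustrative, and the optimum and auto-gptq packages are assumed to be installed.

```python
# Sketch: 4-bit GPTQ quantization via the Transformers integration.
# Assumes `pip install optimum auto-gptq`; model name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ calibrates on sample data to decide how to round each weight;
# "c4" is one of the built-in calibration dataset options.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                 # place layers on available GPUs
    quantization_config=gptq_config,   # quantize during loading
)

# The quantized model can be saved and reloaded like any other checkpoint.
model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")
```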
Quick Installation
Claude Code
Recommended:
npx skills add davila7/claude-code-templates -a claude-code

Alternative (plugin):
/plugin add https://github.com/davila7/claude-code-templates

Alternative (git clone):
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/gptq

Copy one of these commands and paste it into Claude Code to install this skill.
GitHub Repository
Related Skills
quantizing-models-bitsandbytes
Other
This skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, achieving 50-75% memory reduction with minimal accuracy loss. It's ideal for running larger models on limited GPU memory or accelerating inference, supporting formats like INT8, NF4, and FP4. The skill integrates with HuggingFace Transformers and enables QLoRA training and 8-bit optimizers.
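A minimal sketch of 4-bit NF4 loading with bitsandbytes through Transformers; the model name is illustrative.

```python
# Sketch: load a model in 4-bit NF4 with bitsandbytes via Transformers.
# Assumes `pip install bitsandbytes`; model name is illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used during matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    device_map="auto",
    quantization_config=bnb_config,
)
```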
gguf-quantization
Design
This skill enables GGUF quantization for efficient model deployment on consumer hardware like CPUs and Apple Silicon. It provides flexible 2-8 bit quantization options without requiring GPU acceleration. Use it when optimizing models for local inference tools or resource-constrained environments.
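A minimal sketch of CPU inference on a GGUF model using the llama-cpp-python bindings; the file path and Q4_K_M quantization level are illustrative, and the GGUF file is assumed to have been produced beforehand with llama.cpp's conversion and quantization tools.

```python
# Sketch: CPU inference on a GGUF-quantized model with llama-cpp-python.
# Assumes `pip install llama-cpp-python` and a pre-quantized GGUF file;
# the path and Q4_K_M level are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",
    n_ctx=2048,    # context window
    n_threads=8,   # CPU threads; no GPU required
)

out = llm("Q: What is GGUF? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```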
openrlhf-training
Design
OpenRLHF is a high-performance RLHF training framework for fine-tuning large language models (7B-70B+ parameters) using methods like PPO, DPO, and GRPO. It leverages Ray for distributed architecture and vLLM for accelerated inference, achieving speeds 2x faster than alternatives like DeepSpeedChat. Use this skill when you need efficient, distributed RLHF training with optimized GPU resource sharing and ZeRO-3 support. A launch sketch follows below.
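OpenRLHF is driven from the command line rather than a Python API; the sketch below launches its DPO trainer via DeepSpeed from Python. The module path and flags follow the project's README, but the exact flag set varies between releases, so treat them as assumptions; the model and dataset names are illustrative.

```python
# Sketch: launching OpenRLHF's DPO trainer from Python via subprocess.
# Module path and flags follow the OpenRLHF README at the time of writing
# and may differ between releases; model/dataset names are illustrative.
import subprocess

subprocess.run([
    "deepspeed", "--module", "openrlhf.cli.train_dpo",
    "--pretrain", "meta-llama/Meta-Llama-3-8B-Instruct",  # base model
    "--dataset", "Anthropic/hh-rlhf",                     # preference pairs
    "--save_path", "./ckpt/llama3-8b-dpo",
    "--zero_stage", "3",    # ZeRO-3 sharding, as the skill notes
    "--bf16",               # bfloat16 training
    "--max_epochs", "1",
], check=True)
```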
awq-quantization
Other
AWQ is a 4-bit weight quantization technique that uses activation patterns to preserve critical weights, enabling 3x faster inference with minimal accuracy loss. It's ideal for deploying large models (7B-70B) on limited GPU memory and is particularly effective for instruction-tuned and multimodal models. This skill integrates with vLLM and Marlin kernels for optimized deployment.
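A minimal sketch of serving a pre-quantized AWQ checkpoint with vLLM, which the skill names as its deployment path; the model name is illustrative.

```python
# Sketch: inference on a pre-quantized AWQ checkpoint with vLLM.
# Assumes `pip install vllm`; the model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # any AWQ-quantized checkpoint
    quantization="awq",
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain AWQ in one sentence."], params)
print(outputs[0].outputs[0].text)
```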
