grpo-rl-training
About
This skill provides expert guidance for implementing GRPO (Group Relative Policy Optimization) reinforcement learning fine-tuning with the TRL library. It is designed for training models on tasks that require structured outputs, verifiable reasoning, or objective correctness metrics, such as coding or math problems. Key features include production-ready workflows for custom reward functions and for enforcing specific output formats.
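As a minimal sketch of what a custom reward function for format enforcement might look like, the example below scores completions against a hypothetical `<think>...</think><answer>...</answer>` layout and then normalizes the rewards within one group, which is the "group relative" idea behind GRPO. The tag names, the function names, and the `eps` value are illustrative assumptions, not taken from the skill itself. The example also assumes completions arrive as plain strings.

```python
import re
import statistics

# Hypothetical target format (an assumption for this sketch):
# reasoning inside <think> tags, followed by the final answer inside <answer> tags.
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)


def format_reward(completions, **kwargs):
    """Return 1.0 for each completion that matches the expected layout, else 0.0."""
    return [1.0 if FORMAT_RE.match(c.strip()) else 0.0 for c in completions]


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Each reward is centered on the group mean and scaled by the group's
    standard deviation; eps avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group: several completions sampled for the same prompt.
group = [
    "<think>2 + 2 is 4</think><answer>4</answer>",
    "The answer is 4.",
    "<think>guess</think> <answer>5</answer>",
    "<answer>4</answer>",
]

rewards = format_reward(group)
advantages = group_relative_advantages(rewards)

for text, r, a in zip(group, rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}  {text!r}")
```

A function with this shape (a list of completions in, a list of floats out) is the kind of callable that TRL's `GRPOTrainer` accepts through its `reward_funcs` argument. The trainer computes the group normalization itself, so `group_relative_advantages` is shown here only to illustrate the idea. Note that this format reward checks layout only: the third completion is rewarded even though its answer is wrong, so in practice it would be combined with a separate correctness reward.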
Quick Install
Claude Code
Recommended: npx skills add davila7/claude-code-templates -a claude-code
/plugin add https://github.com/davila7/claude-code-templates
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/grpo-rl-training
Copy this command and paste it into Claude Code to install this skill.
GitHub Repository
Related Skills
openrlhf-training
Design
OpenRLHF is a high-performance RLHF training framework for fine-tuning large language models (7B-70B+ parameters) using methods like PPO, DPO, and GRPO. It leverages Ray for distributed architecture and vLLM for accelerated inference, achieving speeds 2x faster than alternatives like DeepSpeedChat. Use this skill when you need efficient, distributed RLHF training with optimized GPU resource sharing and ZeRO-3 support.
fine-tuning-with-trl
Other
This skill enables fine-tuning of LLMs with TRL's training methods, including SFT, DPO, and PPO, for RLHF and preference alignment. It is designed for aligning models with human feedback and works with HuggingFace Transformers. Use it when you need to implement RLHF, optimize against a reward signal, or train from human preferences.
gptq
Other
GPTQ is a 4-bit post-training quantization technique for LLMs that enables 4x memory reduction and 3-4x faster inference with minimal accuracy loss. It is well suited to deploying large models on consumer GPUs and integrates with transformers and PEFT for QLoRA-style fine-tuning on quantized weights. Use it when you need to fit 70B+ parameter models on limited hardware while maintaining performance.
instructor
Testing
Instructor is a structured output library that extracts validated data from LLM responses using Pydantic schemas. It automatically retries failed extractions and provides type-safe JSON parsing with streaming support. Use it when you need reliable, validated data extraction from LLMs like OpenAI or Anthropic.
