
rwkv-architecture

by davila7
Category: Other. Tags: RWKV, Model Architecture, RNN, Transformer Hybrid, Linear Complexity, Infinite Context, Efficient Inference, Linux Foundation, Alternative Architecture

About

This skill covers the RWKV model architecture, which combines Transformer-style parallel training with RNN-style efficient, linear-time inference. RWKV handles arbitrarily long context without a KV cache and is used in production by Microsoft and NVIDIA. Use it when you need a highly efficient, state-of-the-art alternative to standard Transformers for sequence modeling tasks.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/davila7/claude-code-templates
Git Clone (Alternative)
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/rwkv-architecture

Copy and paste one of these commands into Claude Code to install this skill.

Documentation

RWKV - Receptance Weighted Key Value

Quick start

RWKV (pronounced "RwaKuv") combines Transformer-style parallel training with RNN-style efficient inference.

Installation:

# Install PyTorch
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade

# Install RWKV
pip install rwkv

Basic usage (GPT mode + RNN mode):

import os

# These environment variables must be set before importing rwkv.model to take effect
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1'  # Compile the custom CUDA kernel for speed

from rwkv.model import RWKV

# Load model
model = RWKV(
    model='/path/to/RWKV-4-Pile-1B5-20220903-8040',
    strategy='cuda fp16'
)

# GPT mode (parallel processing)
out, state = model.forward([187, 510, 1563, 310, 247], None)
print(out.detach().cpu().numpy())  # Logits

# RNN mode (sequential processing, same result)
out, state = model.forward([187, 510], None)  # First 2 tokens
out, state = model.forward([1563], state)      # Next token
out, state = model.forward([310, 247], state)  # Last tokens
print(out.detach().cpu().numpy())  # Same logits as above!

Common workflows

Workflow 1: Text generation (streaming)

Efficient token-by-token generation:

from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model='RWKV-4-Pile-14B-20230313-ctx8192-test1050', strategy='cuda fp16')
pipeline = PIPELINE(model, "20B_tokenizer.json")

# Feed the whole prompt at once (parallel, GPT-style) and keep the state
prompt = "The future of AI is"
out, state = pipeline.model.forward(pipeline.encode(prompt), None)

# Generate token by token (sequential, RNN-style)
for _ in range(100):
    token = pipeline.sample_logits(out)
    print(pipeline.decode([token]), end='', flush=True)
    out, state = pipeline.model.forward([token], state)

Key advantage: Constant memory per token (no growing KV cache)
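
A quick sanity check of this claim, as a sketch: assuming the state returned by model.forward is a list of tensors (as in the rwkv pip package), it has the same fixed size no matter how many tokens have been processed.

# Sketch: the recurrent state does not grow with the number of tokens processed
def state_elements(state):
    return sum(t.numel() for t in state)  # total elements across all state tensors

_, s_short = model.forward(pipeline.encode("short prompt"), None)
_, s_long = model.forward(pipeline.encode("a much longer prompt " * 100), None)
print(state_elements(s_short) == state_elements(s_long))  # True: same fixed size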

Workflow 2: Long context processing (infinite context)

Process million-token sequences:

model = RWKV(model='RWKV-4-Pile-14B', strategy='cuda fp16')

# Process very long document
state = None
long_document = load_document()  # placeholder: a list of ~1M token IDs (see sketch below)

# Stream through entire document
for chunk in chunks(long_document, chunk_size=1024):
    out, state = model.forward(chunk, state)

# The state now summarizes the entire 1M-token document
# Memory for the state: O(1) in sequence length (constant, not O(n)!)
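
load_document() and chunks() above are placeholders, not functions from the rwkv package; a minimal sketch of what they might look like, assuming the document is already tokenized into whitespace-separated integer IDs:

def load_document(path="document_tokens.txt"):
    # Read a pre-tokenized document as a flat list of token IDs
    with open(path) as f:
        return [int(t) for t in f.read().split()]

def chunks(tokens, chunk_size=1024):
    # Yield consecutive fixed-size slices of the token list
    for i in range(0, len(tokens), chunk_size):
        yield tokens[i:i + chunk_size]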

Workflow 3: Fine-tuning RWKV

Fine-tuning uses the training script from the BlinkDL/RWKV-LM repository (the pip rwkv package is inference-only). A sketch of a typical invocation; flag values are illustrative and should be checked against that repo's README:

# Clone the training repo and run its train.py
git clone https://github.com/BlinkDL/RWKV-LM
cd RWKV-LM/RWKV-v4neo

# Fine-tune from a base checkpoint (flags illustrative)
python train.py \
  --load_model /path/to/base-model.pth \
  --data_file /path/to/train_data --data_type binidx \
  --vocab_size 50277 --ctx_len 1024 \
  --n_layer 24 --n_embd 1024 \
  --micro_bsz 8 --epoch_count 1 \
  --accelerator gpu --devices 8 \
  --precision bf16 --strategy deepspeed_stage_2

Workflow 4: RWKV vs Transformer comparison

Memory comparison (1M token sequence):

# Transformer (GPT)
# Attention compute: O(n²); KV-cache memory grows O(n) with sequence length
# KV cache: 1M × hidden_dim × n_layers × 2 (keys + values)
# Example: 1M × 4096 × 24 × 2 elements ≈ 400GB at fp16 (impractical!)

# RWKV
# Memory: O(1) — a fixed-size recurrent state, independent of sequence length
# State: a few small vectors per layer, roughly hidden_dim × n_layers elements (a few MB at most)
# Five to six orders of magnitude less memory for this sequence

Speed comparison (inference):

# Transformer (with KV cache): O(n) per token (quadratic overall)
# 1st token: attends to 1 position
# 2nd token: attends to 2 positions
# ...
# 1000th token: attends to 1000 positions

# RWKV: O(1) per token (linear overall)
# Every token: one constant-cost state update
# 1000th token: same cost as the 1st
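
A back-of-the-envelope check of these numbers (illustrative sizes only; real KV-cache and state layouts vary with implementation and precision):

seq_len, hidden, layers = 1_000_000, 4096, 24

# Transformer KV cache: keys + values for every past token, fp16 (2 bytes each)
kv_cache_bytes = seq_len * hidden * layers * 2 * 2

# RWKV recurrent state: roughly 5 small vectors per layer (RWKV-4), fp32 (4 bytes each)
rwkv_state_bytes = hidden * layers * 5 * 4

print(f"KV cache:   {kv_cache_bytes / 1e9:,.0f} GB")             # ~393 GB
print(f"RWKV state: {rwkv_state_bytes / 1e6:,.1f} MB")            # ~2 MB
print(f"Ratio:      {kv_cache_bytes / rwkv_state_bytes:,.0f}x")   # ~200,000x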

When to use vs alternatives

Use RWKV when:

  • Need very long context (100K+ tokens)
  • Want constant memory usage
  • Building streaming applications
  • Need RNN efficiency with Transformer performance
  • Memory-constrained deployment

Key advantages:

  • Linear time: O(n) vs O(n²) for Transformers
  • No KV cache: Constant memory per token
  • Infinite context: No fixed window limit
  • Parallelizable training: Like GPT
  • Sequential inference: Like RNN

Use alternatives instead:

  • Transformers: Need absolute best performance, have compute
  • Mamba: Want state-space models
  • RetNet: Need retention mechanism
  • Hyena: Want convolution-based approach

Common issues

Issue: Out of memory during training

Enable gradient checkpointing and a more aggressive DeepSpeed ZeRO stage via the RWKV-LM training flags (names as used in that repo):

python train.py ... --grad_cp 1 --precision bf16 --strategy deepspeed_stage_3  # gradient checkpointing + full ZeRO-3

Issue: Slow inference

Enable the custom CUDA kernel (this must be set before importing rwkv.model and needs a working CUDA toolchain plus ninja):

os.environ["RWKV_CUDA_ON"] = '1'

Issue: Model not loading

Check model path and strategy:

model = RWKV(
    model='/absolute/path/to/model.pth',
    strategy='cuda fp16'  # Or 'cpu fp32' for CPU
)

Issue: State management in RNN mode

Always pass state between forward calls:

# WRONG: State lost
out1, _ = model.forward(tokens1, None)
out2, _ = model.forward(tokens2, None)  # No context from tokens1!

# CORRECT: State preserved
out1, state = model.forward(tokens1, None)
out2, state = model.forward(tokens2, state)  # Has context from tokens1
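
If you branch several continuations from the same prefix, copy the state before reusing it; the rwkv pip package may update the state list in place on the next call. A sketch (prefix_tokens and candidate_tokens are placeholder token-ID lists):

import copy

# Run the shared prefix once and keep its state
out, prefix_state = model.forward(prefix_tokens, None)

# Each branch gets its own independent copy of the state
for next_token in candidate_tokens:
    branch_state = copy.deepcopy(prefix_state)
    branch_out, branch_state = model.forward([next_token], branch_state)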

Advanced topics

Time-mixing and channel-mixing: See references/architecture-details.md for WKV operation, time-decay mechanism, and receptance gates.
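
For intuition before diving into the reference, here is a simplified sketch of the RWKV-4 WKV recurrence (per channel, with naive exponentials; the real kernels work in log space for numerical stability and run as custom CUDA):

import torch

def simple_wkv(k, v, w, u):
    # k, v: (T, C) key/value sequences; w: per-channel time decay (> 0); u: per-channel bonus
    T, C = k.shape
    num = torch.zeros(C)   # running decayed sum of e^k * v
    den = torch.zeros(C)   # running decayed sum of e^k
    out = torch.zeros(T, C)
    for t in range(T):
        bonus = torch.exp(u + k[t])                  # current token gets extra weight e^(u + k_t)
        out[t] = (num + bonus * v[t]) / (den + bonus)
        num = torch.exp(-w) * num + torch.exp(k[t]) * v[t]  # decay the past, add the present
        den = torch.exp(-w) * den + torch.exp(k[t])
    return out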

State management: See references/state-management.md for att_x_prev, att_kv, ffn_x_prev states, and numerical stability considerations.

RWKV-7 improvements: See references/rwkv7.md for latest architectural improvements (March 2025) and multimodal capabilities.

Hardware requirements

  • GPU: NVIDIA (CUDA 11.6+) or CPU
  • VRAM (FP16):
    • 169M model: 1GB
    • 430M model: 2GB
    • 1.5B model: 4GB
    • 3B model: 8GB
    • 7B model: 16GB
    • 14B model: 32GB
  • Inference: O(1) memory per token
  • Training: Parallelizable like GPT

Performance (vs Transformers):

  • Speed: Similar training, faster inference
  • Memory: orders of magnitude less for long sequences (no KV cache)
  • Scaling: Linear vs quadratic

Resources

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/model-architecture-rwkv
Topics: anthropic, anthropic-claude, claude, claude-code

Related Skills

quantizing-models-bitsandbytes

Other

This skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, reducing memory usage by 50-75% with minimal accuracy loss for GPU-constrained environments. It supports multiple formats (INT8, NF4, FP4) and enables QLoRA training and 8-bit optimizers. Use it with HuggingFace Transformers when you need to fit larger models into limited memory or accelerate inference.

View skill

nanogpt

Development

nanoGPT is a minimal, educational GPT implementation in ~300 lines of clean code that reproduces GPT-2 (124M) for learning transformer architecture from scratch. Use it to train models on datasets like Shakespeare (CPU) or OpenWebText (multi-GPU) with hackable, understandable code. It's ideal for developers wanting to deeply understand GPT internals through practical experimentation.

View skill

mamba-architecture

Design

Mamba is a state-space model architecture offering linear O(n) complexity for sequence processing, making it significantly faster than Transformer models and capable of handling million-token contexts without a KV cache. It features a selective, hardware-aware SSM design with available model sizes from 130M to 2.8B parameters. Use this skill for efficient, long-context inference as an alternative to standard Transformers.

View skill

implementing-llms-litgpt

Other

This Claude Skill implements and trains LLMs using Lightning AI's LitGPT framework, providing 20+ clean, single-file model implementations like Llama and Mistral. It's ideal for developers needing educational architecture understanding or production fine-tuning with LoRA/QLoRA. The skill offers minimal abstraction layers for straightforward model loading, training, and text generation workflows.

View skill