rwkv-architecture
About
RWKV is a hybrid architecture that combines Transformer-like parallel training with RNN-like efficient inference, providing linear time complexity and infinite context without KV caching. It's ideal for applications requiring long-sequence processing with production-ready efficiency. The models scale up to 14B parameters and are used in major platforms like Windows and NVIDIA NeMo.
Quick Install
Claude Code
Recommended: copy and paste one of these commands in Claude Code to install this skill
/plugin add https://github.com/zechenzhangAGI/AI-research-SKILLs
git clone https://github.com/zechenzhangAGI/AI-research-SKILLs.git ~/.claude/skills/rwkv-architecture
Documentation
RWKV - Receptance Weighted Key Value
Quick start
RWKV (pronounced "RwaKuv") combines Transformer-style parallel training with RNN-style efficient inference.
Installation:
# Install PyTorch
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade
# Install RWKV
pip install rwkv
Basic usage (GPT mode + RNN mode):
import os
from rwkv.model import RWKV
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1' # Use CUDA kernel for speed
# Load model
model = RWKV(
    model='/path/to/RWKV-4-Pile-1B5-20220903-8040',
    strategy='cuda fp16'
)
# GPT mode (parallel processing)
out, state = model.forward([187, 510, 1563, 310, 247], None)
print(out.detach().cpu().numpy()) # Logits
# RNN mode (sequential processing, same result)
out, state = model.forward([187, 510], None) # First 2 tokens
out, state = model.forward([1563], state) # Next token
out, state = model.forward([310, 247], state) # Last tokens
print(out.detach().cpu().numpy()) # Same logits as above!
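To confirm that the chunked RNN-style passes reproduce the single parallel pass, you can compare the two outputs directly; a minimal sketch, assuming the model loaded above (FP16 may introduce tiny numerical differences, hence the tolerance):
import numpy as np
tokens = [187, 510, 1563, 310, 247]
# One parallel (GPT-mode) pass over all tokens
out_parallel, _ = model.forward(tokens, None)
# Sequential (RNN-mode) passes over the same tokens
state = None
for t in tokens:
    out_sequential, state = model.forward([t], state)
# The final logits should match up to floating-point tolerance
print(np.allclose(
    out_parallel.detach().cpu().numpy(),
    out_sequential.detach().cpu().numpy(),
    atol=1e-2
))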
Common workflows
Workflow 1: Text generation (streaming)
Efficient token-by-token generation:
from rwkv.model import RWKV
from rwkv.utils import PIPELINE
model = RWKV(model='RWKV-4-Pile-14B-20230313-ctx8192-test1050', strategy='cuda fp16')
pipeline = PIPELINE(model, "20B_tokenizer.json")
# Initial prompt
prompt = "The future of AI is"
state = None
# Process the prompt in one parallel pass
out, state = pipeline.model.forward(pipeline.encode(prompt), state)
# Generate token by token
for _ in range(100):
    token = pipeline.sample_logits(out)
    print(pipeline.decode([token]), end='', flush=True)
    out, state = pipeline.model.forward([token], state)
Key advantage: Constant memory per token (no growing KV cache)
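One way to see this concretely is to inspect the returned state, which in the pip rwkv package is a list of fixed-size tensors; a sketch, assuming the model and pipeline from above:
def state_elements(state):
    # Total number of values held in the recurrent state
    return sum(t.numel() for t in state)
out, state = pipeline.model.forward(pipeline.encode("short prompt"), None)
print("state size after the first prompt:", state_elements(state))
out, state = pipeline.model.forward(pipeline.encode("a much longer continuation " * 50), state)
print("state size after many more tokens:", state_elements(state))  # unchanged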
Workflow 2: Long context processing (infinite context)
Process million-token sequences:
model = RWKV(model='RWKV-4-Pile-14B', strategy='cuda fp16')
def chunks(tokens, chunk_size=1024):
    # Split a long token list into fixed-size pieces
    for i in range(0, len(tokens), chunk_size):
        yield tokens[i:i + chunk_size]
# Process a very long document (load_document() is a placeholder returning token IDs)
state = None
long_document = load_document()  # e.g., 1M tokens
# Stream through the entire document
for chunk in chunks(long_document, chunk_size=1024):
    out, state = model.forward(chunk, state)
# The state now carries information from the entire 1M-token document
# Memory usage: O(1) (constant, not O(n)!)
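A useful consequence of the compact state is that it can be copied and reused, e.g. to branch several continuations from the same processed prefix without re-reading it; a sketch assuming the streamed state from above and a PIPELINE tokenizer as in Workflow 1:
import copy
from rwkv.utils import PIPELINE
pipeline = PIPELINE(model, "20B_tokenizer.json")
# Snapshot the state built from the long document
saved_state = copy.deepcopy(state)
# Branch two different continuations from the same prefix,
# without re-processing the 1M-token document
out_a, state_a = model.forward(pipeline.encode("\nSummary:"), copy.deepcopy(saved_state))
out_b, state_b = model.forward(pipeline.encode("\nKey findings:"), copy.deepcopy(saved_state))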
Workflow 3: Fine-tuning RWKV
Standard fine-tuning workflow:
# Training script (fine-tuning uses the RWKV-LM repository's training code;
# the pip `rwkv` package is inference-only, so there is no rwkv.trainer module)
import pytorch_lightning as pl
from src.model import RWKV  # trainable model class inside the RWKV-LM repo
# Configure model (hyperparameters of the checkpoint being fine-tuned)
config = {
    'n_layer': 24,
    'n_embd': 1024,
    'vocab_size': 50277,
    'ctx_len': 1024
}
# Setup trainer
trainer = pl.Trainer(
    accelerator='gpu',
    devices=8,
    precision='bf16',
    strategy='deepspeed_stage_2',
    max_epochs=1
)
# Train (in RWKV-LM the model takes an args namespace and the dataloader
# comes from its binidx dataset code; shown here in simplified form)
model = RWKV(config)
trainer.fit(model, train_dataloader)
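After training, the saved weights (a standard .pth checkpoint) can be loaded back with the inference package; a minimal sketch, with a hypothetical checkpoint path:
from rwkv.model import RWKV
# Hypothetical path; point this at the checkpoint your training run wrote
model = RWKV(model='/path/to/out/rwkv-final', strategy='cuda fp16')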
Workflow 4: RWKV vs Transformer comparison
Memory comparison (1M token sequence):
# Transformer (GPT)
# Memory: O(n²) for full attention; O(n) KV cache during generation
# KV cache: n_tokens × hidden_dim × n_layers × 2 (keys + values)
# Example: 1,000,000 × 4096 × 24 × 2 values ≈ 200 billion values ≈ 400GB at FP16 (impractical!)
# RWKV
# Memory: O(1) per token
# State: roughly hidden_dim × n_layers = 4096 × 24 ≈ 100K values ≈ 400KB at FP32
# ~1,000,000× smaller!
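The figures above follow from straightforward arithmetic; a sketch of the calculation under the same assumptions (FP16 KV-cache entries, FP32 recurrent state):
n_tokens, hidden_dim, n_layers = 1_000_000, 4096, 24
# Transformer KV cache: keys + values for every token, layer, and channel
kv_cache_bytes = n_tokens * hidden_dim * n_layers * 2 * 2  # ×2 for K and V, ×2 bytes (FP16)
print(f"KV cache: {kv_cache_bytes / 1e9:.0f} GB")           # ~393 GB
# RWKV recurrent state: roughly hidden_dim values per layer, independent of sequence length
state_bytes = hidden_dim * n_layers * 4                      # ×4 bytes (FP32)
print(f"RWKV state: {state_bytes / 1e3:.0f} KB")             # ~393 KB
print(f"Ratio: ~{kv_cache_bytes // state_bytes:,}x")         # ~1,000,000x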
Speed comparison (inference):
# Transformer: O(n) per token with a KV cache (quadratic over the sequence)
# First token: attends to 1 position
# Second token: attends to 2 positions
# ...
# 1000th token: attends to 1000 positions
# RWKV: O(1) per token (linear over the sequence)
# Every token: constant amount of work
# 1000th token: same cost as the first!
When to use vs alternatives
Use RWKV when:
- Need very long context (100K+ tokens)
- Want constant memory usage
- Building streaming applications
- Need RNN efficiency with Transformer performance
- Memory-constrained deployment
Key advantages:
- Linear time: O(n) vs O(n²) for Transformers
- No KV cache: Constant memory per token
- Infinite context: No fixed window limit
- Parallelizable training: Like GPT
- Sequential inference: Like RNN
Use alternatives instead:
- Transformers: Need absolute best performance, have compute
- Mamba: Want state-space models
- RetNet: Need retention mechanism
- Hyena: Want convolution-based approach
Common issues
Issue: Out of memory during training
Use gradient checkpointing and DeepSpeed:
trainer = pl.Trainer(
    strategy='deepspeed_stage_3',  # Full ZeRO-3
    precision='bf16'
)
# In RWKV-LM's train.py, also enable gradient checkpointing (--grad_cp 1)
Issue: Slow inference
Enable CUDA kernel:
os.environ["RWKV_CUDA_ON"] = '1'
Issue: Model not loading
Check model path and strategy:
model = RWKV(
    model='/absolute/path/to/model.pth',
    strategy='cuda fp16'  # Or 'cpu fp32' for CPU
)
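If the model does not fit on one GPU, the strategy string also covers quantization and layer offloading; a few illustrative values (the exact syntax is documented in the rwkv package / ChatRWKV README, so treat these as examples to verify):
# 'cpu fp32'                     # CPU only
# 'cuda fp16'                    # whole model on one GPU
# 'cuda fp16i8'                  # 8-bit quantized weights on GPU
# 'cuda fp16 *10 -> cpu fp32'    # first 10 layers on GPU, remaining layers on CPU
model = RWKV(model='/absolute/path/to/model.pth', strategy='cuda fp16 *10 -> cpu fp32')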
Issue: State management in RNN mode
Always pass state between forward calls:
# WRONG: State lost
out1, _ = model.forward(tokens1, None)
out2, _ = model.forward(tokens2, None) # No context from tokens1!
# CORRECT: State preserved
out1, state = model.forward(tokens1, None)
out2, state = model.forward(tokens2, state) # Has context from tokens1
Advanced topics
Time-mixing and channel-mixing: See references/architecture-details.md for WKV operation, time-decay mechanism, and receptance gates.
State management: See references/state-management.md for att_x_prev, att_kv, ffn_x_prev states, and numerical stability considerations.
RWKV-7 improvements: See references/rwkv7.md for latest architectural improvements (March 2025) and multimodal capabilities.
Hardware requirements
- GPU: NVIDIA (CUDA 11.6+) or CPU
- VRAM (FP16, roughly 2 bytes per parameter plus overhead; see the sketch after this list):
- 169M model: 1GB
- 430M model: 2GB
- 1.5B model: 4GB
- 3B model: 8GB
- 7B model: 16GB
- 14B model: 32GB
- Inference: O(1) memory per token
- Training: Parallelizable like GPT
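These figures roughly track parameter count at 2 bytes per FP16 weight plus runtime overhead; a back-of-the-envelope sketch:
def fp16_weight_gb(params_billion):
    # 2 bytes per FP16 parameter; actual VRAM adds activations, state, and CUDA overhead
    return params_billion * 1e9 * 2 / 1e9
for size in [0.169, 0.43, 1.5, 3, 7, 14]:
    print(f"{size}B params -> ~{fp16_weight_gb(size):.1f} GB of weights (FP16)")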
Performance (vs Transformers):
- Speed: Similar training, faster inference
- Memory: 1000× less for long sequences
- Scaling: Linear vs quadratic
Resources
- Paper (RWKV): https://arxiv.org/abs/2305.13048 (May 2023)
- Paper (RWKV-7): https://arxiv.org/abs/2503.14456 (March 2025)
- GitHub: https://github.com/BlinkDL/RWKV-LM ⭐ 12,000+
- Docs: https://wiki.rwkv.com/
- Models: https://huggingface.co/BlinkDL
- Linux Foundation AI: Official project
- Production: Microsoft Windows, Office integration, NeMo support
Related Skills
sglang
SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
evaluating-llms-harness
This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.
llamaguard
LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.
langchain
LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
