transformer-lens-interpretability
About
This skill helps developers perform mechanistic interpretability research on transformer models using the TransformerLens library. It enables inspecting and manipulating model internals through HookPoints and activation caching for tasks like reverse-engineering algorithms and activation patching experiments. Use it when you need to analyze attention patterns or conduct circuit analysis on GPT-style language models.
Quick Install
Claude Code
Recommended: /plugin add https://github.com/davila7/claude-code-templates
Alternative (manual): git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/transformer-lens-interpretability
Copy and paste one of these commands in Claude Code to install this skill.
Documentation
TransformerLens: Mechanistic Interpretability for Transformers
TransformerLens is the de facto standard library for mechanistic interpretability research on GPT-style language models. Created by Neel Nanda and maintained by Bryce Meyer, it provides clean interfaces to inspect and manipulate model internals via HookPoints on every activation.
GitHub: TransformerLensOrg/TransformerLens (2,900+ stars)
When to Use TransformerLens
Use TransformerLens when you need to:
- Reverse-engineer algorithms learned during training
- Perform activation patching / causal tracing experiments
- Study attention patterns and information flow
- Analyze circuits (e.g., induction heads, IOI circuit)
- Cache and inspect intermediate activations
- Apply direct logit attribution
Consider alternatives when:
- You need to work with non-transformer architectures → Use nnsight or pyvene
- You want to train/analyze Sparse Autoencoders → Use SAELens
- You need remote execution on massive models → Use nnsight with NDIF
- You want higher-level causal intervention abstractions → Use pyvene
Installation
pip install transformer-lens
For development version:
pip install git+https://github.com/TransformerLensOrg/TransformerLens
Core Concepts
HookedTransformer
The main class that wraps transformer models with HookPoints on every activation:
from transformer_lens import HookedTransformer
# Load a model
model = HookedTransformer.from_pretrained("gpt2-small")
# For gated models (LLaMA, Mistral)
import os
os.environ["HF_TOKEN"] = "your_token"
model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-hf")
Supported Models (50+)
| Family | Models |
|---|---|
| GPT-2 | gpt2, gpt2-medium, gpt2-large, gpt2-xl |
| LLaMA | llama-7b, llama-13b, llama-2-7b, llama-2-13b |
| EleutherAI | pythia-70m to pythia-12b, gpt-neo, gpt-j-6b |
| Mistral | mistral-7b, mixtral-8x7b |
| Others | phi, qwen, opt, gemma |
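To list the supported model names programmatically, the loader module exposes a registry (a sketch; the exact module path may differ between TransformerLens versions):
# List the official model names TransformerLens knows about
import transformer_lens.loading_from_pretrained as loading
print(len(loading.OFFICIAL_MODEL_NAMES))    # total number of supported checkpoints
print(loading.OFFICIAL_MODEL_NAMES[:5])     # a few example names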
Activation Caching
Run the model and cache all intermediate activations:
# Get all activations
tokens = model.to_tokens("The Eiffel Tower is in")
logits, cache = model.run_with_cache(tokens)
# Access specific activations
residual = cache["resid_post", 5] # Layer 5 residual stream
attn_pattern = cache["pattern", 3] # Layer 3 attention pattern
mlp_out = cache["mlp_out", 7] # Layer 7 MLP output
# Filter which activations to cache (saves memory)
logits, cache = model.run_with_cache(
    tokens,
    names_filter=lambda name: "resid_post" in name
)
ActivationCache Keys
| Key Pattern | Shape | Description |
|---|---|---|
| resid_pre, layer | [batch, pos, d_model] | Residual before attention |
| resid_mid, layer | [batch, pos, d_model] | Residual after attention |
| resid_post, layer | [batch, pos, d_model] | Residual after MLP |
| attn_out, layer | [batch, pos, d_model] | Attention output |
| mlp_out, layer | [batch, pos, d_model] | MLP output |
| pattern, layer | [batch, head, q_pos, k_pos] | Attention pattern (post-softmax) |
| q, layer | [batch, pos, head, d_head] | Query vectors |
| k, layer | [batch, pos, head, d_head] | Key vectors |
| v, layer | [batch, pos, head, d_head] | Value vectors |
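The short keys above are shorthand for full hook-point names; transformer_lens.utils.get_act_name does the translation:
from transformer_lens import utils
# The short (key, layer) pairs resolve to full hook-point names:
print(utils.get_act_name("resid_post", 5))  # "blocks.5.hook_resid_post"
print(utils.get_act_name("pattern", 3))     # "blocks.3.attn.hook_pattern"
# cache["resid_post", 5] is shorthand for cache[utils.get_act_name("resid_post", 5)]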
Workflow 1: Activation Patching (Causal Tracing)
Identify which activations causally affect model output by patching clean activations into corrupted runs.
Step-by-Step
from transformer_lens import HookedTransformer, patching
import torch
model = HookedTransformer.from_pretrained("gpt2-small")
# 1. Define clean and corrupted prompts (they must tokenize to the same length so positions align)
clean_prompt = "The Eiffel Tower is in the city of"
corrupted_prompt = "The Colosseum is in the city of"
clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)
# 2. Get clean activations
_, clean_cache = model.run_with_cache(clean_tokens)
# 3. Define metric (e.g., logit difference)
paris_token = model.to_single_token(" Paris")
rome_token = model.to_single_token(" Rome")
def metric(logits):
    return logits[0, -1, paris_token] - logits[0, -1, rome_token]
# 4. Patch each position and layer
results = torch.zeros(model.cfg.n_layers, clean_tokens.shape[1])
for layer in range(model.cfg.n_layers):
    for pos in range(clean_tokens.shape[1]):
        def patch_hook(activation, hook):
            activation[0, pos] = clean_cache[hook.name][0, pos]
            return activation
        patched_logits = model.run_with_hooks(
            corrupted_tokens,
            fwd_hooks=[(f"blocks.{layer}.hook_resid_post", patch_hook)]
        )
        results[layer, pos] = metric(patched_logits)
# 5. Visualize results (layer x position heatmap)
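A minimal plotting sketch for step 5, using matplotlib (not a TransformerLens dependency) and the results tensor from step 4:
import matplotlib.pyplot as plt
str_tokens = model.to_str_tokens(clean_tokens)            # x-axis labels
plt.imshow(results.detach().cpu().numpy(), cmap="RdBu", aspect="auto")
plt.xticks(range(len(str_tokens)), str_tokens, rotation=45)
plt.xlabel("Position patched")
plt.ylabel("Layer patched")
plt.colorbar(label="Logit difference (Paris - Rome)")
plt.show()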
Checklist
- Define clean and corrupted inputs that differ minimally
- Choose metric that captures behavior difference
- Cache clean activations
- Systematically patch each (layer, position) combination
- Visualize results as heatmap
- Identify causal hotspots
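TransformerLens also ships helpers that run this sweep for you. A sketch using the patching module imported in the workflow above (helper name and signature assumed from the library's patching utilities; note it patches the pre-block residual stream rather than resid_post):
# Sweep every (layer, position) in one call; returns a [n_layers, seq_len] tensor
resid_patch_results = patching.get_act_patch_resid_pre(
    model, corrupted_tokens, clean_cache, metric
)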
Workflow 2: Circuit Analysis (Indirect Object Identification)
Replicate the IOI circuit discovery from "Interpretability in the Wild".
Step-by-Step
from transformer_lens import HookedTransformer
import torch
model = HookedTransformer.from_pretrained("gpt2-small")
# IOI task: "When John and Mary went to the store, Mary gave a bottle to"
# Model should predict "John" (indirect object)
prompt = "When John and Mary went to the store, Mary gave a bottle to"
tokens = model.to_tokens(prompt)
# 1. Get baseline logits
logits, cache = model.run_with_cache(tokens)
john_token = model.to_single_token(" John")
mary_token = model.to_single_token(" Mary")
# 2. Compute logit difference (IO - S)
logit_diff = logits[0, -1, john_token] - logits[0, -1, mary_token]
print(f"Logit difference: {logit_diff.item():.3f}")
# 3. Direct logit attribution by head
def get_head_contribution(layer, head):
    # Project head output to logits
    head_out = cache["z", layer][0, :, head, :]  # [pos, d_head]
    W_O = model.W_O[layer, head]  # [d_head, d_model]
    W_U = model.W_U  # [d_model, vocab]
    # Head contribution to logits at final position
    contribution = head_out[-1] @ W_O @ W_U
    return contribution[john_token] - contribution[mary_token]
# 4. Map all heads
head_contributions = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        head_contributions[layer, head] = get_head_contribution(layer, head)
# 5. Identify top contributing heads (name movers, backup name movers)
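A minimal sketch for step 5, picking out the heads with the largest absolute contribution:
flat = head_contributions.flatten()
top = torch.topk(flat.abs(), k=5).indices
for idx in top:
    layer, head = divmod(idx.item(), model.cfg.n_heads)   # flatten is row-major over [layer, head]
    print(f"L{layer}H{head}: {flat[idx].item():.3f}")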
Checklist
- Set up task with clear IO/S tokens
- Compute baseline logit difference
- Decompose by attention head contributions
- Identify key circuit components (name movers, S-inhibition, induction)
- Validate with ablation experiments
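For the last checklist item, a minimal ablation sketch: zero out one candidate head's output and re-measure the logit difference (head 9 of layer 9 here is an illustrative choice, not a claim about which head matters):
def ablate_head(z, hook, head=9):
    z[:, :, head, :] = 0.0   # hook_z has shape [batch, pos, head, d_head]
    return z

ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.9.attn.hook_z", ablate_head)]
)
ablated_diff = ablated_logits[0, -1, john_token] - ablated_logits[0, -1, mary_token]
print(f"Logit diff after ablation: {ablated_diff.item():.3f} (baseline {logit_diff.item():.3f})")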
Workflow 3: Induction Head Detection
Find induction heads, which implement the [A][B]...[A] → [B] pattern.
from transformer_lens import HookedTransformer
import torch
model = HookedTransformer.from_pretrained("gpt2-small")
# Create repeated sequence: [A][B][A] should predict [B]
repeated_tokens = torch.tensor([[1000, 2000, 1000]]) # Arbitrary tokens
_, cache = model.run_with_cache(repeated_tokens)
# Induction heads attend from final [A] back to first [B]
# Check attention from position 2 to position 1
induction_scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]  # [head, q_pos, k_pos]
    # Attention from pos 2 to pos 1
    induction_scores[layer] = pattern[:, 2, 1]
# Heads with high scores are induction heads
top_heads = torch.topk(induction_scores.flatten(), k=5)
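The three-token probe above is only a minimal illustration. A more standard check (a sketch) repeats a longer random sequence and averages each head's attention from tokens in the second half back to the token just after their earlier copy:
# Repeat a random 50-token sequence and score the induction offset
seq_len = 50
prefix = torch.randint(100, 20000, (1, seq_len))
repeated = torch.cat([prefix, prefix], dim=1)
_, rep_cache = model.run_with_cache(repeated)

induction_scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = rep_cache["pattern", layer][0]                         # [head, q_pos, k_pos]
    # For a query at position i in the second half, the induction key is i - seq_len + 1
    diag = pattern.diagonal(offset=-(seq_len - 1), dim1=1, dim2=2)   # [head, n_diag]
    induction_scores[layer] = diag[:, 1:].mean(dim=-1)               # keep second-half queries only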
Common Issues & Solutions
Issue: Hooks persist after debugging
# WRONG: hooks added directly (e.g. with model.add_hook while debugging) stay active
model.add_hook("blocks.0.hook_resid_post", debug_hook)
model.run_with_hooks(tokens, fwd_hooks=[...]) # debug_hook still fires here!
# RIGHT: Always reset hooks before a fresh run
model.reset_hooks()
model.run_with_hooks(tokens, fwd_hooks=[...])
Issue: Tokenization gotchas
# WRONG: Assuming consistent tokenization
model.to_tokens("Tim") # Single token
model.to_tokens("Neel") # Becomes "Ne" + "el" (two tokens!)
# RIGHT: Check tokenization explicitly
tokens = model.to_tokens("Neel", prepend_bos=False)
print(model.to_str_tokens(tokens)) # ['Ne', 'el']
Issue: LayerNorm ignored in analysis
# WRONG: Ignoring LayerNorm
pre_activation = residual @ model.W_in[layer]
# RIGHT: Apply the block's second LayerNorm (ln2, which feeds the MLP) first
ln_out = model.blocks[layer].ln2(residual)
pre_activation = ln_out @ model.W_in[layer]
Issue: Memory explosion with large models
# Use selective caching
logits, cache = model.run_with_cache(
    tokens,
    names_filter=lambda n: "resid_post" in n or "pattern" in n,
    device="cpu"  # Cache on CPU
)
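Gradients are another common source of memory pressure. If you only need forward-pass activations, wrap the run in torch.no_grad() (plain PyTorch practice, not a TransformerLens-specific API):
import torch
with torch.no_grad():   # don't build an autograd graph during caching runs
    logits, cache = model.run_with_cache(tokens)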
Key Classes Reference
| Class | Purpose |
|---|---|
| HookedTransformer | Main model wrapper with hooks |
| ActivationCache | Dictionary-like cache of activations |
| HookedTransformerConfig | Model configuration |
| FactoredMatrix | Efficient factored matrix operations |
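FactoredMatrix keeps low-rank products such as a head's OV circuit (W_V @ W_O) in factored form. A minimal sketch, assuming the FactoredMatrix(A, B) constructor and its .AB property:
from transformer_lens import FactoredMatrix
# OV circuit of layer 5, head 3, stored as two low-rank factors
ov = FactoredMatrix(model.W_V[5, 3], model.W_O[5, 3])
print(ov.shape)   # (d_model, d_model), without materialising the full product
full_ov = ov.AB   # materialise explicitly only if you really need the dense matrix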
Integration with SAELens
TransformerLens integrates with SAELens for Sparse Autoencoder analysis:
from transformer_lens import HookedTransformer
from sae_lens import SAE
model = HookedTransformer.from_pretrained("gpt2-small")
sae = SAE.from_pretrained("gpt2-small-res-jb", "blocks.8.hook_resid_pre")
# Run with SAE
tokens = model.to_tokens("Hello world")
_, cache = model.run_with_cache(tokens)
sae_acts = sae.encode(cache["resid_pre", 8])
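To inspect which features fire on this input, plain torch operations on the returned activations suffice (a sketch; it assumes sae_acts has shape [batch, pos, n_features]):
import torch
final_acts = sae_acts[0, -1]                        # features active at the final token
top_vals, top_ids = torch.topk(final_acts, k=10)
for fid, val in zip(top_ids.tolist(), top_vals.tolist()):
    print(f"feature {fid}: {val:.3f}")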
Reference Documentation
For detailed API documentation, tutorials, and advanced usage, see the references/ folder:
| File | Contents |
|---|---|
| references/README.md | Overview and quick start guide |
| references/api.md | Complete API reference for HookedTransformer, ActivationCache, HookPoints |
| references/tutorials.md | Step-by-step tutorials for activation patching, circuit analysis, logit lens |
External Resources
Tutorials
- Main Demo Notebook
- Activation Patching Demo
- ARENA Mech Interp Course - 200+ hours of tutorials
Papers
- A Mathematical Framework for Transformer Circuits
- In-context Learning and Induction Heads
- Interpretability in the Wild (IOI)
Version Notes
- v2.0: Removed HookedSAE (moved to SAELens)
- v3.0 (alpha): TransformerBridge for loading any nn.Module
Related Skills
pyvene-interventions
This skill provides guidance for performing causal interventions like activation patching and causal tracing on PyTorch models using the pyvene library. It helps developers test causal hypotheses about model behavior through a declarative, dict-based framework. Use it when you need to conduct reproducible interchange intervention experiments for model interpretability.
sparse-autoencoder-training
This Claude Skill provides guidance for training and analyzing Sparse Autoencoders (SAEs) using the SAELens library to decompose complex neural network activations into interpretable features. Use it when you need to discover monosemantic features, analyze superposition, or study model representations for mechanistic interpretability. It helps implement SAEs to transform polysemantic activations into sparse, understandable components.
nnsight-remote-interpretability
This skill enables interpretability experiments on PyTorch models by providing transparent access to neural network internals using the nnsight library. It uniquely supports remote execution via NDIF for massive models (70B+) without requiring local GPU resources. Use it when you need to run the same interpretability code locally on small models or remotely on large-scale foundation models.
web-cli-teleport
This skill helps developers choose between Claude Code Web and CLI interfaces based on task complexity and workflow needs. It enables seamless teleportation of sessions between environments to maintain context and optimize productivity. Use it for session management and to determine the best interface for coding tasks requiring different levels of iteration or back-and-forth interaction.
