SKILL·96ED66

analyze-kernel-bottleneck

Name: analyze-kernel-bottleneck
Author: pjt222

pjt222

Updated 1 month ago

8 views

Othergeneral

About

This skill analyzes GPU kernels to classify them as compute-bound, memory-bound, or latency-bound using roofline analysis, occupancy calculations, and SASS inspection. It provides a decision matrix to recommend optimization strategies like tiling or warp interleaving. Use it before optimizing any CUDA kernel to establish a baseline and identify the correct optimization path.

Quick Install

Claude Code

Recommended

Primary

npx skills add pjt222/agent-almanac -a claude-code

Plugin CommandAlternative

/plugin add https://github.com/pjt222/agent-almanac

Git CloneAlternative

git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/analyze-kernel-bottleneck

Copy and paste this command in Claude Code to install this skill

Documentation

析核心瓶頸

藉量基線效能、於 roofline 分類、計算每瓦片之佔用率與計算／載入比、檢查 SASS 指令組合與停滯碼、檢查共享記憶體懸崖，並施決策矩陣以擇優化策略，系統化辨 GPU 核心為計算受限、記憶體受限或延遲受限。

適用時機

優化任 CUDA 核心前——立基線並分類瓶頸類型
寫一首工版核心後以辨優化路徑
核心相對理論峰之表現低於預期時
於 cp.async、更大瓦片、或演算法重構之間決策時

輸入

必要：已編譯核心（.cubin 或 .cu 源附建構命令）
必要：以 CUDA event 計時啟核心之基準工具
必要：問題維度（如 GEMM 之 M、N、K；attention 之 seq_len、heads、head_dim）
選擇性：目標 GPU 架構（預設：GA104 / sm_86 / RTX 3070 Ti）
選擇性：比對之預期峰用率
選擇性：先前剖析資料（Nsight Compute 報告）

步驟

步驟一：量基線效能

以 CUDA event（BenchTimer）行核心，記時於毫秒。計算有效吞吐量指標：

編譯核心（若未建）：

nvcc --cubin -arch=sm_86 -O2 -o kernel.sm_86.cubin kernel.cu
nvcc -arch=sm_86 -O2 -o bench bench.cu -lcuda -I../../phase2/common

行代表性問題規模，確保暖機先於量測：
```
./bench 4096 4096 4096
```
記核心時於 ms，自 CUDA event（非牆鐘）。
計有效 GFLOPS 與有效頻寬：
- GEMM：effective_gflops = (2 * M * N * K) / (time_ms / 1000) / 1e9
- 頻寬限核心：effective_bw = total_bytes / (time_ms / 1000) / 1e9
- Flash Attention：effective_gflops = (4 * batch * heads * seq_len^2 * head_dim) / (time_ms / 1000) / 1e9

預期： 基線數字：核心時於 ms、有效 GFLOPS、有效頻寬。

失敗時： 查核心啟動而無誤（CHECK_CU 巨集）。驗暖機先於量測。確保問題維度夠大以飽和 GPU（小問題或於啟動開銷瓶頸）。

步驟二：於 roofline 上分類

計算算術強度並對機器平衡點比，以分類核心：

計算算術強度：AI = FLOPs / bytes_loaded_from_global_memory。僅計自 DRAM 載入之獨特位元組（非共享記憶體或暫存器重用）。
查機器平衡點：balance = peak_compute / peak_bandwidth。
分類：若 AI < balance，則記憶體受限。若 AI > balance，則計算受限。

GA104 (RTX 3070 Ti) 參考值：

Resource	Peak	Unit
FP32 FFMA	21.7	TFLOPS
FP16 Tensor Core (HMMA)	174	TFLOPS
INT8 Tensor Core (IMMA)	696	TOPS
DRAM Bandwidth	608	GB/s
L2 Cache	4	MB
SMs	48

衍生平衡點：

Precision	Balance Point (FLOP/byte)
FP32 FFMA	21700 / 608 = 35.7
FP16 TC	174000 / 608 = 286.2
INT8 TC	696000 / 608 = 1144.7

計算所達分數：attained = effective_throughput / peak_throughput。若記憶體受限：比有效頻寬與 608 GB/s。若計算受限：比有效 GFLOPS 與相關峰。

預期： 以數值理由分類為計算受限、記憶體受限或延遲受限（低佔用使計算亦未飽和、記憶體亦未飽和）。

失敗時： 重查位元組計數。慎防冗餘重讀（如直接 conv2d 無 im2col 致 9 倍重讀）。若計算與記憶體皆未飽和，核心恐為延遲受限（見步驟三）。

步驟三：計算佔用率

由啟動配置與資源使用判定每 SM 之活躍 warp：

提取資源使用：

nvcc --cubin -arch=sm_86 -O2 --resource-usage -o kernel.sm_86.cubin kernel.cu 2>&1 | grep -E 'registers|smem'

由啟動配置：warps_per_block = threads_per_block / 32。
計算 blocks/SM，自各限制因素：
- 暫存器限：floor(65536 / (registers_per_thread * threads_per_block))
- Smem 限：floor(available_smem_per_SM / smem_per_block)——懸崖見步驟六
- Warp 限：floor(48 / warps_per_block)（GA104 最大：48 warps/SM）
- Block 限：GA104 最大 16 blocks/SM
實際 blocks/SM = min(register_limit, smem_limit, warp_limit, block_limit)。
活躍 warps/SM = blocks_per_SM * warps_per_block。
關鍵閾值：8 warps/SM 於 GA104 足以隱藏延遲。低於 8 = 結構問題致延遲受限行為。

預期： 佔用率表呈 blocks/SM、活躍 warps/SM、限制因素（暫存器、smem 或 warp）。

失敗時： 為動態共享記憶體查 cuFuncSetAttribute。驗 --resource-usage 之回報合於實際啟動配置。若暫存器數出乎意料地高，試 --maxrregcount=N 以限暫存器（以暫存器溢位換佔用率）。

步驟四：計算每瓦片之計算／載入比

自 SASS（非源碼）計算每 K 瓦片之計算指令與載入位元組：

反組譯：

cuobjdump -sass kernel.sm_86.cubin > kernel.sass

計算每瓦片之計算指令（內迴圈於一 K 瓦片）：
- grep -c 'HMMA' kernel.sass — FP16 Tensor Core ops
- grep -c 'IMMA' kernel.sass — INT8 Tensor Core ops
- grep -c 'FFMA' kernel.sass — FP32 fused multiply-add
計算每瓦片之全域載入：
- grep -c 'LDG' kernel.sass — global memory loads
- 乘以每載入之位元組（典型 LDG.128 為 16 位元組）
計算比：每瓦片之 compute_ops / load_ops。
以 cp.async 決策閾值分類（自 gpu_reflections.md Insight 2）：
- 高（>20:1）：cp.async 淨負；warp interleaving 已隱藏 DRAM 延遲。重於演算法變動。參考：Flash Attention 每瓦片 64 HMMA = 高比，cp.async 量得 -5%。
- 中（5-20:1）：cp.async 或助，二路皆基準。
- 低（<5:1）：cp.async 大有助；載入主導，async copy 隱延遲。參考：IGEMM 每瓦片 8 IMMA = 低比，cp.async 量得 +35%。

預期： 計算／載入比附分類（高／中／低）與 cp.async 建議。

失敗時： 自 SASS 反組譯計，非源碼——編譯器或融合、消除或重排指令。確僅計內迴圈內之指令（K 瓦片迭代），非整核心。

步驟五：檢查 SASS 指令

審完整 SASS 指令組合與停滯碼：

反組譯（若於步驟四未行）：

cuobjdump -sass kernel.sm_86.cubin > kernel.sass

計關鍵指令類型：

grep -c 'HMMA.16816' kernel.sass      # FP16 Tensor Core
grep -c 'IMMA.16816' kernel.sass      # INT8 Tensor Core
grep -c 'FFMA' kernel.sass            # FP32 fused multiply-add
grep -c 'LDGSTS' kernel.sass          # cp.async (global->shared)
grep -c 'LDG' kernel.sass             # Global load
grep -c 'STS' kernel.sass             # Shared store
grep -c 'LDS' kernel.sass             # Shared load
grep -c 'BAR.SYNC' kernel.sass        # Barrier synchronization
grep -c 'SHFL' kernel.sass            # Warp shuffle (reductions)
grep -c 'MUFU' kernel.sass            # Special function unit

查關鍵指令之停滯碼：

grep 'HMMA' kernel.sass | head -5     # Expect S08 minimum (hardware constraint)
grep 'IMMA' kernel.sass | head -5     # Compiler emits S04, reducible to S02 via CuAssembler
grep 'FFMA' kernel.sass | head -5     # Check for S04 (reducible to S01 on independent FFMAs)

辨優化目標：
- HMMA S08 停滯：Ampere 上之硬體下限，不可降。重於他處
- IMMA S04 停滯：編譯器保守。CuAssembler 可緊至 S02（量得 15-20% 增益）
- FFMA S04 停滯：若獨立，經 CuAssembler 可降至 S01
- 過量 BAR.SYNC：或表管線階段間過度同步

預期： 指令計數表與停滯碼摘要附辨識之優化目標。

失敗時： 確保 cuobjdump 之架構合於核心編譯之目標（皆須 sm_86）。若 SASS 輸出為空，cubin 或損——重編。

步驟六：檢查 Smem 懸崖

判共享記憶體用量是否越架構特有之佔用率懸崖：

讀 smem/block 自 --resource-usage 輸出（步驟三）或 cuobjdump --res-usage kernel.sm_86.cubin。
比懸崖閾值：
- GA104 (sm_86)：100 KB 最大 smem/SM。懸崖於 50 KB/block。
- 經驗確認：48 KB/block -> 2 blocks/SM（佳）、56 KB/block -> 1 block/SM（2 倍退化）。
若於懸崖之上（smem > 50 KB/block）：
- Blocks/SM 降至 1，活躍 warp 降至 warps_per_block（典型 4）
- 預期 2 倍效能退化，因 DRAM 停滯曝露
查雙緩衝影響：雙緩衝倍 smem 用量。若當前 smem 為 30 KB，雙緩衝 = 60 KB，越懸崖。評 async 益是否勝佔用損。
記 smem/block、blocks/SM、是否越懸崖。

預期： smem/block 值附 blocks/SM 計與 50 KB 懸崖是否越之明確聲明。

失敗時： 若於懸崖之上且佔用為瓶頸，優化策略須變：減瓦片大以使 smem 低於 50 KB，或接 1 block/SM 而以每瓦片更高之計算／載入比補（更多暫存器重用、更長 K 瓦片）。

步驟七：建決策矩陣

將步驟二至六之發現綜為優化策略：

Condition	Strategy
Memory-bound + low compute/load ratio (<5:1) + smem under cliff	Software pipelining with cp.async (LDGSTS). Overlap global loads with compute.
Memory-bound + high compute/load ratio (>20:1) + 8+ warps	Warp interleaving already hides latency. Focus on algorithmic changes: implicit GEMM, split-Q, im2col.
Compute-bound + FFMA-heavy	CuAssembler stall code tightening: S04 -> S01 on independent FFMAs.
Compute-bound + HMMA-heavy	S08 is hardware minimum, cannot reduce. Increase tile reuse (larger M/N tiles, longer K-loop).
Compute-bound + IMMA-heavy	CuAssembler: S04 -> S02 on IMMA instructions (compiler is conservative).
Latency-bound (low occupancy, neither saturated)	Reduce smem or registers to get more blocks/SM. Get above 8 warps/SM.
Smem above cliff	Reduce tile size or restructure to get smem/block under 50 KB (GA104).

以預期增益排序適用之策略，用計算／載入比與佔用資料。
估各策略之增益範圍，依核心離相關天花板之距。
標衝突：如 cp.async 倍 smem（或越懸崖），更大瓦片增暫存器壓力（或減佔用）。

預期： 排序之建議優化清單，附預測之增益範圍與潛在衝突。

失敗時： 若無明顯勝者現，行隔離各策略之微基準（如獨測 cp.async、獨測減瓦片大），合前以量實際影響。

步驟八：記錄發現

產結構化之瓶頸報告：

基線：核心時、有效 GFLOPS、有效頻寬、問題維度
roofline 位置：算術強度、分類、所達峰之分數
佔用：blocks/SM、活躍 warps/SM、限制因素
計算／載入比：比值、分類（高／中／低）、cp.async 建議
SASS 摘要：指令計數表、停滯碼發現、CuAssembler 目標
Smem 懸崖：smem/block、blocks/SM、懸崖狀態
建議：排序之優化策略附增益估計

## Bottleneck Analysis Report: [kernel_name]

### Baseline
- Problem: [dimensions]
- Kernel time: [X] ms
- Effective GFLOPS: [Y] | Effective BW: [Z] GB/s

### Roofline Classification
- Arithmetic intensity: [AI] FLOP/byte
- Balance point: [BP] FLOP/byte ([precision])
- Classification: **[compute|memory|latency]-bound**
- Attained fraction: [X]% of peak

### Occupancy
| Resource | Per Block | Limit/SM | Blocks/SM |
|----------|-----------|----------|-----------|
| Registers | [N]/thread | 65536 | [B] |
| Shared mem | [X] KB | 100 KB (cliff: 50 KB) | [B] |
| Warps | [W] | 48 | [B] |
| **Limiting** | | | **[min(B)]** |
- Active warps/SM: [W] ([sufficient|insufficient] for latency hiding)

### Compute/Load Ratio
- Compute ops/tile: [N] [HMMA|IMMA|FFMA]
- Load bytes/tile: [N] bytes ([N] LDG x [N] bytes)
- Ratio: [X]:1 — **[high|medium|low]**
- cp.async recommendation: [beneficial|neutral|detrimental]

### SASS Instruction Mix
| Instruction | Count | Notes |
|-------------|-------|-------|
| HMMA.16816 | [N] | Stall: S08 (hardware min) |
| IMMA.16816 | [N] | Stall: S04 (reducible to S02) |
| FFMA | [N] | Stall: S04 (reducible to S01) |
| LDG | [N] | |
| LDGSTS | [N] | cp.async |
| BAR.SYNC | [N] | |

### Smem Cliff
- Smem/block: [X] KB — [under|over] 50 KB cliff
- Blocks/SM: [B] — [no occupancy loss|occupancy halved]

### Recommended Optimizations (ranked)
1. [Strategy] — estimated [X-Y]% gain
2. [Strategy] — estimated [X-Y]% gain
3. [Strategy] — estimated [X-Y]% gain

預期： 完整 markdown 報告，可供核心優化代理或人類開發者消用。

失敗時： 以不同問題大（如 1024、2048、4096、8192）重行以證發現非規模特有。小問題或現延遲受限而真實規模上之瓶頸為記憶體頻寬。

驗證

基線以 CUDA event（非牆鐘）量
roofline 分類已定（計算／記憶體／延遲受限）
佔用率已計，限制因素已辨
每瓦片計算／載入比自 SASS 計算
SASS 指令組合與停滯碼已記
Smem 懸崖已對架構閾值查
決策矩陣已施附策略建議
發現於結構化報告中已記

常見陷阱

重讀倍乘：直接 conv2d 無 im2col 將每權重讀 9 次，膨脹位元組計數 9 倍。計算算術強度時用 DRAM 實際載入之獨特位元組，非載入指令總數
混 FP16 Tensor Core 峰與 FP32 峰：FP16 TC 峰為 174 TFLOPS，FP32 FFMA 峰為 21.7 TFLOPS——8 倍之差。用錯峰使 roofline 分類無意義
GA104 上以 64 KB 為 smem 懸崖而非 50 KB：GA104 (sm_86) 最大 100 KB smem/SM。懸崖於 100/2 = 50 KB/block，非 64 KB。此架構特有；他 GPU 異
評 cp.async 時忽 warp interleaving：8 warp 附長計算階段（高計算／載入比）已透過 warp 排程隱 DRAM 延遲。於此區增 cp.async 增 smem 壓力與屏障開銷而無益（Flash Attention 量得 -5%）
自源碼計指令而非 SASS：編譯器或融合操作、消死碼、不同地展開迴圈或重排指令。恆自 cuobjdump -sass 輸出計
未行暖機迭代：首次核心啟含 JIT 編譯開銷與冷快取效應。量測前恆行 2-5 次暖機

GitHub Repository

pjt222/agent-almanac

Path: i18n/wenyan-lite/skills/analyze-kernel-bottleneck

agentsagentskillsai-assisted-developmentclaude-codeskillsteams

FAQ

Frequently asked questions

What is the analyze-kernel-bottleneck skill?

analyze-kernel-bottleneck is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform analyze-kernel-bottleneck-related tasks without extra prompting.

How do I install analyze-kernel-bottleneck?

Use the install commands on this page: add analyze-kernel-bottleneck to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does analyze-kernel-bottleneck belong to?

analyze-kernel-bottleneck is in the Other category, tagged general.

Is analyze-kernel-bottleneck free to use?

Yes. analyze-kernel-bottleneck is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.

Related Skills

llamaguard

Other

LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.

View skill

cost-optimization

Other

This Claude Skill helps developers optimize cloud costs through resource rightsizing, tagging strategies, and spending analysis. It provides a framework for reducing cloud expenses and implementing cost governance across AWS, Azure, and GCP. Use it when you need to analyze infrastructure costs, right-size resources, or meet budget constraints.

View skill

sports-betting-analyzer

Other

This Claude Skill analyzes sports betting markets including spreads, over/unders, and prop bets by examining historical trends and situational statistics to identify value bets. It provides structured markdown output with actionable recommendations for educational purposes. Developers should use this for sports betting analysis tools while noting it's designed for entertainment/education only.

View skill

quantizing-models-bitsandbytes

Other

This skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, achieving 50-75% memory reduction with minimal accuracy loss. It's ideal for running larger models on limited GPU memory or accelerating inference, supporting formats like INT8, NF4, and FP4. The skill integrates with HuggingFace Transformers and enables QLoRA training and 8-bit optimizers.

View skill

analyze-kernel-bottleneck

About

Quick Install

Claude Code

Documentation

析核心瓶頸

適用時機

輸入

步驟

步驟一：量基線效能

步驟二：於 roofline 上分類

步驟三：計算佔用率

步驟四：計算每瓦片之計算／載入比

步驟五：檢查 SASS 指令

步驟六：檢查 Smem 懸崖

步驟七：建決策矩陣

步驟八：記錄發現

驗證

常見陷阱

相關技能

GitHub Repository

Frequently asked questions

What is the analyze-kernel-bottleneck skill?

How do I install analyze-kernel-bottleneck?

What category does analyze-kernel-bottleneck belong to?

Is analyze-kernel-bottleneck free to use?

Related Skills