pipeline-gpu-kernel
About
This skill applies software pipelining (double-buffering) to GPU kernels to overlap global memory loads with Tensor Core computations. It restructures kernel code into prologue, loop, and epilogue sections, selecting between LDG-register or cp.async methods based on compute/load ratios. Use it when `analyze-kernel-bottleneck` identifies memory-bound kernels with low compute-to-load ratios for optimization.
Quick Install
Claude Code
Recommendednpx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/pipeline-gpu-kernelCopy and paste this command in Claude Code to install this skill
Documentation
管核
雙緩→載算疊也。轉序載-同-算-同 K-環為前-環-後三段,按算/載比擇 LDG 暫存或 cp.async (LDGSTS),驗共存於占崖下,視 SASS 證疊。
用
analyze-kernel-bottleneck識為憶束、低算/載比→用- 束交不掩 DRAM 遲(GA104 約 300 週期)→用
- 核有序載-同-算-同 K-環可重構→用
- 算/載比高(>20:1)且 ≥8 束活→不需
入
- 必:CUDA 核源(
.cu)含 K-環、載算分段 - 必:標 GPU 架(如 GA104 / sm_86,定共存崖與占限)
- 必:當前塊大小(BM、BN、BK)、型(FP16、FP32、INT8)
- 可:每塊算/載比(自
analyze-kernel-bottleneck,未給則估) - 可:基準(未管時於標問題大小之效)
行
一:驗前提
確核含 K-環、載算分以 __syncthreads() 隔。算雙共存,驗於架占崖下。
- 尋 K-環。當有此序:自全域載 A、B 塊入共存、
__syncthreads()、於共存塊算(HMMA/IMMA/FFMA)、__syncthreads()。 - 記單緩共存大小:
smem_a_size = BM * BK * sizeof(T)、smem_b_size = BK * BN * sizeof(T)。 - 算雙緩本:
smem_doubled = smem_a_size * 2 + smem_b_size * 2。 - 對架崖。GA104 (sm_86):每 SM 100 KB 共存頂、崖於每塊 50 KB(過 50 KB = 每 SM 1 塊 = 4 束、占減半)。
Single buffer: smem_a[BM*BK] + smem_b[BK*BN] = 2 KB + 2 KB = 4 KB
Double buffer: smem_a[2][BM*BK] + smem_b[2][BK*BN] = 4 KB + 4 KB = 8 KB
8 KB << 50 KB cliff -> 2 blocks/SM -> 8 warps
- 驗環次:
num_tiles = K / BK。管需num_tiles >= 2(至少前一+主環一)。
得:共存表示單雙緩本,雙配於崖下、每 SM ≥2 塊。
敗:雙緩過崖→減塊(半 BK 或 BM)至 smem_doubled <= 50 KB(GA104)。或用純暫存預取(LDG 變)不雙共存——預取於暫存,__syncthreads() 後寫同單緩。
二:擇變
按每塊算/載比擇 LDG 暫存或 cp.async (LDGSTS)。
- 算比:
ratio = (2 * BM * BN * BK) / ((BM * BK + BK * BN) * sizeof(T))(GEMM 類核:每乘加 2 FLOP,每塊載字節)。 - 施判則:
LDG 暫存變(比 ≥5 或 CUDA <11.0):
- LDG N+1 塊入暫存(非阻全域載)。
- 算於
buf[N % 2](與在飛 LDG 疊)。 __syncthreads()、STS 暫存入buf[(N+1) % 2]、__syncthreads()。- 簡也,無管 API 依。
- 增暫存壓:每束約
(BM * BK + BK * BN) / BLOCK_SIZE暫存。
cp.async (LDGSTS) 變(比 <5、CUDA ≥11.0):
__pipeline_memcpy_asyncN+1 塊直入buf[(N+1) % 2](異步、繞暫存檔)。__pipeline_commit()於算前。- 算於
buf[N % 2]。 __pipeline_wait_prior(0)+__syncthreads()於算後。- 疊優、預取無暫存壓。需
#include <cuda_pipeline.h>。
- 判限(GA104 IGEMM 4096x4096x4096 測):
- 比 <5:1——宜 cp.async(IGEMM 測 +35%)。
- 比 5-20:1——皆建、測擇。
- 比 >20:1——管恐無益(束交足)。
得:擇變、據算/載比與標架。
敗:比模糊(5-20:1)→皆建、測。CUDA 支則 cp.async 為穩擇。
三:重構 K-環
化序載-同-算-同環為管前-環-後構。
-
識三段:原環體分三:
- 前:載塊 0 入
buf[0]、同、入主環。 - 主環:塊 1 至
num_tiles - 1、疊載 N+1 與算 N。 - 後:算末塊(末主環次已載)。
- 前:載塊 0 入
-
LDG 暫存變構:
// === LDG-register variant ===
// Prologue: load tile 0 into buf[0]
cooperative_load_tile(smem_a[0], smem_b[0], global_a, global_b, /*k_offset=*/0);
__syncthreads();
for (int tile = 0; tile < num_tiles - 1; tile++) {
int cur_buf = tile & 1;
int next_buf = 1 - cur_buf;
// Phase 1: LDG next tile into registers (non-blocking)
float reg_a[ELEMS_PER_THREAD_A], reg_b[ELEMS_PER_THREAD_B];
prefetch_tile_to_registers(reg_a, reg_b, global_a, global_b,
(tile + 1) * BK);
// Phase 2: Compute on current buffer (overlaps with LDG flight)
tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);
// Phase 3: Drain registers into next buffer
__syncthreads();
store_registers_to_smem(smem_a[next_buf], smem_b[next_buf],
reg_a, reg_b);
__syncthreads();
}
// Epilogue: compute last tile
tensor_core_mma(smem_a[(num_tiles - 1) & 1], smem_b[(num_tiles - 1) & 1], acc);
- cp.async 變構:
// === cp.async variant ===
#include <cuda_pipeline.h>
// Prologue: async load tile 0 into buf[0]
cpasync_load_tile(smem_a[0], smem_b[0], global_a, global_b, /*k_offset=*/0);
__pipeline_commit();
__pipeline_wait_prior(0);
__syncthreads();
for (int tile = 0; tile < num_tiles - 1; tile++) {
int cur_buf = tile & 1;
int next_buf = 1 - cur_buf;
// Phase 1: cp.async next tile into next buffer (async, direct to smem)
cpasync_load_tile(smem_a[next_buf], smem_b[next_buf],
global_a, global_b, (tile + 1) * BK);
__pipeline_commit();
// Phase 2: Compute on current buffer (overlaps with LDGSTS in flight)
tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);
// Phase 3: Wait for async copies to complete
__pipeline_wait_prior(0);
__syncthreads();
}
// Epilogue: compute last tile
tensor_core_mma(smem_a[(num_tiles - 1) & 1], smem_b[(num_tiles - 1) & 1], acc);
- 驗環次:主環行
num_tiles - 1次(塊 0 至num_tiles - 2算、塊 1 至num_tiles - 1載)。後算末次所載塊。
得:重構源碼,前、主環、後段明於擇變。
敗:常誤為緩索差一或忘後算。驗:前載入 buf[0]、首主環次算 buf[0] 載入 buf[1]、次次算 buf[1] 載入 buf[0]、餘類推。後算 buf[(num_tiles - 1) & 1]。
四:實雙緩
宣雙緩共存、實載函。
- 易單緩共存宣為雙緩陣:
// Before (single buffer)
__shared__ half smem_a[BM * BK];
__shared__ half smem_b[BK * BN];
// After (double buffer)
__shared__ half smem_a[2][BM * BK];
__shared__ half smem_b[2][BK * BN];
- cp.async 變、用管 API 實異載函:
__device__ void cpasync_load_tile(half* dst_a, half* dst_b,
const half* src_a, const half* src_b,
int k_offset) {
// Each thread copies its portion (16 bytes = 8 half values per cp.async)
int tid = threadIdx.x;
int bytes_per_thread = 16; // cp.async.cg supports 4, 8, or 16 bytes
// A tile: BM * BK elements, distributed across BLOCK_SIZE threads
int elems_a = BM * BK / BLOCK_SIZE;
for (int i = 0; i < elems_a; i += 8) {
int idx = tid * elems_a + i;
__pipeline_memcpy_async(dst_a + idx,
src_a + k_offset * BM + idx,
bytes_per_thread);
}
// B tile: BK * BN elements, distributed similarly
int elems_b = BK * BN / BLOCK_SIZE;
for (int i = 0; i < elems_b; i += 8) {
int idx = tid * elems_b + i;
__pipeline_memcpy_async(dst_b + idx,
src_b + k_offset * BN + idx,
bytes_per_thread);
}
}
- LDG 變、實暫存暫陣與儲函:
// Declare register staging (size = elements per thread)
half reg_a[BM * BK / BLOCK_SIZE];
half reg_b[BK * BN / BLOCK_SIZE];
// Prefetch: LDG from global to registers (non-blocking, issued early)
for (int i = 0; i < BM * BK / BLOCK_SIZE; i++) {
int idx = threadIdx.x * (BM * BK / BLOCK_SIZE) + i;
reg_a[i] = global_a[k_offset * BM + idx];
}
// ... similarly for reg_b
// Store: STS from registers to shared memory (after __syncthreads)
for (int i = 0; i < BM * BK / BLOCK_SIZE; i++) {
int idx = threadIdx.x * (BM * BK / BLOCK_SIZE) + i;
smem_a[next_buf][idx] = reg_a[i];
}
- 留
__launch_bounds__(BLOCK_SIZE)於核以授譯器準占信息。 - 譯:
nvcc --cubin -arch=sm_86 -O2 -o kernel.sm_86.cubin kernel.cu。
得:可譯之核、雙緩共存與擇載機。cubin 成而無誤。
敗:管 API 譯敗→確 #include <cuda_pipeline.h> 在、CUDA ≥11.0。暫存溢出(察 nvcc --resource-usage)→增 BLOCK_SIZE 或減 BK 以縮暫存陣。
五:驗確
行管核對 CPU 參、確數同。
- 譯基準:
nvcc -arch=sm_86 -O2 -o bench bench.cu -lcuda -I../../phase2/common。 - 先小問題(512x512x512)以捉索誤、後擴。
- 施型對之容差:
- INT8 Tensor Core (IMMA):
abs=0.5, rel=0.1 - FP16 Tensor Core (HMMA):
abs=1e-2, rel=1e-2 - FP32 純 (FFMA):
abs=1e-3, rel=1e-3
- INT8 Tensor Core (IMMA):
- 管不變算——只重序載。確敗→誤在緩索、非算邏。
- 試於標問題(如 4096x4096x4096)以驗邊。
得:小、標問題皆 PASS,誤界同未管基準。
敗:緩索差最常因。驗:算讀 buf[tile & 1]、載寫 buf[1 - (tile & 1)]。確後處 (num_tiles - 1) & 1、非 num_tiles & 1。cp.async:驗 __pipeline_wait_prior(0) 完於 __syncthreads() 前——否則算讀半寫資料。
六:基測比
於標問題量管核對未管基準。
- 行未管基準、記 GFLOPS 或帶寬(按核型)。
- 行各管變、記同度。
- 算速比:
speedup = pipelined_metric / baseline_metric。 - 按算/載比預期之得(GA104 測):
- 低比(<5:1):cp.async +15-35%(IGEMM 測:LDG +18%、cp.async +35% 於 4096x4096x4096)。
- 中比(5-20:1):+5-15%。
- 高比(>20:1):0-5% 或退步。
- 若皆建、擇速者用之。
| Variant | GFLOPS | Speedup vs Baseline |
|------------------|--------|---------------------|
| Baseline | XXX | 1.00x |
| LDG-register | XXX | X.XXx |
| cp.async (LDGSTS)| XXX | X.XXx |
得:效比表示進。擇變當示測速、合算/載比預。
敗:效退→察三:(1) SASS 有未期指令過載(多 BAR.SYNC、暫存溢)。(2) 共存未過占崖——nvcc --resource-usage 或 cuobjdump -res-usage 驗。(3) 問題大小生足塊(K / BK >= 4)以攤前後過載。
七:驗 SASS 疊
察譯 SASS、確全域載與 Tensor Core 指令於主環體疊。
- 反譯:
cuobjdump -sass kernel.sm_86.cubin | grep -E 'IMMA|HMMA|LDGSTS|LDG|BAR'。 - 主環體、驗此序:
LDGSTS或LDG於HMMA或IMMA前。- 載與算間無
BAR.SYNC(須束調度可疊)。 BAR.SYNC於算後、閘下次用所載資料。
- 察 HMMA/IMMA 滯碼——HMMA S08(管延)期之必然。IMMA S01-S04 常。LDG/LDGSTS 滯當低(S01)、束調度可於載飛換算。
- 數每環次 HMMA/IMMA 指令——當合未管版(管不變算量)。
# Full SASS pipeline verification
cuobjdump -sass kernel.sm_86.cubin | grep -E 'IMMA|HMMA|LDGSTS|LDG|BAR'
# Count compute instructions per loop
cuobjdump -sass kernel.sm_86.cubin | grep -c 'HMMA\|IMMA'
# Check for register spills
nvcc --resource-usage --cubin -arch=sm_86 -O2 kernel.cu 2>&1 | grep -i spill
得:SASS 注釋示載-前-算式、無中閘。零暫存溢。
敗:譯器重序載於算後(破疊)→試:(1) #pragma unroll 1 於主環防展開過。(2) 分載算為獨函以授序提示。(3) asm volatile("" ::: "memory") 為譯器界於載算間(末計——恐抑他優)。
驗
- 雙緩共存於架崖下(GA104:每塊 50 KB)
- 二緩交用(
buf[tile & 1]式) - 前載塊 0 入
buf[0] - 後算末塊自
buf[(num_tiles - 1) & 1] - 對 CPU 參、小、標皆 PASS
- SASS 證載算疊(LDGSTS/LDG 與 IMMA/HMMA 間無
BAR.SYNC) - 效進於未管基準
- LDG 變無暫存溢(察
nvcc --resource-usage)
忌
- 雙緩過共存崖——GA104 崖為每塊 50 KB、非 64 KB。必先算
smem_doubled。28 KB 單緩之核倍為 56 KB、過崖、占減半。可使 +20% 管得轉 -50% 占退。 - 忘後算——末主環次所載末塊需自身算於環外。無之、末 BK 列默落、生誤果似小數誤非顯敗。
- 緩索差一——當前算用
buf[tile & 1]、下載用buf[1 - (tile & 1)]。常誤用buf[(tile + 1) & 1]為下緩、緩數 2 時等buf[1 - (tile & 1)]、誤入算索則讀錯。 - cp.async commit/wait 序——
__pipeline_commit()必呼於算 BEFORE(封異拷批)。__pipeline_wait_prior(0)必呼於算 AFTER(阻至所封拷完)。互換則異拷變同、疊益盡失。 - 缺 __syncthreads——LDG 變、算與 STS 排間需
__syncthreads()(算讀畢當前緩、後覆寫)。STS 排後另需__syncthreads()(諸束寫畢、後次讀)。cp.async 變、__pipeline_wait_prior(0)後__syncthreads()確諸束見完拷。 - cp.async 邊處——
__pipeline_memcpy_async需源址有效對齊。陣邊K非BK倍、末塊恐越界。退用純載含界檢於末塊、或補入陣至 BK 倍。
參
analyze-kernel-bottleneck— 識核為憶束乎、算驅變擇之算/載比
GitHub Repository
Related Skills
llamaguard
OtherLlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.
cost-optimization
OtherThis Claude Skill helps developers optimize cloud costs through resource rightsizing, tagging strategies, and spending analysis. It provides a framework for reducing cloud expenses and implementing cost governance across AWS, Azure, and GCP. Use it when you need to analyze infrastructure costs, right-size resources, or meet budget constraints.
quantizing-models-bitsandbytes
OtherThis skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, achieving 50-75% memory reduction with minimal accuracy loss. It's ideal for running larger models on limited GPU memory or accelerating inference, supporting formats like INT8, NF4, and FP4. The skill integrates with HuggingFace Transformers and enables QLoRA training and 8-bit optimizers.
dispatching-parallel-agents
OtherThis Claude Skill dispatches multiple agents to investigate and fix 3+ independent problems concurrently. It is designed for scenarios involving unrelated failures that can be resolved without shared state or dependencies. The core capability is parallel problem-solving, assigning one agent per independent problem domain to maximize efficiency.
