SKILL·5918BA

pipeline-gpu-kernel

Name: pipeline-gpu-kernel
Author: pjt222

pjt222

업데이트됨 1 month ago

8 조회

기타api

정보

이 스킬은 타일링된 GPU 커널을 재구성하여 소프트웨어 파이프라이닝을 구현하고, 글로벌 메모리 로드와 Tensor Core 연산을 중첩 실행할 수 있도록 합니다. 컴퓨트/로드 비율에 기반하여 LDG-레지스터와 cp.async 방식을 선택하고, 점유율 한계에 맞춰 공유 메모리를 관리하는 방법을 개발자에게 안내합니다. 프롤로그, 메인 루프, 에필로그를 재구성하여 커널을 최적화하고 SASS 수준에서 검증할 때 사용하세요.

빠른 설치

Claude Code

문서

Pipeline GPU Kernel

Double-buffer tiled GPU kernel → tile N+1 global load overlaps tile N Tensor Core compute. Sequential load-sync-compute-sync K-loop → prologue/loop/epilogue. Pick LDG-reg vs cp.async (LDGSTS) by compute/load ratio. Verify smem under arch cliff. Confirm overlap in SASS.

Use When

analyze-kernel-bottleneck flags mem-bound kernel, low compute/load per tile
Warp interleave alone can't hide DRAM latency (~300 cyc GA104)
Sequential load-sync-compute-sync K-loop → restructurable
Skip → ratio >20:1 + 8+ warps active

In

Required: CUDA kernel .cu w/ tiled K-loop, separate load + compute phases
Required: GPU arch (e.g., GA104 / sm_86 → smem cliff + occupancy)
Required: Tile sizes (BM, BN, BK) + dtype (FP16, FP32, INT8)
Optional: Compute/load ratio per tile (from analyze-kernel-bottleneck)
Optional: Baseline (non-pipelined perf at target size)

Do

Step 1: Verify Preconditions

K-loop has load + compute phases split by __syncthreads(). Calc doubled smem cost vs arch cliff.

Locate K-loop. Structure: load A+B tiles global→smem, __syncthreads(), compute (HMMA/IMMA/FFMA) on smem tiles, __syncthreads().
Single-buffer smem: smem_a_size = BM * BK * sizeof(T), smem_b_size = BK * BN * sizeof(T).
Double-buffer cost: smem_doubled = smem_a_size * 2 + smem_b_size * 2.
Vs arch cliff. GA104 (sm_86): 100 KB max smem/SM, cliff 50 KB/block (>50 KB = 1 block/SM = 4 warps, 2x occupancy collapse).

Single buffer: smem_a[BM*BK] + smem_b[BK*BN] = 2 KB + 2 KB = 4 KB
Double buffer: smem_a[2][BM*BK] + smem_b[2][BK*BN] = 4 KB + 4 KB = 8 KB
8 KB << 50 KB cliff -> 2 blocks/SM -> 8 warps

Loop count: num_tiles = K / BK. Pipelining needs num_tiles >= 2.

→ Smem budget table: single + double cost, doubled under cliff, ≥2 blocks/SM.

If err: doubled > cliff → halve BK or BM until smem_doubled <= 50 KB GA104. Or use reg-only prefetch (LDG variant), no smem doubling — stage in regs, write same single buffer after __syncthreads().

Step 2: Choose Variant

LDG-reg vs cp.async (LDGSTS) by compute/load ratio per tile.

Ratio: ratio = (2 * BM * BN * BK) / ((BM * BK + BK * BN) * sizeof(T)) (GEMM-like: 2 FLOPs/MAD, bytes/tile).
Decide:

LDG-register (ratio >= 5 or CUDA < 11.0):

LDG tile N+1 → regs (non-blocking global loads).
Compute on buf[N % 2] (overlaps outstanding LDGs).
__syncthreads(), STS regs → buf[(N+1) % 2], __syncthreads().
Simpler, no pipeline API dep.
Reg pressure: ~(BM * BK + BK * BN) / BLOCK_SIZE regs/thread for staging.

cp.async (LDGSTS) (ratio < 5, CUDA >= 11.0):

__pipeline_memcpy_async tile N+1 → buf[(N+1) % 2] (async, bypass reg file).
__pipeline_commit() before compute.
Compute on buf[N % 2].
__pipeline_wait_prior(0) + __syncthreads() after compute.
Better overlap, zero reg pressure for prefetch. Needs #include <cuda_pipeline.h>.

Thresholds (GA104 IGEMM 4096x4096x4096):
- <5:1 → cp.async (+35% on IGEMM).
- 5-20:1 → impl both, bench.
- 20:1 → likely no gain (warp interleave enough).

→ Variant + justification (ratio + arch).

If err: ambiguous (5-20:1) → impl both, bench. cp.async = safer default if CUDA supports.

Step 3: Restructure K-Loop

Sequential load-sync-compute-sync → prologue/loop/epilogue.

Three sections:
- Prologue: load tile 0 → buf[0], sync, enter loop.
- Main loop: tiles 1 to num_tiles - 1, overlap load N+1 w/ compute N.
- Epilogue: compute last tile (loaded by final main iter).
LDG-register:

// === LDG-register variant ===
// Prologue: load tile 0 into buf[0]
cooperative_load_tile(smem_a[0], smem_b[0], global_a, global_b, /*k_offset=*/0);
__syncthreads();

for (int tile = 0; tile < num_tiles - 1; tile++) {
    int cur_buf = tile & 1;
    int next_buf = 1 - cur_buf;

    // Phase 1: LDG next tile into registers (non-blocking)
    float reg_a[ELEMS_PER_THREAD_A], reg_b[ELEMS_PER_THREAD_B];
    prefetch_tile_to_registers(reg_a, reg_b, global_a, global_b,
                               (tile + 1) * BK);

    // Phase 2: Compute on current buffer (overlaps with LDG flight)
    tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);

    // Phase 3: Drain registers into next buffer
    __syncthreads();
    store_registers_to_smem(smem_a[next_buf], smem_b[next_buf],
                            reg_a, reg_b);
    __syncthreads();
}

// Epilogue: compute last tile
tensor_core_mma(smem_a[(num_tiles - 1) & 1], smem_b[(num_tiles - 1) & 1], acc);

cp.async:

// === cp.async variant ===
#include <cuda_pipeline.h>

// Prologue: async load tile 0 into buf[0]
cpasync_load_tile(smem_a[0], smem_b[0], global_a, global_b, /*k_offset=*/0);
__pipeline_commit();
__pipeline_wait_prior(0);
__syncthreads();

for (int tile = 0; tile < num_tiles - 1; tile++) {
    int cur_buf = tile & 1;
    int next_buf = 1 - cur_buf;

    // Phase 1: cp.async next tile into next buffer (async, direct to smem)
    cpasync_load_tile(smem_a[next_buf], smem_b[next_buf],
                      global_a, global_b, (tile + 1) * BK);
    __pipeline_commit();

    // Phase 2: Compute on current buffer (overlaps with LDGSTS in flight)
    tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);

    // Phase 3: Wait for async copies to complete
    __pipeline_wait_prior(0);
    __syncthreads();
}

// Epilogue: compute last tile
tensor_core_mma(smem_a[(num_tiles - 1) & 1], smem_b[(num_tiles - 1) & 1], acc);

Loop count: main loop runs num_tiles - 1 iters. Epilogue computes tile from last iter.

→ Restructured K-loop w/ clear prologue, loop, epilogue.

If err: most common bug → off-by-one buf index or skipped epilogue. Verify: prologue → buf[0], first iter compute buf[0] + load buf[1], second compute buf[1] + load buf[0], etc. Epilogue → buf[(num_tiles - 1) & 1].

Step 4: Implement Double-Buffer

Declare double-buffered smem + load fns.

Single → double:

// Before (single buffer)
__shared__ half smem_a[BM * BK];
__shared__ half smem_b[BK * BN];

// After (double buffer)
__shared__ half smem_a[2][BM * BK];
__shared__ half smem_b[2][BK * BN];

cp.async load fn (pipeline API):

__device__ void cpasync_load_tile(half* dst_a, half* dst_b,
                                  const half* src_a, const half* src_b,
                                  int k_offset) {
    // Each thread copies its portion (16 bytes = 8 half values per cp.async)
    int tid = threadIdx.x;
    int bytes_per_thread = 16;  // cp.async.cg supports 4, 8, or 16 bytes

    // A tile: BM * BK elements, distributed across BLOCK_SIZE threads
    int elems_a = BM * BK / BLOCK_SIZE;
    for (int i = 0; i < elems_a; i += 8) {
        int idx = tid * elems_a + i;
        __pipeline_memcpy_async(dst_a + idx,
                                src_a + k_offset * BM + idx,
                                bytes_per_thread);
    }

    // B tile: BK * BN elements, distributed similarly
    int elems_b = BK * BN / BLOCK_SIZE;
    for (int i = 0; i < elems_b; i += 8) {
        int idx = tid * elems_b + i;
        __pipeline_memcpy_async(dst_b + idx,
                                src_b + k_offset * BN + idx,
                                bytes_per_thread);
    }
}

LDG variant: reg staging arrays + store fns:

// Declare register staging (size = elements per thread)
half reg_a[BM * BK / BLOCK_SIZE];
half reg_b[BK * BN / BLOCK_SIZE];

// Prefetch: LDG from global to registers (non-blocking, issued early)
for (int i = 0; i < BM * BK / BLOCK_SIZE; i++) {
    int idx = threadIdx.x * (BM * BK / BLOCK_SIZE) + i;
    reg_a[i] = global_a[k_offset * BM + idx];
}
// ... similarly for reg_b

// Store: STS from registers to shared memory (after __syncthreads)
for (int i = 0; i < BM * BK / BLOCK_SIZE; i++) {
    int idx = threadIdx.x * (BM * BK / BLOCK_SIZE) + i;
    smem_a[next_buf][idx] = reg_a[i];
}

Keep __launch_bounds__(BLOCK_SIZE) → accurate occupancy info to compiler.
Compile: nvcc --cubin -arch=sm_86 -O2 -o kernel.sm_86.cubin kernel.cu.

→ Compilable kernel, double-buffered smem, chosen load mech. Cubin gen no errors.

If err: pipeline API fail → #include <cuda_pipeline.h> + CUDA >= 11.0. Reg spills (nvcc --resource-usage) → shrink reg staging via larger BLOCK_SIZE or smaller BK.

Step 5: Verify Correctness

Pipelined kernel vs CPU ref → identical numerical out.

Compile bench: nvcc -arch=sm_86 -O2 -o bench bench.cu -lcuda -I../../phase2/common.
Small problem first (512x512x512) → catch index bugs before scale.
Tolerance per dtype:
- INT8 Tensor Core (IMMA): abs=0.5, rel=0.1
- FP16 Tensor Core (HMMA): abs=1e-2, rel=1e-2
- FP32 scalar (FFMA): abs=1e-3, rel=1e-3
Pipelining doesn't change arithmetic — reorders loads. Fail → bug in buf index, not compute.
Test target size (e.g., 4096x4096x4096) → verify boundary handling.

→ PASS at small + target sizes, error bounds = non-pipelined baseline.

If err: buf index bug = top suspect. Verify: compute reads buf[tile & 1], loads write buf[1 - (tile & 1)]. Epilogue uses (num_tiles - 1) & 1, not num_tiles & 1. cp.async → __pipeline_wait_prior(0) before __syncthreads(), else compute reads partial.

Step 6: Benchmark + Compare

Pipelined vs non-pipelined baseline at target size.

Run baseline → record GFLOPS or bandwidth.
Run each pipelined variant → same metric.
Speedup: speedup = pipelined_metric / baseline_metric.
Expected gains by ratio (GA104):
- Low (<5:1): +15-35% from cp.async (IGEMM: LDG +18%, cp.async +35% at 4096x4096x4096).
- Med (5-20:1): +5-15%.
- High (>20:1): 0-5% or regress.
Both impl → pick faster for prod.

| Variant          | GFLOPS | Speedup vs Baseline |
|------------------|--------|---------------------|
| Baseline         | XXX    | 1.00x               |
| LDG-register     | XXX    | X.XXx               |
| cp.async (LDGSTS)| XXX    | X.XXx               |

→ Perf table showing improvement. Chosen variant → measurable speedup matching ratio prediction.

If err: regress → check 3: (1) SASS for unexpected overhead (extra BAR.SYNC, reg spills). (2) Smem didn't cross cliff — nvcc --resource-usage or cuobjdump -res-usage. (3) Enough tiles (K / BK >= 4) → amortize prologue/epilogue.

Step 7: Verify SASS Overlap

Inspect SASS → global loads + Tensor Core overlap in main loop.

Disassemble: cuobjdump -sass kernel.sm_86.cubin | grep -E 'IMMA|HMMA|LDGSTS|LDG|BAR'.
Main loop pattern:
- LDGSTS or LDG before HMMA or IMMA.
- No BAR.SYNC between loads + compute (must overlap in warp scheduler).
- BAR.SYNC after compute → gates next iter's use of loaded data.
Stall codes on HMMA/IMMA: S08 HMMA pipeline delay = expected. S01-S04 IMMA = normal. LDG/LDGSTS stalls low (S01) → scheduler switches to compute while loads in flight.
Count HMMA/IMMA per iter → should match non-pipelined (pipelining ≠ compute volume change).

# Full SASS pipeline verification
cuobjdump -sass kernel.sm_86.cubin | grep -E 'IMMA|HMMA|LDGSTS|LDG|BAR'

# Count compute instructions per loop
cuobjdump -sass kernel.sm_86.cubin | grep -c 'HMMA\|IMMA'

# Check for register spills
nvcc --resource-usage --cubin -arch=sm_86 -O2 kernel.cu 2>&1 | grep -i spill

→ SASS shows load-before-compute, no intervening barriers. Zero reg spills.

If err: compiler reordered loads after compute (overlap defeated) → (1) #pragma unroll 1 on main loop → no over-aggressive unroll. (2) Split load + compute into distinct inline fns → sequencing hint. (3) asm volatile("" ::: "memory") as compiler fence (last resort, may inhibit other opts).

Check

Double-buffer smem under arch cliff (GA104: 50 KB/block)
Both buffers alternate (buf[tile & 1])
Prologue → tile 0 in buf[0]
Epilogue → compute buf[(num_tiles - 1) & 1]
Correctness PASS vs CPU ref at small + target
SASS confirms overlap (no BAR.SYNC between LDGSTS/LDG + IMMA/HMMA)
Perf > non-pipelined baseline
No reg spill (LDG variant) — check nvcc --resource-usage

Traps

Cross smem cliff via doubling — GA104 cliff 50 KB/block, not 64. Always calc smem_doubled before impl. 28 KB single → 56 KB doubled crosses cliff, halves occupancy. +20% pipelining gain → -50% occupancy regress.
Skip epilogue compute — Last tile loaded in final iter needs own compute outside loop. Without → last BK cols of K silently dropped → incorrect results, may look like small numerical noise not obvious fail.
Buf index off-by-one — Use buf[tile & 1] for compute, buf[1 - (tile & 1)] for next load. Common err: buf[(tile + 1) & 1] for next = same as buf[1 - (tile & 1)] only when 2 buffers — wrong if applied to compute index.
cp.async commit/wait order — __pipeline_commit() BEFORE compute (seals async batch). __pipeline_wait_prior(0) AFTER compute (blocks until copies done). Swap → async becomes synchronous, kills overlap.
Missing __syncthreads — LDG variant: __syncthreads() between compute + STS drain (compute finishes reading current buf before overwrite). Another after STS drain (all threads done writing before next iter reads). cp.async: __syncthreads() after __pipeline_wait_prior(0) → all threads see completed copies.
Boundary in cp.async — __pipeline_memcpy_async needs valid + aligned src. Matrix edges where K not multiple of BK → last tile reads OOB. Fall back to scalar loads w/ bounds check for final, or pad inputs to BK multiple.

→

analyze-kernel-bottleneck — identify mem-bound, calc compute/load ratio for variant pick

GitHub 저장소

pjt222/agent-almanac

경로: i18n/caveman-ultra/skills/pipeline-gpu-kernel

agentsagentskillsai-assisted-developmentclaude-codeskillsteams

FAQ

Frequently asked questions

What is the pipeline-gpu-kernel skill?

pipeline-gpu-kernel is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform pipeline-gpu-kernel-related tasks without extra prompting.

How do I install pipeline-gpu-kernel?

Use the install commands on this page: add pipeline-gpu-kernel to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does pipeline-gpu-kernel belong to?

pipeline-gpu-kernel is in the Other category, tagged api.

Is pipeline-gpu-kernel free to use?

Yes. pipeline-gpu-kernel is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.

연관 스킬

llamaguard

기타

LlamaGuard는 폭력 및 혐오 발언 등 6가지 안전 범주에서 LLM 입력과 출력을 조정하기 위한 Meta의 70-80억 파라미터 모델입니다. 94-95% 정확도를 제공하며 vLLM, Hugging Face 또는 Amazon SageMaker를 사용해 배포할 수 있습니다. 이 기술을 사용하여 AI 애플리케이션에 콘텐츠 필터링 및 안전 가드레일을 손쉽게 통합하세요.

스킬 보기

cost-optimization

기타

이 Claude Skill은 리소스 적정화, 태깅 전략, 지출 분석을 통해 개발자들이 클라우드 비용을 최적화할 수 있도록 지원합니다. AWS, Azure, GCP에서 클라우드 비용을 절감하고 비용 거버넌스를 구현하기 위한 프레임워크를 제공합니다. 인프라 비용을 분석하거나, 리소스를 적정화하거나, 예산 제약을 충족해야 할 때 사용하세요.

스킬 보기

sports-betting-analyzer

기타

이 Claude Skill은 스프레드, 오버/언더, 프로프 베트를 포함한 스포츠 베팅 시장을 분석합니다. 역사적 추이와 상황별 통계를 검토하여 가치 베트를 발견하고, 교육적 목적으로 실행 가능한 권장 사항이 담긴 구조화된 마크다운 결과를 제공합니다. 개발자는 이 기능을 스포츠 베팅 분석 도구에 활용할 수 있으며, 단순히 엔터테인먼트/교육 목적으로만 설계되었음을 유의해야 합니다.

스킬 보기

quantizing-models-bitsandbytes

기타

이 스킬은 bitsandbytes를 사용하여 LLM을 8비트 또는 4비트 정밀도로 양자화하며, 최소한의 정확도 손실로 50-75%의 메모리 감소를 달성합니다. 제한된 GPU 메모리에서 더 큰 모델을 실행하거나 추론을 가속화하는 데 이상적이며, INT8, NF4, FP4와 같은 형식을 지원합니다. 이 스킬은 HuggingFace Transformers와 통합되어 QLoRA 학습 및 8비트 옵티마이저를 가능하게 합니다.

스킬 보기