MCP Hub

tensorrt-llm

zechenzhangAGI
Updated 2 days ago

About

TensorRT-LLM is an NVIDIA library that optimizes large language model (LLM) inference on NVIDIA GPUs for maximum throughput and minimum latency. It supports features such as quantization and multi-GPU scaling, making it a strong fit for production deployments that need 10-100x the performance of PyTorch. Use it when you need the best possible performance on NVIDIA hardware; for a simpler setup choose vLLM, and for CPU or Apple Silicon choose an alternative such as llama.cpp.

Quick Install

Claude Code

Plugin command (recommended)
/plugin add https://github.com/zechenzhangAGI/AI-research-SKILLs
Git clone (alternative)
git clone https://github.com/zechenzhangAGI/AI-research-SKILLs.git ~/.claude/skills/tensorrt-llm

Copy and paste this command into Claude Code to install the skill.

Documentation

TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

When to use TensorRT-LLM

Use TensorRT-LLM when:

  • Deploying on NVIDIA GPUs (A100, H100, GB200)
  • Need maximum throughput (24,000+ tokens/sec on Llama 3)
  • Require low latency for real-time applications
  • Working with quantized models (FP8, INT4, FP4)
  • Scaling across multiple GPUs or nodes

Use vLLM instead when:

  • Need simpler setup and Python-first API
  • Want PagedAttention without TensorRT compilation
  • Working with AMD GPUs or non-NVIDIA hardware

Use llama.cpp instead when:

  • Deploying on CPU or Apple Silicon
  • Need edge deployment without NVIDIA GPUs
  • Want simpler GGUF quantization format

Quick start

Installation

# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
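After installing, a quick sanity check is to import the package and print its version; this assumes the package exposes __version__ (recent releases do), and the import fails early if the CUDA/TensorRT dependencies are missing:

# Minimal import check; fails fast if CUDA/TensorRT libraries cannot be loaded
import tensorrt_llm
print(tensorrt_llm.__version__)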

Basic inference

from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Generated completions are on output.outputs; take the first candidate
    print(output.outputs[0].text)

Serving with trtllm-serve

# Start server (automatic model download and compilation)
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
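The server exposes an OpenAI-compatible API (as the curl call above shows), so the same request can be issued from Python with the openai client. A minimal sketch, assuming the server is reachable at localhost:8000 and does not validate the API key:

from openai import OpenAI

# Placeholder key: assumed to be ignored by the local server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)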

Key features

Performance optimizations

  • In-flight batching: Dynamic batching during generation
  • Paged KV cache: Efficient memory management (see the configuration sketch after this list)
  • Flash Attention: Optimized attention kernels
  • Quantization: FP8, INT4, FP4 for 2-4× faster inference
  • CUDA graphs: Reduced kernel launch overhead
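Several of these options are tunable through the LLM constructor. A minimal sketch of capping the paged KV cache, assuming the KvCacheConfig helper in tensorrt_llm.llmapi and its free_gpu_memory_fraction / enable_block_reuse fields (verify the names against the current API reference):

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig  # assumed import path

# Cap the paged KV cache at 80% of free GPU memory and reuse cached blocks
kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.8,
    enable_block_reuse=True,
)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_config=kv_cache_config,
)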

Parallelism

  • Tensor parallelism (TP): Split model across GPUs
  • Pipeline parallelism (PP): Layer-wise distribution
  • Expert parallelism: For Mixture-of-Experts models
  • Multi-node: Scale beyond single machine

Advanced features

  • Speculative decoding: Faster generation with draft models
  • LoRA serving: Efficient multi-adapter deployment
  • Disaggregated serving: Separate prefill and generation

Common patterns

Quantized model (FP8)

from tensorrt_llm import LLM

# Load FP8 quantized model (2× faster, 50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])

Multi-GPU deployment

# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8"
)

Batch inference

# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)

# Automatic in-flight batching for maximum throughput
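To see what throughput such a batch actually achieves on your hardware, here is a rough timing sketch; it assumes each completion exposes token_ids on the output objects, mirroring the .outputs[0].text access used above:

import time

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B")
prompts = ["Explain quantum computing"] * 64  # repeated prompt, only for a rough measurement

start = time.perf_counter()
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
elapsed = time.perf_counter() - start

# Total generated tokens across the batch divided by wall-clock time
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} tokens/sec")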

Performance benchmarks

Meta Llama 3-8B (H100 GPU):

  • Throughput: 24,000 tokens/sec
  • Latency: ~10ms per token
  • vs PyTorch: 100× faster

Llama 3-70B (8× A100 80GB):

  • FP8 quantization: 2× faster than FP16
  • Memory: 50% reduction with FP8

Supported models

  • LLaMA family: Llama 2, Llama 3, CodeLlama
  • GPT family: GPT-2, GPT-J, GPT-NeoX
  • Qwen: Qwen, Qwen2, QwQ
  • DeepSeek: DeepSeek-V2, DeepSeek-V3
  • Mixtral: Mixtral-8x7B, Mixtral-8x22B
  • Vision: LLaVA, Phi-3-vision
  • 100+ models on HuggingFace

Resources

GitHub repository

zechenzhangAGI/AI-research-SKILLs
Path: 12-inference-serving/tensorrt-llm
Tags: ai, ai-research, claude, claude-code, claude-skills, codex

Related skills

evaluating-llms-harness

Testing

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.


sglang

Meta

SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.


cloudflare-turnstile

Meta

This skill provides comprehensive guidance for implementing Cloudflare Turnstile as a CAPTCHA-alternative bot protection system. It covers integration for forms, login pages, API endpoints, and frameworks like React/Next.js/Hono, while handling invisible challenges that maintain user experience. Use it when migrating from reCAPTCHA, debugging error codes, or implementing token validation and E2E tests.


langchain

Meta

LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
