serving-llms-vllm
About
This Claude Skill deploys high-throughput LLM APIs using vLLM's PagedAttention and continuous batching, ideal for production serving with optimized latency and GPU memory efficiency. It provides OpenAI-compatible endpoints while supporting quantization methods and tensor parallelism. Use it when you need to serve models with limited GPU resources or maximize inference throughput.
Quick Install
Claude Code
Recommended/plugin add https://github.com/davila7/claude-code-templatesgit clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/serving-llms-vllmCopy and paste this command in Claude Code to install this skill
Documentation
vLLM - High-Performance LLM Serving
Quick start
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
Installation:
pip install vllm
Basic offline inference:
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
OpenAI-compatible server:
vllm serve meta-llama/Llama-3-8B-Instruct
# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
model='meta-llama/Llama-3-8B-Instruct',
messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
Common workflows
Workflow 1: Production API deployment
Copy this checklist and track progress:
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
Step 1: Configure server settings
Choose configuration based on your model size:
# For 7B-13B models on single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--port 8000
# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--quantization awq \
--port 8000
# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--enable-metrics \
--metrics-port 9090 \
--port 8000 \
--host 0.0.0.0
Step 2: Test with limited traffic
Run load test before production:
# Install load testing tool
pip install locust
# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
Step 3: Enable monitoring
vLLM exposes Prometheus metrics on port 9090:
curl http://localhost:9090/metrics | grep vllm
Key metrics to monitor:
vllm:time_to_first_token_seconds- Latencyvllm:num_requests_running- Active requestsvllm:gpu_cache_usage_perc- KV cache utilization
Step 4: Deploy to production
Use Docker for consistent deployment:
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching
Step 5: Verify performance metrics
Check that deployment meets targets:
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs
Workflow 2: Offline batch inference
For processing large datasets without server overhead.
Copy this checklist:
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
Step 1: Prepare input data
# Load prompts from file
prompts = []
with open("prompts.txt") as f:
prompts = [line.strip() for line in f]
print(f"Loaded {len(prompts)} prompts")
Step 2: Configure LLM engine
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3-8B-Instruct",
tensor_parallel_size=2, # Use 2 GPUs
gpu_memory_utilization=0.9,
max_model_len=4096
)
sampling = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=512,
stop=["</s>", "\n\n"]
)
Step 3: Run batch inference
vLLM automatically batches requests for efficiency:
# Process all prompts in one call
outputs = llm.generate(prompts, sampling)
# vLLM handles batching internally
# No need to manually chunk prompts
Step 4: Process results
# Extract generated text
results = []
for output in outputs:
prompt = output.prompt
generated = output.outputs[0].text
results.append({
"prompt": prompt,
"generated": generated,
"tokens": len(output.outputs[0].token_ids)
})
# Save to file
import json
with open("results.jsonl", "w") as f:
for result in results:
f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")
Workflow 3: Quantized model serving
Fit large models in limited GPU memory.
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy
Step 1: Choose quantization method
- AWQ: Best for 70B models, minimal accuracy loss
- GPTQ: Wide model support, good compression
- FP8: Fastest on H100 GPUs
Step 2: Find or create quantized model
Use pre-quantized models from HuggingFace:
# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ
Step 3: Launch with quantization flag
# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95
# Results: 70B model in ~40GB VRAM
Step 4: Verify accuracy
Test outputs match expected quality:
# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged
When to use vs alternatives
Use vLLM when:
- Deploying production LLM APIs (100+ req/sec)
- Serving OpenAI-compatible endpoints
- Limited GPU memory but need large models
- Multi-user applications (chatbots, assistants)
- Need low latency with high throughput
Use alternatives instead:
- llama.cpp: CPU/edge inference, single-user
- HuggingFace transformers: Research, prototyping, one-off generation
- TensorRT-LLM: NVIDIA-only, need absolute maximum performance
- Text-Generation-Inference: Already in HuggingFace ecosystem
Common issues
Issue: Out of memory during model loading
Reduce memory usage:
vllm serve MODEL \
--gpu-memory-utilization 0.7 \
--max-model-len 4096
Or use quantization:
vllm serve MODEL --quantization awq
Issue: Slow first token (TTFT > 1 second)
Enable prefix caching for repeated prompts:
vllm serve MODEL --enable-prefix-caching
For long prompts, enable chunked prefill:
vllm serve MODEL --enable-chunked-prefill
Issue: Model not found error
Use --trust-remote-code for custom models:
vllm serve MODEL --trust-remote-code
Issue: Low throughput (<50 req/sec)
Increase concurrent sequences:
vllm serve MODEL --max-num-seqs 512
Check GPU utilization with nvidia-smi - should be >80%.
Issue: Inference slower than expected
Verify tensor parallelism uses power of 2 GPUs:
vllm serve MODEL --tensor-parallel-size 4 # Not 3
Enable speculative decoding for faster generation:
vllm serve MODEL --speculative-model DRAFT_MODEL
Advanced topics
Server deployment patterns: See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.
Performance optimization: See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.
Quantization guide: See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
Troubleshooting: See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.
Hardware requirements
- Small models (7B-13B): 1x A10 (24GB) or A100 (40GB)
- Medium models (30B-40B): 2x A100 (40GB) with tensor parallelism
- Large models (70B+): 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
Resources
- Official docs: https://docs.vllm.ai
- GitHub: https://github.com/vllm-project/vllm
- Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
- Community: https://discuss.vllm.ai
GitHub Repository
Related Skills
quantizing-models-bitsandbytes
OtherThis skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, reducing memory usage by 50-75% with minimal accuracy loss for GPU-constrained environments. It supports multiple formats (INT8, NF4, FP4) and enables QLoRA training and 8-bit optimizers. Use it with HuggingFace Transformers when you need to fit larger models into limited memory or accelerate inference.
sglang
MetaSGLang is a high-performance LLM serving framework that enables fast structured generation with JSON/regex outputs and constrained decoding. It's ideal for agentic workflows with tool calls and multi-turn conversations, offering significantly faster inference through RadixAttention prefix caching. Use it when you need production-scale performance with shared context across requests.
tensorrt-llm
OtherTensorRT-LLM is an NVIDIA-optimized library for deploying LLMs on NVIDIA GPUs, delivering up to 100x faster inference than PyTorch. Use it for production serving where you need maximum throughput, low latency, and support for features like quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
langchain
MetaLangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers and offers key features like tool calling, memory management, and vector store retrieval. Use it for rapid prototyping or deploying production systems like chatbots, autonomous agents, and question-answering tools.
