cloudflare-workers-ai
About
This skill provides comprehensive guidance for implementing AI inference using Cloudflare Workers AI. It covers running LLMs, generating text/images, configuring AI bindings, streaming responses, and integrating with AI Gateway and RAG systems. Use it when working with serverless AI models or troubleshooting common Workers AI errors.
Quick Install
Claude Code
Recommended: /plugin add https://github.com/jezweb/claude-skills
Alternative: git clone https://github.com/jezweb/claude-skills.git ~/.claude/skills/cloudflare-workers-ai
Copy and paste a command into Claude Code to install this skill.
Documentation
Cloudflare Workers AI - Complete Reference
Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.
Status: Production Ready ✅
Last Updated: 2025-10-21
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler, @cloudflare/workers-types
Table of Contents
- Quick Start (5 minutes)
- Workers AI API Reference
- Model Selection Guide
- Common Patterns
- AI Gateway Integration
- Rate Limits & Pricing
- Production Checklist
Quick Start (5 minutes)
1. Add AI Binding
wrangler.jsonc:
{
"ai": {
"binding": "AI"
}
}
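Tip: the Ai binding type used below ships with @cloudflare/workers-types; alternatively, run npx wrangler types to generate an Env interface from your wrangler config.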
2. Run Your First Model
export interface Env {
AI: Ai;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
prompt: 'What is Cloudflare?',
});
return Response.json(response);
},
};
3. Add Streaming (Recommended)
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: 'Tell me a story' }],
stream: true, // Always use streaming for text generation!
});
return new Response(stream, {
headers: { 'content-type': 'text/event-stream' },
});
Why streaming?
- Prevents buffering large responses in memory
- Faster time-to-first-token
- Better user experience for long-form content
- Avoids Worker timeout issues
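Workers AI streams server-sent events of the form data: {"response":"..."} terminated by data: [DONE]. A minimal sketch of consuming the stream from the browser (the /chat path is a hypothetical endpoint):

// Browser-side consumer for the SSE stream returned above ('/chat' is hypothetical)
const res = await fetch('/chat', { method: 'POST' });
const reader = res.body!.pipeThrough(new TextDecoderStream()).getReader();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // Naive line split; a production parser should buffer partial lines across chunks
  for (const line of value.split('\n')) {
    if (!line.startsWith('data: ') || line.includes('[DONE]')) continue;
    const { response } = JSON.parse(line.slice('data: '.length));
    document.body.append(response);
  }
}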
Workers AI API Reference
env.AI.run()
Run an AI model inference.
Signature:
async env.AI.run(
model: string,
inputs: ModelInputs,
options?: { gateway?: { id: string; skipCache?: boolean } }
): Promise<ModelOutput | ReadableStream>
Parameters:
- model (string, required) - Model ID (e.g., @cf/meta/llama-3.1-8b-instruct)
- inputs (object, required) - Model-specific inputs
- options (object, optional) - Additional options
  - gateway (object) - AI Gateway configuration
    - id (string) - Gateway ID
    - skipCache (boolean) - Skip AI Gateway cache
Returns:
- Non-streaming: Promise&lt;ModelOutput&gt; - JSON response
- Streaming: ReadableStream - Server-sent events stream
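Because the return type depends on stream, you may need to narrow it at runtime. A sketch assuming a text-generation model:

// Narrow the union return type of env.AI.run()
async function handle(env: Env, wantStream: boolean): Promise<Response> {
  const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    prompt: 'Hello',
    stream: wantStream,
  });
  if (result instanceof ReadableStream) {
    // Streaming: forward the SSE stream to the client
    return new Response(result, { headers: { 'content-type': 'text/event-stream' } });
  }
  // Non-streaming: JSON object with a response field
  return Response.json(result);
}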
Text Generation Models
Input Format:
{
messages?: Array<{ role: 'system' | 'user' | 'assistant'; content: string }>;
prompt?: string; // Deprecated, use messages
stream?: boolean; // Default: false
max_tokens?: number; // Max tokens to generate
temperature?: number; // 0.0-1.0, default varies by model
top_p?: number; // 0.0-1.0
top_k?: number;
}
Output Format (Non-Streaming):
{
response: string; // Generated text
}
Example:
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is TypeScript?' },
],
stream: false,
});
console.log(response.response);
Text Embeddings Models
Input Format:
{
text: string | string[]; // Single text or array of texts
}
Output Format:
{
shape: number[]; // [batch_size, embedding_dimensions]
data: number[][]; // Array of embedding vectors
}
Example:
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: ['Hello world', 'Cloudflare Workers'],
});
console.log(embeddings.shape); // [2, 768]
console.log(embeddings.data[0]); // [0.123, -0.456, ...]
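Embedding vectors are usually compared with cosine similarity. A small helper using only the vectors returned above:

// Cosine similarity between two equal-length embedding vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const score = cosineSimilarity(embeddings.data[0], embeddings.data[1]);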
Image Generation Models
Input Format:
{
prompt: string; // Text description
num_steps?: number; // Default: 20
guidance?: number; // CFG scale, default: 7.5
strength?: number; // For img2img, default: 1.0
image?: number[][]; // For img2img (base64 or array)
}
Output Format:
- Stable Diffusion models: binary image data (PNG)
- flux-1-schnell: JSON with a base64-encoded image field
Example:
const result = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'A beautiful sunset over mountains',
});
// flux-1-schnell returns { image: '<base64>' }; decode before responding
const bytes = Uint8Array.from(atob(result.image), (ch) => ch.charCodeAt(0));
return new Response(bytes, {
  headers: { 'content-type': 'image/png' },
});
Vision Models
Input Format:
{
messages: Array<{
role: 'user' | 'assistant';
content: Array<{ type: 'text' | 'image_url'; text?: string; image_url?: { url: string } }>;
}>;
}
Example:
const response = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'What is in this image?' },
{ type: 'image_url', image_url: { url: 'data:image/png;base64,iVBOR...' } },
],
},
],
});
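If the image arrives as an upload rather than a URL, you can build the data URL yourself. A sketch using standard Workers APIs inside a fetch handler (the "image" form field name is an assumption):

// Build a base64 data URL from an uploaded file (form field name is illustrative)
const form = await request.formData();
const file = form.get('image') as File;
const bytes = new Uint8Array(await file.arrayBuffer());
let binary = '';
for (const byte of bytes) binary += String.fromCharCode(byte); // chunk this loop for very large files
const dataUrl = `data:${file.type};base64,${btoa(binary)}`;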
Model Selection Guide
Text Generation (LLMs)
| Model | Best For | Rate Limit | Size |
|---|---|---|---|
| @cf/meta/llama-3.1-8b-instruct | General purpose, fast | 300/min | 8B |
| @cf/meta/llama-3.2-1b-instruct | Ultra-fast, simple tasks | 300/min | 1B |
| @cf/qwen/qwen1.5-14b-chat-awq | High quality, complex reasoning | 150/min | 14B |
| @cf/deepseek-ai/deepseek-r1-distill-qwen-32b | Coding, technical content | 300/min | 32B |
| @hf/thebloke/mistral-7b-instruct-v0.1-awq | Fast, efficient | 400/min | 7B |
Text Embeddings
| Model | Dimensions | Best For | Rate Limit |
|---|---|---|---|
| @cf/baai/bge-base-en-v1.5 | 768 | General purpose RAG | 3000/min |
| @cf/baai/bge-large-en-v1.5 | 1024 | High accuracy search | 1500/min |
| @cf/baai/bge-small-en-v1.5 | 384 | Fast, low storage | 3000/min |
Image Generation
| Model | Best For | Rate Limit | Speed |
|---|---|---|---|
| @cf/black-forest-labs/flux-1-schnell | High quality, photorealistic | 720/min | Fast |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | General purpose | 720/min | Medium |
| @cf/lykon/dreamshaper-8-lcm | Artistic, stylized | 720/min | Fast |
Vision Models
| Model | Best For | Rate Limit |
|---|---|---|
| @cf/meta/llama-3.2-11b-vision-instruct | Image understanding | 720/min |
| @cf/unum/uform-gen2-qwen-500m | Fast image captioning | 720/min |
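One way to encode these tables in code is a small routing map; the task names here are illustrative, not an official taxonomy:

// Illustrative task-to-model routing derived from the tables above
type Task = 'chat' | 'fast-chat' | 'code' | 'embed' | 'image' | 'vision';

const MODEL_FOR_TASK: Record<Task, string> = {
  'chat': '@cf/meta/llama-3.1-8b-instruct',
  'fast-chat': '@cf/meta/llama-3.2-1b-instruct',
  'code': '@cf/deepseek-ai/deepseek-r1-distill-qwen-32b',
  'embed': '@cf/baai/bge-base-en-v1.5',
  'image': '@cf/black-forest-labs/flux-1-schnell',
  'vision': '@cf/meta/llama-3.2-11b-vision-instruct',
};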
Common Patterns
Pattern 1: Chat Completion with History
import { Hono } from 'hono';

const app = new Hono<{ Bindings: Env }>();

app.post('/chat', async (c) => {
const { messages } = await c.req.json<{
messages: Array<{ role: string; content: string }>;
}>();
const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages,
stream: true,
});
return new Response(response, {
headers: { 'content-type': 'text/event-stream' },
});
});
Pattern 2: RAG (Retrieval Augmented Generation)
// Step 1: Generate embeddings
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: [userQuery],
});
const vector = embeddings.data[0];
// Step 2: Search Vectorize
const matches = await env.VECTORIZE.query(vector, { topK: 3 });
// Step 3: Build context from matches
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');
// Step 4: Generate response with context
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{
role: 'system',
content: `Answer using this context:\n${context}`,
},
{ role: 'user', content: userQuery },
],
stream: true,
});
return new Response(response, {
headers: { 'content-type': 'text/event-stream' },
});
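The retrieval step assumes documents were indexed beforehand. A sketch of that indexing step, assuming the same VECTORIZE binding and a text metadata field:

// Index documents: embed each chunk, then upsert into Vectorize
const docs = ['Cloudflare Workers run at the edge.', 'Workers AI runs models serverlessly.'];
const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: docs });
await env.VECTORIZE.upsert(
  docs.map((text, i) => ({
    id: `doc-${i}`,
    values: data[i],
    metadata: { text }, // read back as m.metadata.text at query time
  }))
);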
Pattern 3: Structured Output with Zod
import { z } from 'zod';
const RecipeSchema = z.object({
name: z.string(),
ingredients: z.array(z.string()),
instructions: z.array(z.string()),
prepTime: z.number(),
});
app.post('/recipe', async (c) => {
const { dish } = await c.req.json<{ dish: string }>();
const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{
role: 'user',
content: `Generate a recipe for ${dish}. Return ONLY valid JSON matching this schema: ${JSON.stringify(RecipeSchema.shape)}`,
},
],
});
// Parse and validate (JSON.stringify(RecipeSchema.shape) above is only a rough
// hint for the model; a library like zod-to-json-schema produces a proper JSON Schema)
const recipe = RecipeSchema.parse(JSON.parse(response.response));
return c.json(recipe);
});
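Models do not always return valid JSON, so a hedged variant wraps parsing in safeParse with one retry (same RecipeSchema as above):

// Retry once if the model returns malformed or non-conforming JSON
async function generateRecipe(env: Env, dish: string) {
  for (let attempt = 0; attempt < 2; attempt++) {
    const res = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [{ role: 'user', content: `Generate a recipe for ${dish}. Return ONLY valid JSON.` }],
    });
    try {
      const parsed = RecipeSchema.safeParse(JSON.parse(res.response));
      if (parsed.success) return parsed.data;
    } catch {
      // JSON.parse threw; fall through to retry
    }
  }
  throw new Error('Model did not return valid JSON');
}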
Pattern 4: Image Generation + R2 Storage
app.post('/generate-image', async (c) => {
const { prompt } = await c.req.json<{ prompt: string }>();
// Generate image (flux-1-schnell returns JSON with a base64-encoded image)
const result = await c.env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt,
});
const imageBytes = Uint8Array.from(atob(result.image), (ch) => ch.charCodeAt(0));
// Store in R2
const key = `images/${Date.now()}.png`;
await c.env.BUCKET.put(key, imageBytes, {
httpMetadata: { contentType: 'image/png' },
});
return c.json({
success: true,
url: `https://your-domain.com/${key}`,
});
});
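A matching read path for the stored object, assuming the same BUCKET binding:

// Serve stored images back out of R2
app.get('/images/:name', async (c) => {
  const object = await c.env.BUCKET.get(`images/${c.req.param('name')}`);
  if (!object) return c.notFound();
  return new Response(object.body, {
    headers: { 'content-type': object.httpMetadata?.contentType ?? 'image/png' },
  });
});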
AI Gateway Integration
AI Gateway provides caching, logging, and analytics for AI requests.
Setup:
const response = await env.AI.run(
'@cf/meta/llama-3.1-8b-instruct',
{ prompt: 'Hello' },
{
gateway: {
id: 'my-gateway', // Your gateway ID
skipCache: false, // Use cache
},
}
);
Benefits:
- ✅ Cost Tracking - Monitor neurons usage per request
- ✅ Caching - Reduce duplicate inference costs
- ✅ Logging - Debug and analyze AI requests
- ✅ Rate Limiting - Additional layer of protection
- ✅ Analytics - Request patterns and performance
Access Gateway Logs:
const gateway = env.AI.gateway('my-gateway');
const logId = env.AI.aiGatewayLogId;
// Send feedback (feedback: 1 = positive, -1 = negative)
if (logId) {
  await gateway.patchLog(logId, {
    feedback: 1,
    metadata: { comment: 'Great response' },
  });
}
Rate Limits & Pricing
Rate Limits (per minute)
| Task Type | Default Limit | Notes |
|---|---|---|
| Text Generation | 300/min | Some fast models: 400-1500/min |
| Text Embeddings | 3000/min | BGE-large: 1500/min |
| Image Generation | 720/min | All image models |
| Vision Models | 720/min | Image understanding |
| Translation | 720/min | M2M100, Opus MT |
| Classification | 2000/min | Text classification |
| Speech Recognition | 720/min | Whisper models |
Pricing (Neurons-Based)
Free Tier:
- 10,000 neurons per day
- Resets daily at 00:00 UTC
Paid Tier:
- $0.011 per 1,000 neurons
- 10,000 neurons/day included
- Unlimited usage above free allocation
Example Costs:
| Model | Input (1M tokens) | Output (1M tokens) |
|---|---|---|
| Llama 3.2 1B | $0.027 | $0.201 |
| Llama 3.1 8B | $0.088 | $0.606 |
| BGE-base embeddings | $0.005 | N/A |
| Flux image generation | ~$0.011/image | N/A |
Production Checklist
Before Deploying
- Enable AI Gateway for cost tracking and logging
- Implement streaming for all text generation endpoints
- Add rate limit retry with exponential backoff
- Validate input length to prevent token limit errors (see the sketch after this list)
- Set appropriate timeouts (Workers: 30s CPU default, 5m max)
- Monitor neurons usage in Cloudflare dashboard
- Test error handling for model unavailable, rate limits
- Add input sanitization to prevent prompt injection
- Configure CORS if using from browser
- Plan for scale - upgrade to Paid plan if needed
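A minimal sketch of the input-length check from the list above; the 4-characters-per-token ratio is a rough heuristic, not an official tokenizer:

// Rough guard against oversized prompts (~4 chars per token is a heuristic)
const MAX_INPUT_TOKENS = 6000; // assumed budget; tune per model's context window
function assertWithinBudget(text: string): void {
  const approxTokens = Math.ceil(text.length / 4);
  if (approxTokens > MAX_INPUT_TOKENS) {
    throw new Error(`Input too long: ~${approxTokens} tokens (max ${MAX_INPUT_TOKENS})`);
  }
}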
Error Handling
async function runAIWithRetry(
env: Env,
model: string,
inputs: any,
maxRetries = 3
): Promise<any> {
let lastError: Error;
for (let i = 0; i < maxRetries; i++) {
try {
return await env.AI.run(model, inputs);
} catch (error) {
lastError = error as Error;
const message = lastError.message.toLowerCase();
// Rate limit - retry with backoff
if (message.includes('429') || message.includes('rate limit')) {
const delay = Math.pow(2, i) * 1000; // Exponential backoff
await new Promise((resolve) => setTimeout(resolve, delay));
continue;
}
// Other errors - throw immediately
throw error;
}
}
throw lastError!;
}
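Usage mirrors env.AI.run directly:

const result = await runAIWithRetry(env, '@cf/meta/llama-3.1-8b-instruct', {
  prompt: 'Hello',
});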
Monitoring
app.use('*', async (c, next) => {
const start = Date.now();
await next();
// Log AI usage
console.log({
path: c.req.path,
duration: Date.now() - start,
logId: c.env.AI.aiGatewayLogId,
});
});
OpenAI Compatibility
Workers AI supports OpenAI-compatible endpoints.
Using OpenAI SDK:
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: env.CLOUDFLARE_API_KEY,
baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});
// Chat completions
const completion = await openai.chat.completions.create({
model: '@cf/meta/llama-3.1-8b-instruct',
messages: [{ role: 'user', content: 'Hello!' }],
});
// Embeddings
const embeddings = await openai.embeddings.create({
model: '@cf/baai/bge-base-en-v1.5',
input: 'Hello world',
});
Endpoints:
- /v1/chat/completions - Text generation
- /v1/embeddings - Text embeddings
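The same endpoints also work with a plain fetch if you prefer to skip the SDK (environment variable names match the SDK example above):

// Direct REST call to the OpenAI-compatible chat endpoint
const res = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1/chat/completions`,
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${env.CLOUDFLARE_API_KEY}`,
      'content-type': 'application/json',
    },
    body: JSON.stringify({
      model: '@cf/meta/llama-3.1-8b-instruct',
      messages: [{ role: 'user', content: 'Hello!' }],
    }),
  }
);
const completion = await res.json();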
Vercel AI SDK Integration
npm install workers-ai-provider ai
import { createWorkersAI } from 'workers-ai-provider';
import { generateText, streamText } from 'ai';
const workersai = createWorkersAI({ binding: env.AI });
// Generate text
const result = await generateText({
model: workersai('@cf/meta/llama-3.1-8b-instruct'),
prompt: 'Write a poem',
});
// Stream text
const stream = streamText({
model: workersai('@cf/meta/llama-3.1-8b-instruct'),
prompt: 'Tell me a story',
});
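The result of streamText exposes tokens as an async iterable; in a Worker you can also hand it straight back as a Response (method names per the Vercel AI SDK, worth verifying against your installed version):

// Consume the stream incrementally...
for await (const chunk of stream.textStream) {
  console.log(chunk);
}
// ...or return it directly from a fetch handler
return stream.toTextStreamResponse();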
Limits Summary
| Feature | Limit |
|---|---|
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers timeout applies (30s default, 5m max CPU) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |
References
GitHub Repository: https://github.com/jezweb/claude-skills
Related Skills
sglang (Meta)
SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
evaluating-llms-harness (Testing)
This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.
llamaguard (Other)
LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.
langchain (Meta)
LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
