cloudflare-workers-ai
About
This skill provides comprehensive guidance for implementing AI inference using Cloudflare Workers AI. It covers running LLMs, generating text/images, configuring AI bindings, streaming responses, and integrating with AI Gateway and RAG systems. Use it when working with serverless AI models or troubleshooting common Workers AI errors.
Quick Install
Claude Code
Recommended: /plugin add https://github.com/jezweb/claude-skills
Alternative: git clone https://github.com/jezweb/claude-skills.git ~/.claude/skills/cloudflare-workers-ai
Copy and paste a command into Claude Code to install this skill.
Documentation
Cloudflare Workers AI - Complete Reference
Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.
Status: Production Ready ✅
Last Updated: 2025-10-21
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler, @cloudflare/workers-types
Table of Contents
- Quick Start (5 minutes)
- Workers AI API Reference
- Model Selection Guide
- Common Patterns
- AI Gateway Integration
- Rate Limits & Pricing
- Production Checklist
Quick Start (5 minutes)
1. Add AI Binding
wrangler.jsonc:
{
"ai": {
"binding": "AI"
}
}
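Tip: the Ai binding type used below ships with @cloudflare/workers-types; alternatively, run npx wrangler types to generate an Env interface from your wrangler config.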
2. Run Your First Model
export interface Env {
AI: Ai;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
prompt: 'What is Cloudflare?',
});
return Response.json(response);
},
};
3. Add Streaming (Recommended)
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: 'Tell me a story' }],
stream: true, // Always use streaming for text generation!
});
return new Response(stream, {
headers: { 'content-type': 'text/event-stream' },
});
Why streaming?
- Prevents buffering large responses in memory
- Faster time-to-first-token
- Better user experience for long-form content
- Avoids Worker timeout issues
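Workers AI streams server-sent events of the form data: {"response":"..."} terminated by data: [DONE]. A minimal sketch of consuming the stream from the browser (the /chat path is a hypothetical endpoint):

// Browser-side consumer for the SSE stream returned above ('/chat' is hypothetical)
const res = await fetch('/chat', { method: 'POST' });
const reader = res.body!.pipeThrough(new TextDecoderStream()).getReader();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // Naive line split; a production parser should buffer partial lines across chunks
  for (const line of value.split('\n')) {
    if (!line.startsWith('data: ') || line.includes('[DONE]')) continue;
    const { response } = JSON.parse(line.slice('data: '.length));
    document.body.append(response);
  }
}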
Workers AI API Reference
env.AI.run()
Run an AI model inference.
Signature:
async env.AI.run(
model: string,
inputs: ModelInputs,
options?: { gateway?: { id: string; skipCache?: boolean } }
): Promise<ModelOutput | ReadableStream>
Parameters:
- model (string, required) - Model ID (e.g., @cf/meta/llama-3.1-8b-instruct)
- inputs (object, required) - Model-specific inputs
- options (object, optional) - Additional options
  - gateway (object) - AI Gateway configuration
    - id (string) - Gateway ID
    - skipCache (boolean) - Skip AI Gateway cache
Returns:
- Non-streaming: Promise&lt;ModelOutput&gt; - JSON response
- Streaming: ReadableStream - Server-sent events stream
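Because the return type depends on stream, you may need to narrow it at runtime. A sketch assuming a text-generation model:

// Narrow the union return type of env.AI.run()
async function handle(env: Env, wantStream: boolean): Promise<Response> {
  const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    prompt: 'Hello',
    stream: wantStream,
  });
  if (result instanceof ReadableStream) {
    // Streaming: forward the SSE stream to the client
    return new Response(result, { headers: { 'content-type': 'text/event-stream' } });
  }
  // Non-streaming: JSON object with a response field
  return Response.json(result);
}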
Text Generation Models
Input Format:
{
messages?: Array<{ role: 'system' | 'user' | 'assistant'; content: string }>;
prompt?: string; // Deprecated, use messages
stream?: boolean; // Default: false
max_tokens?: number; // Max tokens to generate
temperature?: number; // 0.0-1.0, default varies by model
top_p?: number; // 0.0-1.0
top_k?: number;
}
Output Format (Non-Streaming):
{
response: string; // Generated text
}
Example:
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is TypeScript?' },
],
stream: false,
});
console.log(response.response);
Text Embeddings Models
Input Format:
{
text: string | string[]; // Single text or array of texts
}
Output Format:
{
shape: number[]; // [batch_size, embedding_dimensions]
data: number[][]; // Array of embedding vectors
}
Example:
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: ['Hello world', 'Cloudflare Workers'],
});
console.log(embeddings.shape); // [2, 768]
console.log(embeddings.data[0]); // [0.123, -0.456, ...]
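Embedding vectors are usually compared with cosine similarity. A small helper using only the vectors returned above:

// Cosine similarity between two equal-length embedding vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const score = cosineSimilarity(embeddings.data[0], embeddings.data[1]);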
Image Generation Models
Input Format:
{
prompt: string; // Text description
num_steps?: number; // Default: 20
guidance?: number; // CFG scale, default: 7.5
strength?: number; // For img2img, default: 1.0
image?: number[][]; // For img2img (base64 or array)
}
Output Format:
- Stable Diffusion models: binary image data (PNG)
- flux-1-schnell: JSON with a base64-encoded image field
Example:
const result = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'A beautiful sunset over mountains',
});
// flux-1-schnell returns { image: '<base64>' }; decode before responding
const bytes = Uint8Array.from(atob(result.image), (ch) => ch.charCodeAt(0));
return new Response(bytes, {
  headers: { 'content-type': 'image/png' },
});
Vision Models
Input Format:
{
messages: Array<{
role: 'user' | 'assistant';
content: Array<{ type: 'text' | 'image_url'; text?: string; image_url?: { url: string } }>;
}>;
}
Example:
const response = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'What is in this image?' },
{ type: 'image_url', image_url: { url: 'data:image/png;base64,iVBOR...' } },
],
},
],
});
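If the image arrives as an upload rather than a URL, you can build the data URL yourself. A sketch using standard Workers APIs inside a fetch handler (the "image" form field name is an assumption):

// Build a base64 data URL from an uploaded file (form field name is illustrative)
const form = await request.formData();
const file = form.get('image') as File;
const bytes = new Uint8Array(await file.arrayBuffer());
let binary = '';
for (const byte of bytes) binary += String.fromCharCode(byte); // chunk this loop for very large files
const dataUrl = `data:${file.type};base64,${btoa(binary)}`;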
Model Selection Guide
Text Generation (LLMs)
| Model | Best For | Rate Limit | Size |
|---|---|---|---|
| @cf/meta/llama-3.1-8b-instruct | General purpose, fast | 300/min | 8B |
| @cf/meta/llama-3.2-1b-instruct | Ultra-fast, simple tasks | 300/min | 1B |
| @cf/qwen/qwen1.5-14b-chat-awq | High quality, complex reasoning | 150/min | 14B |
| @cf/deepseek-ai/deepseek-r1-distill-qwen-32b | Coding, technical content | 300/min | 32B |
| @hf/thebloke/mistral-7b-instruct-v0.1-awq | Fast, efficient | 400/min | 7B |
Text Embeddings
| Model | Dimensions | Best For | Rate Limit |
|---|---|---|---|
| @cf/baai/bge-base-en-v1.5 | 768 | General purpose RAG | 3000/min |
| @cf/baai/bge-large-en-v1.5 | 1024 | High accuracy search | 1500/min |
| @cf/baai/bge-small-en-v1.5 | 384 | Fast, low storage | 3000/min |
Image Generation
| Model | Best For | Rate Limit | Speed |
|---|---|---|---|
| @cf/black-forest-labs/flux-1-schnell | High quality, photorealistic | 720/min | Fast |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | General purpose | 720/min | Medium |
| @cf/lykon/dreamshaper-8-lcm | Artistic, stylized | 720/min | Fast |
Vision Models
| Model | Best For | Rate Limit |
|---|---|---|
| @cf/meta/llama-3.2-11b-vision-instruct | Image understanding | 720/min |
| @cf/unum/uform-gen2-qwen-500m | Fast image captioning | 720/min |
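One way to encode these tables in code is a small routing map; the task names here are illustrative, not an official taxonomy:

// Illustrative task-to-model routing derived from the tables above
type Task = 'chat' | 'fast-chat' | 'code' | 'embed' | 'image' | 'vision';

const MODEL_FOR_TASK: Record<Task, string> = {
  'chat': '@cf/meta/llama-3.1-8b-instruct',
  'fast-chat': '@cf/meta/llama-3.2-1b-instruct',
  'code': '@cf/deepseek-ai/deepseek-r1-distill-qwen-32b',
  'embed': '@cf/baai/bge-base-en-v1.5',
  'image': '@cf/black-forest-labs/flux-1-schnell',
  'vision': '@cf/meta/llama-3.2-11b-vision-instruct',
};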
Common Patterns
Pattern 1: Chat Completion with History
import { Hono } from 'hono';

const app = new Hono<{ Bindings: Env }>();

app.post('/chat', async (c) => {
const { messages } = await c.req.json<{
messages: Array<{ role: string; content: string }>;
}>();
const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages,
stream: true,
});
return new Response(response, {
headers: { 'content-type': 'text/event-stream' },
});
});
Pattern 2: RAG (Retrieval Augmented Generation)
// Step 1: Generate embeddings
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: [userQuery],
});
const vector = embeddings.data[0];
// Step 2: Search Vectorize
const matches = await env.VECTORIZE.query(vector, { topK: 3 });
// Step 3: Build context from matches
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');
// Step 4: Generate response with context
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{
role: 'system',
content: `Answer using this context:\n${context}`,
},
{ role: 'user', content: userQuery },
],
stream: true,
});
return new Response(response, {
headers: { 'content-type': 'text/event-stream' },
});
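The retrieval step assumes documents were indexed beforehand. A sketch of that indexing step, assuming the same VECTORIZE binding and a text metadata field:

// Index documents: embed each chunk, then upsert into Vectorize
const docs = ['Cloudflare Workers run at the edge.', 'Workers AI runs models serverlessly.'];
const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: docs });
await env.VECTORIZE.upsert(
  docs.map((text, i) => ({
    id: `doc-${i}`,
    values: data[i],
    metadata: { text }, // read back as m.metadata.text at query time
  }))
);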
Pattern 3: Structured Output with Zod
import { z } from 'zod';
const RecipeSchema = z.object({
name: z.string(),
ingredients: z.array(z.string()),
instructions: z.array(z.string()),
prepTime: z.number(),
});
app.post('/recipe', async (c) => {
const { dish } = await c.req.json<{ dish: string }>();
const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{
role: 'user',
content: `Generate a recipe for ${dish}. Return ONLY valid JSON matching this schema: ${JSON.stringify(RecipeSchema.shape)}`,
},
],
});
// Parse and validate (JSON.stringify(RecipeSchema.shape) above is only a rough
// hint for the model; a library like zod-to-json-schema produces a proper JSON Schema)
const recipe = RecipeSchema.parse(JSON.parse(response.response));
return c.json(recipe);
});
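Models do not always return valid JSON, so a hedged variant wraps parsing in safeParse with one retry (same RecipeSchema as above):

// Retry once if the model returns malformed or non-conforming JSON
async function generateRecipe(env: Env, dish: string) {
  for (let attempt = 0; attempt < 2; attempt++) {
    const res = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [{ role: 'user', content: `Generate a recipe for ${dish}. Return ONLY valid JSON.` }],
    });
    try {
      const parsed = RecipeSchema.safeParse(JSON.parse(res.response));
      if (parsed.success) return parsed.data;
    } catch {
      // JSON.parse threw; fall through to retry
    }
  }
  throw new Error('Model did not return valid JSON');
}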
Pattern 4: Image Generation + R2 Storage
app.post('/generate-image', async (c) => {
const { prompt } = await c.req.json<{ prompt: string }>();
// Generate image (flux-1-schnell returns JSON with a base64-encoded image)
const result = await c.env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt,
});
const imageBytes = Uint8Array.from(atob(result.image), (ch) => ch.charCodeAt(0));
// Store in R2
const key = `images/${Date.now()}.png`;
await c.env.BUCKET.put(key, imageBytes, {
httpMetadata: { contentType: 'image/png' },
});
return c.json({
success: true,
url: `https://your-domain.com/${key}`,
});
});
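A matching read path for the stored object, assuming the same BUCKET binding:

// Serve stored images back out of R2
app.get('/images/:name', async (c) => {
  const object = await c.env.BUCKET.get(`images/${c.req.param('name')}`);
  if (!object) return c.notFound();
  return new Response(object.body, {
    headers: { 'content-type': object.httpMetadata?.contentType ?? 'image/png' },
  });
});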
AI Gateway Integration
AI Gateway provides caching, logging, and analytics for AI requests.
Setup:
const response = await env.AI.run(
'@cf/meta/llama-3.1-8b-instruct',
{ prompt: 'Hello' },
{
gateway: {
id: 'my-gateway', // Your gateway ID
skipCache: false, // Use cache
},
}
);
Benefits:
- ✅ Cost Tracking - Monitor neurons usage per request
- ✅ Caching - Reduce duplicate inference costs
- ✅ Logging - Debug and analyze AI requests
- ✅ Rate Limiting - Additional layer of protection
- ✅ Analytics - Request patterns and performance
Access Gateway Logs:
const gateway = env.AI.gateway('my-gateway');
const logId = env.AI.aiGatewayLogId;
// Send feedback (feedback: 1 = positive, -1 = negative)
if (logId) {
  await gateway.patchLog(logId, {
    feedback: 1,
    metadata: { comment: 'Great response' },
  });
}
Rate Limits & Pricing
Rate Limits (per minute)
| Task Type | Default Limit | Notes |
|---|---|---|
| Text Generation | 300/min | Some fast models: 400-1500/min |
| Text Embeddings | 3000/min | BGE-large: 1500/min |
| Image Generation | 720/min | All image models |
| Vision Models | 720/min | Image understanding |
| Translation | 720/min | M2M100, Opus MT |
| Classification | 2000/min | Text classification |
| Speech Recognition | 720/min | Whisper models |
Pricing (Neurons-Based)
Free Tier:
- 10,000 neurons per day
- Resets daily at 00:00 UTC
Paid Tier:
- $0.011 per 1,000 neurons
- 10,000 neurons/day included
- Unlimited usage above free allocation
Example Costs:
| Model | Input (1M tokens) | Output (1M tokens) |
|---|---|---|
| Llama 3.2 1B | $0.027 | $0.201 |
| Llama 3.1 8B | $0.088 | $0.606 |
| BGE-base embeddings | $0.005 | N/A |
| Flux image generation | ~$0.011/image | N/A |
Production Checklist
Before Deploying
- Enable AI Gateway for cost tracking and logging
- Implement streaming for all text generation endpoints
- Add rate limit retry with exponential backoff
- Validate input length to prevent token limit errors (see the sketch after this list)
- Set appropriate timeouts (Workers: 30s CPU default, 5m max)
- Monitor neurons usage in Cloudflare dashboard
- Test error handling for model unavailable, rate limits
- Add input sanitization to prevent prompt injection
- Configure CORS if using from browser
- Plan for scale - upgrade to Paid plan if needed
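A minimal sketch of the input-length check from the list above; the 4-characters-per-token ratio is a rough heuristic, not an official tokenizer:

// Rough guard against oversized prompts (~4 chars per token is a heuristic)
const MAX_INPUT_TOKENS = 6000; // assumed budget; tune per model's context window
function assertWithinBudget(text: string): void {
  const approxTokens = Math.ceil(text.length / 4);
  if (approxTokens > MAX_INPUT_TOKENS) {
    throw new Error(`Input too long: ~${approxTokens} tokens (max ${MAX_INPUT_TOKENS})`);
  }
}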
Error Handling
async function runAIWithRetry(
env: Env,
model: string,
inputs: any,
maxRetries = 3
): Promise<any> {
let lastError: Error;
for (let i = 0; i < maxRetries; i++) {
try {
return await env.AI.run(model, inputs);
} catch (error) {
lastError = error as Error;
const message = lastError.message.toLowerCase();
// Rate limit - retry with backoff
if (message.includes('429') || message.includes('rate limit')) {
const delay = Math.pow(2, i) * 1000; // Exponential backoff
await new Promise((resolve) => setTimeout(resolve, delay));
continue;
}
// Other errors - throw immediately
throw error;
}
}
throw lastError!;
}
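Usage mirrors env.AI.run directly:

const result = await runAIWithRetry(env, '@cf/meta/llama-3.1-8b-instruct', {
  prompt: 'Hello',
});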
Monitoring
app.use('*', async (c, next) => {
const start = Date.now();
await next();
// Log AI usage
console.log({
path: c.req.path,
duration: Date.now() - start,
logId: c.env.AI.aiGatewayLogId,
});
});
OpenAI Compatibility
Workers AI supports OpenAI-compatible endpoints.
Using OpenAI SDK:
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: env.CLOUDFLARE_API_KEY,
baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});
// Chat completions
const completion = await openai.chat.completions.create({
model: '@cf/meta/llama-3.1-8b-instruct',
messages: [{ role: 'user', content: 'Hello!' }],
});
// Embeddings
const embeddings = await openai.embeddings.create({
model: '@cf/baai/bge-base-en-v1.5',
input: 'Hello world',
});
Endpoints:
- /v1/chat/completions - Text generation
- /v1/embeddings - Text embeddings
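The same endpoints also work with a plain fetch if you prefer to skip the SDK (environment variable names match the SDK example above):

// Direct REST call to the OpenAI-compatible chat endpoint
const res = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1/chat/completions`,
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${env.CLOUDFLARE_API_KEY}`,
      'content-type': 'application/json',
    },
    body: JSON.stringify({
      model: '@cf/meta/llama-3.1-8b-instruct',
      messages: [{ role: 'user', content: 'Hello!' }],
    }),
  }
);
const completion = await res.json();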
Vercel AI SDK Integration
npm install workers-ai-provider ai
import { createWorkersAI } from 'workers-ai-provider';
import { generateText, streamText } from 'ai';
const workersai = createWorkersAI({ binding: env.AI });
// Generate text
const result = await generateText({
model: workersai('@cf/meta/llama-3.1-8b-instruct'),
prompt: 'Write a poem',
});
// Stream text
const stream = streamText({
model: workersai('@cf/meta/llama-3.1-8b-instruct'),
prompt: 'Tell me a story',
});
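The result of streamText exposes tokens as an async iterable; in a Worker you can also hand it straight back as a Response (method names per the Vercel AI SDK, worth verifying against your installed version):

// Consume the stream incrementally...
for await (const chunk of stream.textStream) {
  console.log(chunk);
}
// ...or return it directly from a fetch handler
return stream.toTextStreamResponse();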
Limits Summary
| Feature | Limit |
|---|---|
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers timeout applies (30s default, 5m max CPU) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |
References
GitHub Repository: https://github.com/jezweb/claude-skills
Related Skills
sglang (Meta)
SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
evaluating-llms-harness (Testing)
This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.
llamaguard (Other)
LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.
langchain (Meta)
LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
