Back to Skills

stable-diffusion-image-generation

davila7
Updated Today
327 views
18,478
1,685
18,478
View on GitHub
MetaImage GenerationStable DiffusionDiffusersText-to-ImageMultimodalComputer Vision

About

This skill enables text-to-image generation and image manipulation using Stable Diffusion via HuggingFace Diffusers. It supports image generation from prompts, image-to-image translation, inpainting, and custom pipeline creation. Developers should use it when building applications requiring AI-powered visual content generation or editing.

Quick Install

Claude Code

Recommended
Primary
npx skills add davila7/claude-code-templates
Plugin CommandAlternative
/plugin add https://github.com/davila7/claude-code-templates
Git CloneAlternative
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/stable-diffusion-image-generation

Copy and paste this command in Claude Code to install this skill

Documentation

Stable Diffusion Image Generation

Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.

When to use Stable Diffusion

Use Stable Diffusion when:

  • Generating images from text descriptions
  • Performing image-to-image translation (style transfer, enhancement)
  • Inpainting (filling in masked regions)
  • Outpainting (extending images beyond boundaries)
  • Creating variations of existing images
  • Building custom image generation workflows

Key features:

  • Text-to-Image: Generate images from natural language prompts
  • Image-to-Image: Transform existing images with text guidance
  • Inpainting: Fill masked regions with context-aware content
  • ControlNet: Add spatial conditioning (edges, poses, depth)
  • LoRA Support: Efficient fine-tuning and style adaptation
  • Multiple Models: SD 1.5, SDXL, SD 3.0, Flux support

Use alternatives instead:

  • DALL-E 3: For API-based generation without GPU
  • Midjourney: For artistic, stylized outputs
  • Imagen: For Google Cloud integration
  • Leonardo.ai: For web-based creative workflows

Quick start

Installation

pip install diffusers transformers accelerate torch
pip install xformers  # Optional: memory-efficient attention

Basic text-to-image

from diffusers import DiffusionPipeline
import torch

# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate image
image = pipe(
    "A serene mountain landscape at sunset, highly detailed",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]

image.save("output.png")

Using SDXL (higher quality)

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.to("cuda")

# Enable memory optimization
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A futuristic city with flying cars, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=30
).images[0]

Architecture overview

Three-pillar design

Diffusers is built around three core components:

Pipeline (orchestration)
├── Model (neural networks)
│   ├── UNet / Transformer (noise prediction)
│   ├── VAE (latent encoding/decoding)
│   └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)

Pipeline inference flow

Text Prompt → Text Encoder → Text Embeddings
                                    ↓
Random Noise → [Denoising Loop] ← Scheduler
                      ↓
               Predicted Noise
                      ↓
              VAE Decoder → Final Image

Core concepts

Pipelines

Pipelines orchestrate complete workflows:

PipelinePurpose
StableDiffusionPipelineText-to-image (SD 1.x/2.x)
StableDiffusionXLPipelineText-to-image (SDXL)
StableDiffusion3PipelineText-to-image (SD 3.0)
FluxPipelineText-to-image (Flux models)
StableDiffusionImg2ImgPipelineImage-to-image
StableDiffusionInpaintPipelineInpainting

Schedulers

Schedulers control the denoising process:

SchedulerStepsQualityUse Case
EulerDiscreteScheduler20-50GoodDefault choice
EulerAncestralDiscreteScheduler20-50GoodMore variation
DPMSolverMultistepScheduler15-25ExcellentFast, high quality
DDIMScheduler50-100GoodDeterministic
LCMScheduler4-8GoodVery fast
UniPCMultistepScheduler15-25ExcellentFast convergence

Swapping schedulers

from diffusers import DPMSolverMultistepScheduler

# Swap for faster generation
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)

# Now generate with fewer steps
image = pipe(prompt, num_inference_steps=20).images[0]

Generation parameters

Key parameters

ParameterDefaultDescription
promptRequiredText description of desired image
negative_promptNoneWhat to avoid in the image
num_inference_steps50Denoising steps (more = better quality)
guidance_scale7.5Prompt adherence (7-12 typical)
height, width512/1024Output dimensions (multiples of 8)
generatorNoneTorch generator for reproducibility
num_images_per_prompt1Batch size

Reproducible generation

import torch

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="A cat wearing a top hat",
    generator=generator,
    num_inference_steps=50
).images[0]

Negative prompts

image = pipe(
    prompt="Professional photo of a dog in a garden",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    guidance_scale=7.5
).images[0]

Image-to-image

Transform existing images with text guidance:

from diffusers import AutoPipelineForImage2Image
from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene",
    image=init_image,
    strength=0.75,  # How much to transform (0-1)
    num_inference_steps=50
).images[0]

Inpainting

Fill masked regions:

from diffusers import AutoPipelineForInpainting
from PIL import Image

pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.jpg")
mask = Image.open("mask.png")  # White = inpaint region

result = pipe(
    prompt="A red car parked on the street",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]

ControlNet

Add spatial conditioning for precise control:

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch

# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Use Canny edge image as control
control_image = get_canny_image(input_image)

image = pipe(
    prompt="A beautiful house in the style of Van Gogh",
    image=control_image,
    num_inference_steps=30
).images[0]

Available ControlNets

ControlNetInput TypeUse Case
cannyEdge mapsPreserve structure
openposePose skeletonsHuman poses
depthDepth maps3D-aware generation
normalNormal mapsSurface details
mlsdLine segmentsArchitectural lines
scribbleRough sketchesSketch-to-image

LoRA adapters

Load fine-tuned style adapters:

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")

# Generate with LoRA style
image = pipe("A portrait in the trained style").images[0]

# Adjust LoRA strength
pipe.fuse_lora(lora_scale=0.8)

# Unload LoRA
pipe.unload_lora_weights()

Multiple LoRAs

# Load multiple LoRAs
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")

# Set weights for each
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])

image = pipe("A portrait").images[0]

Memory optimization

Enable CPU offloading

# Model CPU offload - moves models to CPU when not in use
pipe.enable_model_cpu_offload()

# Sequential CPU offload - more aggressive, slower
pipe.enable_sequential_cpu_offload()

Attention slicing

# Reduce memory by computing attention in chunks
pipe.enable_attention_slicing()

# Or specific chunk size
pipe.enable_attention_slicing("max")

xFormers memory-efficient attention

# Requires xformers package
pipe.enable_xformers_memory_efficient_attention()

VAE slicing for large images

# Decode latents in tiles for large images
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()

Model variants

Loading different precisions

# FP16 (recommended for GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.float16,
    variant="fp16"
)

# BF16 (better precision, requires Ampere+ GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16
)

Loading specific components

from diffusers import UNet2DConditionModel, AutoencoderKL

# Load custom VAE
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# Use with pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    vae=vae,
    torch_dtype=torch.float16
)

Batch generation

Generate multiple images efficiently:

# Multiple prompts
prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture"
]

images = pipe(prompts, num_inference_steps=30).images

# Multiple images per prompt
images = pipe(
    "A beautiful sunset",
    num_images_per_prompt=4,
    num_inference_steps=30
).images

Common workflows

Workflow 1: High-quality generation

from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# 1. Load SDXL with optimizations
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# 2. Generate with quality settings
image = pipe(
    prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
    negative_prompt="blurry, low quality, cartoon, anime, sketch",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=1024,
    width=1024
).images[0]

Workflow 2: Fast prototyping

from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch

# Use LCM for 4-8 step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
).to("cuda")

# Load LCM LoRA for fast generation
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()

# Generate in ~1 second
image = pipe(
    "A beautiful landscape",
    num_inference_steps=4,
    guidance_scale=1.0
).images[0]

Common issues

CUDA out of memory:

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Or use lower precision
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

Black/noise images:

# Check VAE configuration
# Use safety checker bypass if needed
pipe.safety_checker = None

# Ensure proper dtype consistency
pipe = pipe.to(dtype=torch.float16)

Slow generation:

# Use faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]

References

Resources

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/multimodal-stable-diffusion
anthropicanthropic-claudeclaudeclaude-code

Related Skills

blip-2-vision-language

Design

BLIP-2 is a vision-language framework that connects a frozen image encoder with a large language model for multimodal tasks. Use it for zero-shot image captioning, visual question answering, or image-text retrieval without task-specific fine-tuning. It's ideal for developers needing to add state-of-the-art visual understanding to LLM-based applications.

View skill

audiocraft-audio-generation

Meta

This Claude Skill provides text-to-music and text-to-audio generation using Meta's AudioCraft PyTorch library. It enables developers to generate music from descriptions, create sound effects, and perform melody-conditioned music generation. Key capabilities include using the MusicGen and AudioGen models for controllable, high-quality stereo audio output.

View skill

segment-anything-model

Meta

The segment-anything-model skill performs zero-shot image segmentation, allowing developers to isolate objects using prompts like points or bounding boxes, or to automatically generate all object masks. It's ideal for building annotation tools, generating training data, or processing images in new domains without task-specific training. Key capabilities include handling interactive prompts and providing strong out-of-the-box performance for various computer vision pipelines.

View skill

llamaindex

Meta

LlamaIndex is a data framework for building RAG applications, specializing in ingesting documents from numerous sources and indexing them for querying. It provides key components like vector indices and query engines to enable document Q&A, chatbots, and knowledge retrieval over private data. Use it when you need to connect LLMs to your own data for data-centric applications.

View skill