SKILL·E7DABD

analyze-generative-diffusion-model

Name: analyze-generative-diffusion-model
Author: pjt222

pjt222

Actualizado 1 month ago

9 vistas

Metaaidesign

Acerca de

Esta habilidad analiza modelos generativos de difusión preentrenados como Stable Diffusion mediante el cálculo de métricas de calidad (FID, puntuación CLIP), visualización de mapas de atención y exploración de espacios latentes. Úsela para evaluar la calidad de la salida del modelo, comparar programaciones de ruido o analizar patrones de atención cruzada en la generación condicionada por texto. Está diseñada para desarrolladores que realizan evaluación e inspección avanzada de modelos.

Instalación rápida

Claude Code

Recomendado

Principal

npx skills add pjt222/agent-almanac -a claude-code

Comando PluginAlternativo

/plugin add https://github.com/pjt222/agent-almanac

Git CloneAlternativo

git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/analyze-generative-diffusion-model

Copia y pega este comando en Claude Code para instalar esta habilidad

Documentación

Analyze Generative Diffusion Model

Evaluate pre-trained generative diffusion via quant metrics, noise schedule inspect, cross-attention maps, latent probe → behavior, failure diagnosis, fine-tune decisions.

Use When

Eval pre-trained generative diffusion out quality, standard metrics
Compute FID, IS, CLIP, precision/recall for generated sets
Inspect + compare noise schedules (linear, cosine, learned) via SNR curves
Extract cross-attention maps → text-to-image token-region
Interpolate latent codes or discover semantic directions
Detect OOD in for diffusion pipeline

In

Required: Pre-trained model ID or checkpoint path (e.g., stabilityai/stable-diffusion-2-1)
Required: Mode — one+: metrics, schedule, attention, latent
Required: Reference dataset (real images or name)
Optional: Text prompts for attention (default: model-appropriate test prompts)
Optional: N samples for metrics (default: 10000)
Optional: Device (default: cuda if avail, else cpu)

Do

Step 1: Quant Evaluation

Standard generative quality metrics vs reference dataset.

Setup eval pipeline:

import torch
from diffusers import StableDiffusionPipeline
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to(device)

fid = FrechetInceptionDistance(feature=2048, normalize=True).to(device)
inception = InceptionScore(normalize=True).to(device)

Feed real images:

from torch.utils.data import DataLoader

for batch in DataLoader(real_dataset, batch_size=64):
    imgs = (batch * 255).byte().to(device)
    fid.update(imgs, real=True)

Generate + accumulate fake stats:

prompts = load_evaluation_prompts("prompts.txt")  # one prompt per line
n_generated = 0
while n_generated < 10000:
    prompt_batch = prompts[n_generated:n_generated + 8]
    images = pipe(prompt_batch, num_inference_steps=50).images
    tensors = torch.stack([to_tensor(img) for img in images]).to(device)
    byte_imgs = (tensors * 255).byte()
    fid.update(byte_imgs, real=False)
    inception.update(byte_imgs)
    n_generated += len(images)

CLIP score → text-image align:

from torchmetrics.multimodal.clip_score import CLIPScore

clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-large-patch14").to(device)
for prompt, image_tensor in zip(sampled_prompts, sampled_tensors):
    clip_metric.update(image_tensor.unsqueeze(0), [prompt])

print(f"FID: {fid.compute():.2f}")
print(f"IS:  {inception.compute()[0]:.2f} +/- {inception.compute()[1]:.2f}")
print(f"CLIP: {clip_metric.compute():.2f}")

Precision + recall → mode coverage:

from torchmetrics.image import FrechetInceptionDistance

# Precision: fraction of generated images near real manifold
# Recall: fraction of real images near generated manifold
# Use improved precision/recall (Kynkaanniemi et al., 2019) via
# feature embeddings from the Inception network

→ FID <30 for well-trained SD on benchmarks. IS >50 on ImageNet prompts. CLIP >25 for text-conditioned. Precision + recall both >0.6.

If err: FID >100 → verify real + generated same res + normalization. CLIP low but FID OK → model generates plausible no-prompt-match → check text encoder. ≥10K samples for stable FID.

Step 2: Noise Schedule Inspect

Visualize + compare forward + reverse schedules.

Extract schedule params:

scheduler = pipe.scheduler
betas = torch.tensor(scheduler.betas) if hasattr(scheduler, 'betas') else None
alphas_cumprod = torch.tensor(scheduler.alphas_cumprod)
timesteps = torch.arange(len(alphas_cumprod))

SNR curve:

import numpy as np
import matplotlib.pyplot as plt

snr = alphas_cumprod / (1 - alphas_cumprod)
log_snr = torch.log(snr)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
axes[0].plot(timesteps.numpy(), alphas_cumprod.numpy())
axes[0].set_xlabel("Timestep"); axes[0].set_ylabel("alpha_cumprod")
axes[0].set_title("Cumulative Signal Retention")

axes[1].plot(timesteps.numpy(), log_snr.numpy())
axes[1].set_xlabel("Timestep"); axes[1].set_ylabel("log(SNR)")
axes[1].set_title("Log Signal-to-Noise Ratio")

if betas is not None:
    axes[2].plot(timesteps.numpy(), betas.numpy())
    axes[2].set_xlabel("Timestep"); axes[2].set_ylabel("beta")
    axes[2].set_title("Beta Schedule")
fig.tight_layout()
fig.savefig("noise_schedule.png", dpi=150)

Compare schedule types:

from diffusers import DDPMScheduler

schedules = {
    "linear": DDPMScheduler(beta_schedule="linear", num_train_timesteps=1000),
    "cosine": DDPMScheduler(beta_schedule="squaredcos_cap_v2", num_train_timesteps=1000),
}

fig, ax = plt.subplots(figsize=(10, 6))
for name, sched in schedules.items():
    ac = torch.tensor(sched.alphas_cumprod)
    snr = torch.log(ac / (1 - ac))
    ax.plot(snr.numpy(), label=name)
ax.set_xlabel("Timestep"); ax.set_ylabel("log(SNR)")
ax.set_title("Schedule Comparison"); ax.legend()
fig.savefig("schedule_comparison.png", dpi=150)

→ Cosine → more gradual SNR decrease in mid-timesteps vs linear. Log-SNR span ~+10 (clean) to -10 (pure noise). Learned schedules monotonic decreasing.

If err: alphas_cumprod non-monotonic → misconfig. Constant → scheduler not init w/ model config. Custom schedulers → verify set_timesteps() called.

Step 3: Attention Map Analysis

Extract + visualize cross-attention from text-conditioned.

attention_maps = {}

def hook_fn(name):
    def fn(module, input, output):
        # Cross-attention: Q from image, K/V from text
        if hasattr(module, 'processor'):
            attention_maps[name] = output.detach().cpu()
    return fn

for name, module in pipe.unet.named_modules():
    if 'attn2' in name and hasattr(module, 'processor'):
        module.register_forward_hook(hook_fn(name))

Run inference + collect attention at specific timesteps:

prompt = "a red car parked next to a blue house"
timestep_attention = {}

# Custom callback to capture attention at specific timesteps
def callback_fn(pipe, step_index, timestep, callback_kwargs):
    if step_index in [5, 15, 30, 45]:
        timestep_attention[int(timestep)] = {
            k: v.clone() for k, v in attention_maps.items()
        }
    return callback_kwargs

output = pipe(prompt, num_inference_steps=50, callback_on_step_end=callback_fn)

Visualize token-region:

tokenizer = pipe.tokenizer
tokens = tokenizer.encode(prompt)
token_strings = [tokenizer.decode([t]) for t in tokens]

# Select a mid-resolution attention layer
layer_key = [k for k in attention_maps if 'mid' in k or 'up.1' in k][0]
attn = attention_maps[layer_key]  # shape: (batch, heads, hw, seq_len)
attn_avg = attn.mean(dim=1)  # average across heads
res = int(attn_avg.shape[1] ** 0.5)
attn_map = attn_avg[0].reshape(res, res, -1)

fig, axes = plt.subplots(2, min(len(token_strings), 6), figsize=(18, 6))
for idx, token in enumerate(token_strings[:6]):
    for row, (ts, ts_attn) in enumerate(list(timestep_attention.items())[:2]):
        a = ts_attn[layer_key].mean(dim=1)[0]
        a_res = int(a.shape[0] ** 0.5)
        axes[row, idx].imshow(a[:, idx].reshape(a_res, a_res), cmap="hot")
        axes[row, idx].set_title(f"t={ts}: '{token}'")
        axes[row, idx].axis("off")
fig.suptitle("Cross-Attention Maps by Token and Timestep")
fig.tight_layout()
fig.savefig("attention_maps.png", dpi=150)

→ Content tokens ("car", "house") → localized spatial regions. Style/color ("red", "blue") → regions overlapping w/ object. Early (high noise) diffuse; later sharp + localized.

If err: All uniform → hook capturing self-attention not cross → verify layer has attn2 (cross) not attn1 (self). Wrong dims → check out tensor indexing matches head count + spatial res.

Step 4: Latent Space Probe

Structure via interpolation + direction discovery.

Encode refs into latent space:

from diffusers import AutoencoderKL
from PIL import Image
import torchvision.transforms as T

vae = pipe.vae
transform = T.Compose([T.Resize(512), T.CenterCrop(512), T.ToTensor(),
                       T.Normalize([0.5], [0.5])])

def encode_image(image_path):
    img = transform(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        latent = vae.encode(img.half()).latent_dist.sample() * vae.config.scaling_factor
    return latent

z1 = encode_image("image_a.png")
z2 = encode_image("image_b.png")

Spherical linear interpolation (slerp):

def slerp(z1, z2, alpha):
    """Spherical linear interpolation between two latent codes."""
    z1_flat = z1.flatten()
    z2_flat = z2.flatten()
    omega = torch.acos(torch.clamp(
        torch.dot(z1_flat, z2_flat) / (z1_flat.norm() * z2_flat.norm()), -1, 1
    ))
    if omega.abs() < 1e-6:
        return (1 - alpha) * z1 + alpha * z2
    return (torch.sin((1 - alpha) * omega) * z1 + torch.sin(alpha * omega) * z2) / torch.sin(omega)

alphas = torch.linspace(0, 1, 8)
interpolated = [slerp(z1, z2, a.item()) for a in alphas]
decoded = []
for z in interpolated:
    with torch.no_grad():
        img = vae.decode(z / vae.config.scaling_factor).sample
    decoded.append(img.cpu())

Discover semantic directions via prompt-pair diffs:

def get_text_embedding(prompt):
    tokens = pipe.tokenizer(prompt, return_tensors="pt", padding="max_length",
                            max_length=77, truncation=True).input_ids.to(device)
    with torch.no_grad():
        emb = pipe.text_encoder(tokens).last_hidden_state
    return emb

pos_emb = get_text_embedding("a happy person smiling")
neg_emb = get_text_embedding("a sad person frowning")
direction = pos_emb - neg_emb  # semantic direction in text embedding space

Detect OOD latents:

# Compute latent space statistics from a reference set
ref_latents = torch.stack([encode_image(p) for p in reference_paths])
ref_mean = ref_latents.mean(dim=0)
ref_std = ref_latents.std(dim=0)

def ood_score(z):
    """Mahalanobis-like OOD score (higher = more unusual)."""
    deviation = ((z - ref_mean) / (ref_std + 1e-6)).flatten()
    return deviation.norm().item()

test_z = encode_image("test_image.png")
score = ood_score(test_z)
print(f"OOD score: {score:.2f} (reference mean: {np.mean([ood_score(r) for r in ref_latents]):.2f})")

→ Interpolated images smooth semantic transitions no artifacts. Semantic directions → consistent attribute changes across diverse latents. In-dist OOD scores cluster tight; outliers score much higher.

If err: Blurry/incoherent midpoints → slerp not linear — linear traverses low-density regions in high-dim latents. Semantic directions no effect → increase magnitude or verify same text encoder as training.

Check

FID ≥10K generated + matching real sample count
CLIP computed w/ same CLIP model as training (if applicable)
Noise schedule viz shows monotonic decreasing alphas_cumprod
Log-SNR spans ~+10 to -10 across timestep range
Attention maps resolve per-token spatial at mid-res layers
Attention sharpens early (diffuse) → late (localized)
Latent interpolations smooth no sudden jumps/artifacts
OOD baseline ≥100 ref samples

Traps

FID mismatched res: Real + generated must be same res pre-Inception. Resize both identically or FID inflated.
Forget normalize for torchmetrics: FrechetInceptionDistance(normalize=True) → [0,1] float. normalize=False → [0,255] uint8. Mix → meaningless FID.
Hook self-attention not cross: attn1 = self (image-to-image). Use attn2 cross (text-to-image). Confuse → uninformative uniform.
Linear interp high dims: Linear between 2 high-dim Gaussians passes low-density shell. Always slerp in diffusion latents.
Ignore VAE scaling factor: SD latents scaled by vae.config.scaling_factor post-encode. Forget → garbled decode.
Too few samples precision/recall: <5K samples/set → unreliable. ≥10K for stable.

→

implement-diffusion-network — build diffusion models this skill evals
analyze-diffusion-dynamics — math foundations of inspected noise procs
fit-drift-diffusion-model — different diffusion family, same SDE foundations

Repositorio GitHub

pjt222/agent-almanac

Ruta: i18n/caveman-ultra/skills/analyze-generative-diffusion-model

agentsagentskillsai-assisted-developmentclaude-codeskillsteams

FAQ

Frequently asked questions

What is the analyze-generative-diffusion-model skill?

analyze-generative-diffusion-model is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform analyze-generative-diffusion-model-related tasks without extra prompting.

How do I install analyze-generative-diffusion-model?

Use the install commands on this page: add analyze-generative-diffusion-model to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does analyze-generative-diffusion-model belong to?

analyze-generative-diffusion-model is in the Meta category, tagged ai and design.

Is analyze-generative-diffusion-model free to use?

Yes. analyze-generative-diffusion-model is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.

Habilidades relacionadas

content-collections

Meta

Esta habilidad proporciona una configuración probada en producción para Content Collections, una herramienta centrada en TypeScript que transforma archivos Markdown/MDX en colecciones de datos con tipado seguro mediante validación Zod. Úsala al construir blogs, sitios de documentación o aplicaciones Vite + React con mucho contenido para garantizar seguridad de tipos y validación automática de contenido. Abarca todo, desde la configuración del plugin de Vite y compilación MDX hasta la optimización de despliegue y validación de esquemas.

Ver habilidad

polymarket

Meta

Esta habilidad permite a los desarrolladores crear aplicaciones con la plataforma de mercados de predicción Polymarket, incluyendo la integración de API para operaciones y datos de mercado. También proporciona transmisión de datos en tiempo real a través de WebSocket para monitorear operaciones en vivo y actividad del mercado. Úsela para implementar estrategias de trading o crear herramientas que procesen actualizaciones de mercado en tiempo real.

Ver habilidad

creating-opencode-plugins

Meta

Esta habilidad ayuda a los desarrolladores a crear complementos de OpenCode que se conectan a más de 25 tipos de eventos, como comandos, archivos y operaciones LSP. Proporciona la estructura del complemento, las especificaciones de la API de eventos y los patrones de implementación para módulos en JavaScript/TypeScript. Úsala cuando necesites interceptar, monitorear o extender el ciclo de vida del asistente de IA de OpenCode con lógica personalizada basada en eventos.

Ver habilidad

sglang

Meta

SGLang es un framework de alto rendimiento para el servicio de LLM que se especializa en generación rápida y estructurada para JSON, expresiones regulares y flujos de trabajo de agentes utilizando su caché de prefijos RadixAttention. Ofrece una inferencia significativamente más rápida, especialmente para tareas con prefijos repetidos, lo que lo hace ideal para salidas complejas y estructuradas, y conversaciones multiturno. Elige SGLang sobre alternativas como vLLM cuando necesites decodificación restringida o estés construyendo aplicaciones con uso extensivo de prefijos compartidos.

Ver habilidad