ai-multimodal

by mrgoonie
About

The ai-multimodal skill enables developers to process and generate multimedia content using the Google Gemini API. It provides comprehensive capabilities for analyzing audio, images, videos, and documents—including transcription, object detection, scene analysis, and PDF extraction. Use this skill when implementing multimodal AI features that require working with diverse media types or extracting structured data from files.

Documentation

AI Multimodal Processing Skill

Process audio, images, videos, and documents, and generate images, using Google Gemini's multimodal API. A unified interface for all multimedia content understanding and generation.

Core Capabilities

Audio Processing

  • Transcription with timestamps (up to 9.5 hours)
  • Audio summarization and analysis
  • Speech understanding and speaker identification
  • Music and environmental sound analysis
  • Text-to-speech generation with controllable voice

Image Understanding

  • Image captioning and description
  • Object detection with bounding boxes (2.0+)
  • Pixel-level segmentation (2.5+)
  • Visual question answering
  • Multi-image comparison (up to 3,600 images)
  • OCR and text extraction

Video Analysis

  • Scene detection and summarization
  • Video Q&A with temporal understanding
  • Transcription with visual descriptions
  • YouTube URL support
  • Long video processing (up to 6 hours)
  • Frame-level analysis

Document Extraction

  • Native PDF vision processing (up to 1,000 pages)
  • Table and form extraction
  • Chart and diagram analysis
  • Multi-page document understanding
  • Structured data output (JSON schema)
  • Format conversion (PDF to HTML/JSON)

Image Generation

  • Text-to-image generation
  • Image editing and modification
  • Multi-image composition (up to 3 images)
  • Iterative refinement
  • Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
  • Controllable style and quality

Capability Matrix

Task               Audio   Image   Video   Document   Generation
Transcription      ✓       -       ✓       -          -
Summarization      ✓       ✓       ✓       ✓          -
Q&A                ✓       ✓       ✓       ✓          -
Object Detection   -       ✓       ✓       -          -
Text Extraction    -       ✓       -       ✓          -
Structured Output  ✓       ✓       ✓       ✓          -
Creation           TTS     -       -       -          ✓
Timestamps         ✓       -       ✓       -          -
Segmentation       -       ✓       -       -          -

Model Selection Guide

Gemini 2.5 Series (Recommended)

  • gemini-2.5-pro: Highest quality, all features, 1M-2M context
  • gemini-2.5-flash: Best balance, all features, 1M-2M context
  • gemini-2.5-flash-lite: Lightweight, segmentation support
  • gemini-2.5-flash-image: Image generation only

Gemini 2.0 Series

  • gemini-2.0-flash: Fast processing, object detection
  • gemini-2.0-flash-lite: Lightweight option

Feature Requirements

  • Segmentation: Requires 2.5+ models
  • Object Detection: Requires 2.0+ models
  • Multi-video: Requires 2.5+ models
  • Image Generation: Requires flash-image model

Context Windows

  • 2M tokens: ~6 hours video (low-res) or ~2 hours (default)
  • 1M tokens: ~3 hours video (low-res) or ~1 hour (default)
  • Audio: 32 tokens/second (1 min = 1,920 tokens)
  • PDF: 258 tokens/page (fixed)
  • Image: 258-1,548 tokens based on size

Quick Start

Prerequisites

API Key Setup: Supports both Google AI Studio and Vertex AI.

The skill checks for GEMINI_API_KEY in this order (a lookup sketch follows the list):

  1. Process environment: export GEMINI_API_KEY="your-key"
  2. Project root: .env
  3. .claude/.env
  4. .claude/skills/.env
  5. .claude/skills/ai-multimodal/.env

Get API key: https://aistudio.google.com/apikey
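
A minimal sketch of this lookup order using python-dotenv; the helper name and exact behavior are illustrative, not the skill's actual code:

import os
from pathlib import Path
from dotenv import load_dotenv

def resolve_gemini_key() -> str | None:
    """Hypothetical helper: walk the .env locations in priority order."""
    candidates = [
        Path(".env"),                                # project root
        Path(".claude/.env"),
        Path(".claude/skills/.env"),
        Path(".claude/skills/ai-multimodal/.env"),
    ]
    for env_file in candidates:
        if "GEMINI_API_KEY" in os.environ:
            break  # the process environment always wins
        if env_file.is_file():
            load_dotenv(env_file, override=False)    # never overwrite an earlier hit
    return os.environ.get("GEMINI_API_KEY")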

For Vertex AI:

export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional

Install SDK:

pip install google-genai python-dotenv pillow
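
A quick way to confirm the SDK and key are working (assumes GEMINI_API_KEY is already set; the prompt is arbitrary):

from google import genai

client = genai.Client()  # picks up GEMINI_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Reply with the single word: ready",
)
print(response.text)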

Common Patterns

Transcribe Audio:

python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash

Analyze Image:

python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash

Process Video:

python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash

Extract from PDF:

python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json
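
The same extraction can be done directly through the SDK. A sketch using the File API and JSON output mode (the prompt and file name are placeholders; a response_schema can also be supplied for strict validation):

from google import genai
from google.genai import types

client = genai.Client()
pdf = client.files.upload(file="document.pdf")
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Extract every table as JSON.", pdf],
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)
print(response.text)  # a JSON string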

Generate Image:

python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9
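
For reference, a sketch of the equivalent SDK call; generated images come back as inline bytes in the response parts (the output filename is arbitrary):

from io import BytesIO
from google import genai
from PIL import Image

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents="A futuristic city at sunset",
)
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:  # image bytes, as opposed to text parts
        Image.open(BytesIO(part.inline_data.data)).save("city-sunset.png")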

Optimize Media:

# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch optimize multiple files
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85

Convert Documents to Markdown:

# Convert DOCX to Markdown
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Extract pages
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20

Supported Formats

Audio

  • WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
  • Max 9.5 hours per request
  • Auto-downsampled to 16 Kbps mono

Images

  • PNG, JPEG, WEBP, HEIC, HEIF
  • Max 3,600 images per request
  • Resolution: ≤384px = 258 tokens, larger = tiled

Video

  • MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
  • Max 6 hours (low-res) or 2 hours (default)
  • YouTube URLs supported (public only)
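
Public YouTube videos can be passed by URL rather than uploaded. A hedged sketch of the documented pattern (VIDEO_ID is a placeholder):

from google import genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=types.Content(parts=[
        types.Part(file_data=types.FileData(
            file_uri="https://www.youtube.com/watch?v=VIDEO_ID")),
        types.Part(text="Summarize the key points with timestamps."),
    ]),
)
print(response.text)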

Documents

  • PDF only for vision processing
  • Max 1,000 pages
  • TXT, HTML, Markdown supported (text-only)

Size Limits

  • Inline: <20MB total request
  • File API: 2GB per file, 20GB project quota (upload sketch below)
  • Retention: 48 hours auto-delete
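
Files uploaded through the File API are processed asynchronously, so large uploads should be polled until they become ACTIVE before use. A sketch, assuming the google-genai SDK:

import time
from google import genai

client = genai.Client()
video = client.files.upload(file="large-video.mp4")
while video.state.name == "PROCESSING":   # large files take a while server-side
    time.sleep(5)
    video = client.files.get(name=video.name)
if video.state.name == "ACTIVE":
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=["Summarize this video.", video],
    )
    print(response.text)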

Reference Navigation

For detailed implementation guidance, see:

Audio Processing

  • references/audio-processing.md - Transcription, analysis, TTS
    • Timestamp handling and segment analysis
    • Multi-speaker identification
    • Non-speech audio analysis
    • Text-to-speech generation

Image Understanding

  • references/vision-understanding.md - Captioning, detection, OCR
    • Object detection and localization
    • Pixel-level segmentation
    • Visual question answering
    • Multi-image comparison

Video Analysis

  • references/video-analysis.md - Scene detection, temporal understanding
    • YouTube URL processing
    • Timestamp-based queries
    • Video clipping and FPS control
    • Long video optimization

Document Extraction

  • references/document-extraction.md - PDF processing, structured output
    • Table and form extraction
    • Chart and diagram analysis
    • JSON schema validation
    • Multi-page handling

Image Generation

  • references/image-generation.md - Text-to-image, editing
    • Prompt engineering strategies
    • Image editing and composition
    • Aspect ratio selection
    • Safety settings

Cost Optimization

Token Costs

Pricing:

  • Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
  • Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
  • Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output

Token Rates:

  • Audio: 32 tokens/second (1 min = 1,920 tokens)
  • Video: ~300 tokens/second (default) or ~100 (low-res)
  • PDF: 258 tokens/page (fixed)
  • Image: 258-1,548 tokens based on size
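
As a worked example: a 10-minute video at default resolution is roughly 600 s × 300 tokens/s = 180,000 input tokens, while the same content as audio alone is 600 s × 32 tokens/s = 19,200 tokens, about 9× cheaper.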

TTS Pricing:

  • Flash TTS: $10/1M tokens
  • Pro TTS: $20/1M tokens

Best Practices

  1. Use gemini-2.5-flash for most tasks (best price/performance)
  2. Use File API for files >20MB or repeated queries
  3. Optimize media before upload (see media_optimizer.py)
  4. Process specific segments instead of full videos
  5. Use lower FPS for static content
  6. Implement context caching for repeated queries (see the sketch after this list)
  7. Batch process multiple files in parallel
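
Context caching (item 6) keeps a large file's tokens resident on the server so repeated queries only pay for the new prompt. A sketch, assuming the google-genai SDK; exact model-version and TTL requirements may differ:

from google import genai
from google.genai import types

client = genai.Client()
doc = client.files.upload(file="document.pdf")
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(contents=[doc], ttl="3600s"),
)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What does section 2 cover?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)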

Rate Limits

Free Tier:

  • 10-15 RPM (requests per minute)
  • 1M-4M TPM (tokens per minute)
  • 1,500 RPD (requests per day)

YouTube Limits:

  • Free tier: 8 hours/day
  • Paid tier: No length limits
  • Public videos only

Storage Limits:

  • 20GB per project
  • 2GB per file
  • 48-hour retention

Error Handling

Common errors and solutions (a retry sketch follows the list):

  • 400: Invalid format/size - validate before upload
  • 401: Invalid API key - check configuration
  • 403: Permission denied - verify API key restrictions
  • 404: File not found - ensure file uploaded and active
  • 429: Rate limit exceeded - implement exponential backoff
  • 500: Server error - retry with backoff
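
A minimal backoff wrapper for the 429/500 cases; in real code, catch the SDK's specific rate-limit and server-error exceptions rather than bare Exception:

import random
import time

def call_with_backoff(fn, max_retries=5, base=1.0, cap=60.0):
    """Retry fn() on transient failures with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(min(cap, base * 2 ** attempt) + random.random())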

Scripts Overview

All scripts support unified API key detection and error handling:

gemini_batch_process.py: Batch process multiple media files

  • Supports all modalities (audio, image, video, PDF)
  • Progress tracking and error recovery
  • Output formats: JSON, Markdown, CSV
  • Rate limiting and retry logic
  • Dry-run mode

media_optimizer.py: Prepare media for Gemini API

  • Compress videos/audio for size limits
  • Resize images appropriately
  • Split long videos into chunks
  • Format conversion
  • Quality vs size optimization

document_converter.py: Convert documents to PDF

  • Convert DOCX, XLSX, PPTX to PDF
  • Extract page ranges
  • Optimize PDFs for Gemini
  • Extract images from PDFs
  • Batch conversion support

Run any script with --help for detailed usage.

Resources

Quick Install

/plugin add https://github.com/mrgoonie/claudekit-skills/tree/main/ai-multimodal

Copy and paste this command into Claude Code to install the skill.

GitHub Repository

mrgoonie/claudekit-skills
Path: .claude/skills/ai-multimodal

Related Skills

sglang

Meta

SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.


evaluating-llms-harness

Testing

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.


llamaguard

Other

LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.


langchain

Meta

LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
