MCP Hub

gemini-vision

Elios-FPT

About

The gemini-vision skill enables Claude to implement Google's Gemini API for advanced image analysis. It provides capabilities for image captioning, classification, visual QA, object detection, segmentation, and multi-image comparison. Use this skill when building applications that require processing images, answering visual questions, or detecting objects in visual content.

Skill Documentation

Gemini Vision API Skill

This skill enables Claude to use Google's Gemini API for advanced image understanding tasks including captioning, classification, visual question answering, object detection, segmentation, and multi-image analysis.

Quick Start

Prerequisites

  1. Get API Key: Obtain from Google AI Studio
  2. Install SDK: pip install google-genai (Python 3.9+)
  • If pip is not installed, instruct the user to install it first.

API Key Configuration

The skill supports both Google AI Studio and Vertex AI endpoints.

Option 1: Google AI Studio (Default)

The skill checks for GEMINI_API_KEY in this order:

  1. Process environment: export GEMINI_API_KEY="your-key"
  2. Project root: .env
  3. .claude directory: .claude/.env
  4. .claude/skills directory: .claude/skills/.env
  5. Skill directory: .claude/skills/gemini-vision/.env
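
The lookup order above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function names `find_gemini_api_key` and `read_env_file` are hypothetical, and the simple KEY=VALUE parser is an assumption (the real scripts may use a dotenv library instead).

```python
import os
from pathlib import Path

# Candidate .env locations relative to the project root, in lookup order.
ENV_FILE_CANDIDATES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/gemini-vision/.env",
]


def read_env_file(path, name):
    """Return the value of `name` from a simple KEY=VALUE file, or None."""
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop surrounding quotes; treat an empty value as missing.
            return value.strip().strip("\"'") or None
    return None


def find_gemini_api_key(project_root=".", environ=None):
    """Check the process environment first, then each .env file in order."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_API_KEY"):
        return environ["GEMINI_API_KEY"]
    for relative_path in ENV_FILE_CANDIDATES:
        value = read_env_file(Path(project_root) / relative_path, "GEMINI_API_KEY")
        if value:
            return value
    return None
```

A caller would typically stop and show a configuration hint when the function returns None.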

Get your API key: Visit Google AI Studio

Option 2: Vertex AI

To use Vertex AI instead:

# Enable Vertex AI
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional, defaults to us-central1

Or in .env file:

GEMINI_USE_VERTEX=true
VERTEX_PROJECT_ID=your-gcp-project-id
VERTEX_LOCATION=us-central1
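
As a rough sketch of how these variables might be turned into client settings, the helper below builds keyword arguments for the google-genai `genai.Client` constructor. The helper name `client_kwargs_from_env` is hypothetical, and treating only the value "true" (case-insensitive) as enabling Vertex AI is an assumption about how the skill reads `GEMINI_USE_VERTEX`.

```python
import os


def client_kwargs_from_env(environ=None):
    """Build keyword arguments for genai.Client from the environment."""
    environ = os.environ if environ is None else environ
    # Assumption: only the literal value "true" switches to Vertex AI.
    use_vertex = environ.get("GEMINI_USE_VERTEX", "").strip().lower() == "true"
    if use_vertex:
        project = environ.get("VERTEX_PROJECT_ID")
        if not project:
            raise ValueError("VERTEX_PROJECT_ID is required when GEMINI_USE_VERTEX=true")
        return {
            "vertexai": True,
            "project": project,
            # Location is optional and defaults to us-central1.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    api_key = environ.get("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is not set")
    return {"api_key": api_key}


# Usage (requires `pip install google-genai`):
#   from google import genai
#   client = genai.Client(**client_kwargs_from_env())
```

Keeping the environment parsing separate from client construction makes the branch logic easy to check without network access.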

Security: Never commit API keys to version control. Add .env to .gitignore.

Core Capabilities

Image Analysis

  • Captioning: Generate descriptive text for images
  • Classification: Categorize and identify image content
  • Visual QA: Answer questions about image content
  • Multi-image: Compare and analyze up to 3,600 images

Advanced Features (Model-Specific)

  • Object Detection: Identify and locate objects with bounding boxes (Gemini 2.0+)
  • Segmentation: Create pixel-level masks for objects (Gemini 2.5+)
  • Document Understanding: Process PDFs with vision (up to 1,000 pages)

Supported Formats

  • Images: PNG, JPEG, WEBP, HEIC, HEIF
  • Documents: PDF (up to 1,000 pages)
  • Size Limits:
    • Inline: 20MB max total request size
    • File API: For larger files
    • Max images: 3,600 per request
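
The limits above lend themselves to a small pre-flight check. The sketch below is illustrative only: `plan_upload` is a hypothetical helper, it checks file extensions rather than actual file contents, and it treats 20MB as 20 × 1024 × 1024 bytes, which is an assumption. The real inline limit applies to the whole request, including the prompt and any encoding overhead, so a safety margin may be wise.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".heic", ".heif", ".pdf"}
INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # assumed byte value of the 20MB inline limit
MAX_IMAGES = 3600


def plan_upload(files):
    """Decide how to send files, given a list of (filename, size_in_bytes).

    Returns "inline" when everything fits in one inline request,
    otherwise "file_api". Raises ValueError for unsupported input.
    """
    if not files:
        raise ValueError("at least one file is required")
    if len(files) > MAX_IMAGES:
        raise ValueError(f"too many files: {len(files)} > {MAX_IMAGES}")
    for name, _size in files:
        extension = Path(name).suffix.lower()
        if extension not in SUPPORTED_EXTENSIONS:
            raise ValueError(f"unsupported format: {name}")
    total_bytes = sum(size for _name, size in files)
    return "inline" if total_bytes <= INLINE_LIMIT_BYTES else "file_api"
```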

Available Models

  • gemini-2.5-pro: Most capable, segmentation + detection
  • gemini-2.5-flash: Fast, efficient, segmentation + detection
  • gemini-2.5-flash-lite: Lightweight, segmentation + detection
  • gemini-2.0-flash: Object detection support
  • gemini-1.5-pro/flash: Previous generation
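
One way to encode the model list above is a small capability table. This is a sketch, not part of the skill: the `MODEL_FEATURES` table and `models_supporting` helper are hypothetical, and the table simply restates the features listed here (it assumes the 1.5 generation offers neither detection nor segmentation, since none is listed).

```python
# Features per model, in the same order as the list above.
MODEL_FEATURES = {
    "gemini-2.5-pro": {"segmentation", "detection"},
    "gemini-2.5-flash": {"segmentation", "detection"},
    "gemini-2.5-flash-lite": {"segmentation", "detection"},
    "gemini-2.0-flash": {"detection"},
    "gemini-1.5-pro": set(),
    "gemini-1.5-flash": set(),
}

KNOWN_FEATURES = {"segmentation", "detection"}


def models_supporting(*features):
    """Return the models (in list order) that support every requested feature."""
    required = set(features)
    unknown = required - KNOWN_FEATURES
    if unknown:
        raise ValueError(f"unknown feature(s): {sorted(unknown)}")
    return [model for model, supported in MODEL_FEATURES.items() if required <= supported]
```

For example, `models_supporting("segmentation")` would return only the three 2.5 models, while `models_supporting("detection")` would also include gemini-2.0-flash.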

Usage Examples

Basic Image Analysis

# Analyze a local image
python scripts/analyze-image.py path/to/image.jpg "What's in this image?"

# Analyze from URL
python scripts/analyze-image.py https://example.com/image.jpg "Describe this"

# Specify model
python scripts/analyze-image.py image.jpg "Caption this" --model gemini-2.5-pro

Object Detection (2.0+)

python scripts/analyze-image.py image.jpg "Detect all objects" --model gemini-2.0-flash

Multi-Image Comparison

python scripts/analyze-image.py img1.jpg img2.jpg "What's different between these?"

File Upload (for large files or reuse)

# Upload file
python scripts/upload-file.py path/to/large-image.jpg

# Use uploaded file
python scripts/analyze-image.py file://file-id "Caption this"

File Management

# List uploaded files
python scripts/manage-files.py list

# Get file info
python scripts/manage-files.py get file-id

# Delete file
python scripts/manage-files.py delete file-id

Token Costs

Images consume tokens based on size:

  • Small (≤384px both dimensions): 258 tokens
  • Large: Tiled into 768×768 chunks, 258 tokens each

Token Formula:

crop_unit = floor(min(width, height) / 1.5)
tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258

Example: 960×540 image: crop_unit = 360, tiles = ceil(2.67) × ceil(1.5) = 3 × 2 = 6, total = 1,548 tokens
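
The estimate can be worked through in a few lines of Python. This is a sketch of the formula as described in this section, not an official calculator: `estimate_image_tokens` is a hypothetical name, and it rounds each per-dimension tile count up, which is what reproduces the 6 tiles in the 960×540 example. Actual billing may differ, so treat the result as approximate.

```python
import math

TOKENS_PER_TILE = 258
SMALL_IMAGE_MAX = 384  # images with both dimensions at or below this cost one tile


def estimate_image_tokens(width, height):
    """Approximate the token cost of an image from its pixel dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width <= SMALL_IMAGE_MAX and height <= SMALL_IMAGE_MAX:
        return TOKENS_PER_TILE
    # Guard against a zero crop unit for extremely thin images.
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(960, 540))  # 6 tiles × 258 = 1548
```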

Rate Limits

Limits vary by tier (Free, Tier 1, 2, 3):

  • Measured in RPM (requests/min), TPM (tokens/min), RPD (requests/day)
  • Applied per project, not per API key
  • RPD resets at midnight Pacific

Best Practices

Image Quality

  • Use clear, non-blurry images
  • Verify correct image rotation
  • Consider token costs when sizing

Prompting

  • Be specific in instructions
  • Place text after image for single-image prompts
  • Use few-shot examples for better accuracy
  • Specify output format (JSON, markdown, etc.)

File Management

  • Use File API for files >20MB
  • Use File API for repeated usage (saves tokens)
  • Files auto-delete after 48 hours
  • Clean up manually when done

Security

  • Never expose API keys in code
  • Use environment variables
  • Add API key restrictions in Google Cloud Console
  • Monitor usage regularly
  • Rotate keys periodically

Error Handling

Common errors:

  • 401: Invalid API key
  • 429: Rate limit exceeded
  • 400: Invalid request (check file size, format)
  • 403: Permission denied (check API key restrictions)
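
Of the errors above, only 429 is usually worth retrying; the others point to a configuration or input problem. The sketch below shows one possible exponential-backoff wrapper. `ApiError` is a hypothetical stand-in for whatever exception the SDK or HTTP layer raises, and the retry count and delays are arbitrary illustrative defaults.

```python
import time

RETRYABLE_STATUS = {429}  # rate limit exceeded


class ApiError(Exception):
    """Hypothetical stand-in for an SDK or HTTP error carrying a status code."""

    def __init__(self, status, message=""):
        super().__init__(message or f"HTTP {status}")
        self.status = status


def call_with_backoff(request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `request()`, retrying on 429 with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ApiError as err:
            # Give up on non-retryable errors or once attempts are exhausted.
            if err.status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the wrapper easy to exercise without real waiting.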

Additional Resources

See the references/ directory for:

  • api-reference.md: Detailed API methods and endpoints
  • examples.md: Comprehensive code examples
  • best-practices.md: Advanced tips and optimization strategies

Implementation Guide

When implementing Gemini vision features:

  1. Check API key availability using the lookup order described under API Key Configuration
    • If no key is found, fall back to the workspace default vision model.
    • If the default model is missing or unavailable, surface a clear message to the user explaining the absence and next steps to configure either an API key or model.
  2. Choose appropriate model based on requirements:
    • Need segmentation? Use 2.5+ models
    • Need detection? Use 2.0+ models
    • Need speed? Use Flash variants
    • Need quality? Use Pro variants
  3. Validate inputs:
    • Check file format (PNG, JPEG, WEBP, HEIC, HEIF, PDF)
    • Verify file size (up to 20MB total for inline requests; use the File API for anything larger)
    • Count images (max 3,600)
  4. Handle responses appropriately:
    • Parse structured output if requested
    • Extract bounding boxes for object detection
    • Process segmentation masks if applicable
  5. Manage files efficiently:
    • Upload large files via File API
    • Reuse uploaded files when possible
    • Clean up after use
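
For the bounding-box step, a response parser might look like the sketch below. It rests on assumptions that are not stated in this document: that the model was prompted to return a JSON list, that each item carries a `box_2d` field in `[ymin, xmin, ymax, xmax]` order normalized to 0-1000 (the convention described in Google's image understanding documentation), and that an optional `label` field is present. The name `boxes_to_pixels` is hypothetical; verify the output shape against the model and prompt you actually use.

```python
import json


def boxes_to_pixels(response_text, image_width, image_height):
    """Convert normalized detection boxes in a JSON response to pixel coordinates."""
    detections = json.loads(response_text)
    results = []
    for item in detections:
        # Assumed order: [ymin, xmin, ymax, xmax], each on a 0-1000 scale.
        ymin, xmin, ymax, xmax = item["box_2d"]
        results.append({
            "label": item.get("label", ""),
            "left": round(xmin * image_width / 1000),
            "top": round(ymin * image_height / 1000),
            "right": round(xmax * image_width / 1000),
            "bottom": round(ymax * image_height / 1000),
        })
    return results
```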

Scripts Overview

All scripts use the same API key lookup order described under API Key Configuration:

  • analyze-image.py: Main script for image analysis, supports inline and File API
  • upload-file.py: Upload files to Gemini File API
  • manage-files.py: List, get metadata, and delete uploaded files

Run any script with --help for detailed usage instructions.


Official Documentation: https://ai.google.dev/gemini-api/docs/image-understanding

Quick Install

/plugin add https://github.com/Elios-FPT/EliosCodePracticeService/tree/main/gemini-vision

Copy and paste this command into Claude Code to install the skill.

GitHub Repository

Elios-FPT/EliosCodePracticeService
Path: .claude/skills/gemini-vision
