gemini-document-processing
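About
This skill provides an implementation guide for processing PDF documents with the Google Gemini API's native vision capabilities. Developers can extract text, images, diagrams, charts, and tables, and the skill supports tasks such as structured data extraction, summarization, and document Q&A. Use it when you need to analyze complex documents and convert their content into a structured format.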
Quick Install
Claude Code
Recommended:
/plugin add https://github.com/Elios-FPT/EliosCodePracticeService
Or clone the repository into the skills directory:
git clone https://github.com/Elios-FPT/EliosCodePracticeService.git ~/.claude/skills/gemini-document-processing
Copy and paste the command into Claude Code to install the skill.
Documentation
Gemini Document Processing
Process and analyze PDF documents using Google Gemini's native vision capabilities. Extract structured information, summarize content, answer questions, and understand complex documents with text, images, diagrams, charts, and tables.
Core Capabilities
- PDF Vision Processing: Native understanding of PDFs up to 1,000 pages (258 tokens/page)
- Multimodal Analysis: Process text, images, diagrams, charts, and tables
- Structured Extraction: Output to JSON with schema validation
- Document Q&A: Answer questions based on document content
- Summarization: Generate summaries preserving context
- Format Conversion: Transcribe to HTML while preserving layout
When to Use This Skill
Use this skill when you need to:
- Extract structured data from PDF documents (invoices, resumes, forms)
- Summarize long documents or reports
- Answer questions about PDF content
- Analyze documents with complex layouts, charts, or diagrams
- Convert PDFs to structured formats (JSON, HTML)
- Process multiple documents in batch
- Build document processing pipelines
Quick Setup
1. API Key Configuration
The skill supports both Google AI Studio and Vertex AI endpoints.
Option 1: Google AI Studio (Default)
The skill checks for GEMINI_API_KEY in this priority order:
- Process environment variable
- Project root .env
- .claude/.env
- .claude/skills/.env
- .env file in skill directory (.claude/skills/gemini-document-processing/.env)
Get your API key: https://aistudio.google.com/apikey
Environment Variable (Recommended)
export GEMINI_API_KEY="your-api-key-here"
Or in .env file:
echo "GEMINI_API_KEY=your-api-key-here" > .env
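The lookup order listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the skill's actual implementation: `resolve_api_key`, `read_env_file`, and `ENV_FILE_CANDIDATES` are hypothetical names, and the parser handles only plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Candidate .env locations relative to the project root, in the priority
# order listed above (checked only after the process environment).
ENV_FILE_CANDIDATES = (
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/gemini-document-processing/.env",
)


def read_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(
    project_root: Path, environ: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return the first GEMINI_API_KEY found, or None (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_API_KEY"):
        return environ["GEMINI_API_KEY"]
    for relative in ENV_FILE_CANDIDATES:
        candidate = project_root / relative
        if candidate.is_file():
            key = read_env_file(candidate).get("GEMINI_API_KEY")
            if key:
                return key
    return None
```

Calling `resolve_api_key(Path.cwd())` from the project root would return the key from the highest-priority source that defines it, or `None` if no source does.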
Option 2: Vertex AI
To use Vertex AI instead:
# Enable Vertex AI
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional, defaults to us-central1
Or in .env file:
GEMINI_USE_VERTEX=true
VERTEX_PROJECT_ID=your-gcp-project-id
VERTEX_LOCATION=us-central1
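As a rough sketch of how these variables might be turned into a client, the helper below maps them onto the `vertexai`, `project`, and `location` arguments of `genai.Client`. The function names `client_kwargs` and `make_client` are hypothetical, and the mapping is an assumption about how the skill reads its configuration, not a copy of its code.

```python
import os
from typing import Mapping


def client_kwargs(environ: Mapping[str, str]) -> dict:
    """Build keyword arguments for genai.Client from the variables above."""
    use_vertex = environ.get("GEMINI_USE_VERTEX", "").strip().lower() == "true"
    if use_vertex:
        project = environ.get("VERTEX_PROJECT_ID")
        if not project:
            raise ValueError(
                "VERTEX_PROJECT_ID is required when GEMINI_USE_VERTEX=true"
            )
        return {
            "vertexai": True,
            "project": project,
            # Falls back to the documented default region.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    return {"api_key": environ.get("GEMINI_API_KEY")}


def make_client():
    """Create a client for whichever endpoint the environment selects."""
    # Imported lazily so client_kwargs can be used without the SDK installed.
    from google import genai

    return genai.Client(**client_kwargs(os.environ))
```

Keeping the environment parsing in a pure function makes the endpoint choice easy to check without network access or credentials.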
2. Install Dependencies
pip install google-genai python-dotenv
Common Use Cases
1. Extract Structured Data from PDF
# Use the provided script
python .claude/skills/gemini-document-processing/scripts/process-document.py \
--file invoice.pdf \
--prompt "Extract invoice details as JSON" \
--format json
2. Summarize Long Document
# Process and summarize
python .claude/skills/gemini-document-processing/scripts/process-document.py \
--file report.pdf \
--prompt "Provide a concise executive summary"
3. Answer Questions About Document
# Q&A on document content
python .claude/skills/gemini-document-processing/scripts/process-document.py \
--file contract.pdf \
--prompt "What are the key terms and conditions?"
4. Process with Python SDK
from google import genai
from google.genai import types

client = genai.Client()

# Read the PDF as raw bytes
with open('document.pdf', 'rb') as f:
    pdf_data = f.read()

# Send the prompt and the PDF in a single request
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract key information from this document',
        types.Part.from_bytes(
            data=pdf_data,
            mime_type='application/pdf'
        )
    ]
)
print(response.text)
5. Structured Output with JSON Schema
from pathlib import Path

from google import genai
from google.genai import types
from pydantic import BaseModel

# Schema the JSON response must follow
class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total: float
    vendor: str

client = genai.Client()
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract invoice details',
        types.Part.from_bytes(
            data=Path('invoice.pdf').read_bytes(),
            mime_type='application/pdf'
        )
    ],
    config=types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=InvoiceData
    )
)
# Validate the returned JSON against the schema
invoice_data = InvoiceData.model_validate_json(response.text)
Key Constraints
- Format: Only PDFs get vision processing (TXT, HTML, Markdown are text-only)
- Size: Use inline encoding for files under 20MB and the File API for larger files
- Pages: Max 1,000 pages per document
- Storage: The File API retains uploaded files for 48 hours only
- Cost: 258 tokens per page (fixed, regardless of content density)
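Because the per-page cost is fixed, the input tokens a PDF consumes can be estimated before sending it. The sketch below simply multiplies the page count by the figure stated above and enforces the page limit. `estimate_pdf_tokens` is a hypothetical helper, and the estimate excludes the tokens used by the prompt and the response.

```python
TOKENS_PER_PAGE = 258  # fixed per-page cost stated above
MAX_PAGES = 1_000      # per-document page limit stated above


def estimate_pdf_tokens(page_count: int) -> int:
    """Estimate the input tokens consumed by the pages of one PDF."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count > MAX_PAGES:
        raise ValueError(f"{page_count} pages exceeds the {MAX_PAGES}-page limit")
    return page_count * TOKENS_PER_PAGE


print(estimate_pdf_tokens(12))     # 3096
print(estimate_pdf_tokens(1000))   # 258000
```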
Performance Tips
- Use Inline Encoding for PDFs < 20MB (simpler, single request)
- Use File API for larger files or repeated queries (enables context caching)
- Place Prompt After PDF for single-page documents
- Use Context Caching when querying same PDF multiple times
- Process in Parallel for multiple independent documents
- Use gemini-2.5-flash for best price/performance ratio
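For the parallel-processing tip, one possible approach is a thread pool, since each request spends most of its time waiting on the network. The sketch below is an assumption about how such a batch could be organized: `process_many` is a hypothetical helper, `process_one` stands for whatever function sends a single document to the API, and the default of four workers is arbitrary and should be tuned to your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Union


def process_many(
    paths: Iterable[str],
    process_one: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, Union[str, Exception]]:
    """Run process_one on each path concurrently; map path -> result or error."""
    results: Dict[str, Union[str, Exception]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one failed document should not stop the batch
                results[path] = exc
    return results
```

Failures are returned alongside successes, so the caller can retry only the documents whose value is an exception.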
Decision Guide
PDF < 20MB?
├─ Yes → Use inline base64 encoding
└─ No → Use File API
Need structured JSON output?
├─ Yes → Define response_schema with Pydantic
└─ No → Get text response
Multiple queries on same PDF?
├─ Yes → Use File API + Context Caching
└─ No → Inline encoding is sufficient
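The first and third branches of the guide can be expressed as a small helper. This is a sketch under stated assumptions: `choose_strategy` and `ask_pdf` are hypothetical names, "20MB" is read as 20 × 1024 × 1024 bytes, and context caching itself is left out. The SDK calls used (`client.files.upload`, `types.Part.from_bytes`, `client.models.generate_content`) come from the google-genai package.

```python
from pathlib import Path

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # assumption: "20MB" read as 20 MiB


def choose_strategy(size_bytes: int, repeated_queries: bool = False) -> str:
    """Return 'inline' or 'file_api' following the decision guide."""
    if repeated_queries or size_bytes >= INLINE_LIMIT_BYTES:
        return "file_api"
    return "inline"


def ask_pdf(client, path: str, prompt: str, repeated_queries: bool = False) -> str:
    """Send one prompt about a PDF, uploading via the File API when needed."""
    from google.genai import types  # lazy import: only needed for real calls

    pdf = Path(path)
    if choose_strategy(pdf.stat().st_size, repeated_queries) == "file_api":
        # Upload once; the returned file object can be reused across requests.
        document = client.files.upload(file=str(pdf))
    else:
        document = types.Part.from_bytes(
            data=pdf.read_bytes(), mime_type="application/pdf"
        )
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[document, prompt],
    )
    return response.text
```

With a client created as in the SDK example, `ask_pdf(client, "report.pdf", "Summarize")` would pick the transfer method from the file size alone.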
Script Reference
The skill includes a ready-to-use processing script:
# Basic usage
python scripts/process-document.py --file document.pdf --prompt "Your prompt"
# With JSON output
python scripts/process-document.py --file document.pdf --prompt "Extract data" --format json
# With File API (for large files)
python scripts/process-document.py --file large-document.pdf --prompt "Summarize" --use-file-api
# Multiple prompts
python scripts/process-document.py --file document.pdf --prompt "Question 1" --prompt "Question 2"
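To illustrate how a command line like the ones above could be parsed, here is a hypothetical `argparse` sketch. It is not the code of `process-document.py`: only the flag names come from the examples above, and the `text` choice and default for `--format` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the examples above."""
    parser = argparse.ArgumentParser(description="Process a PDF with Gemini")
    parser.add_argument("--file", required=True, help="Path to the PDF")
    # action="append" collects every --prompt occurrence into a list.
    parser.add_argument("--prompt", action="append", required=True,
                        help="Prompt to run; repeat the flag for several prompts")
    # Assumption: "text" as the default is a guess; only "json" appears above.
    parser.add_argument("--format", choices=["text", "json"], default="text",
                        help="Output format")
    parser.add_argument("--use-file-api", action="store_true",
                        help="Upload through the File API instead of inline bytes")
    return parser


args = build_parser().parse_args(
    ["--file", "document.pdf", "--prompt", "Question 1", "--prompt", "Question 2"]
)
print(args.prompt)        # ['Question 1', 'Question 2']
print(args.use_file_api)  # False
```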
References
For comprehensive documentation, see:
- references/gemini-document-processing-report.md - Complete API reference
- references/quick-reference.md - Quick lookup guide
- references/code-examples.md - Additional code patterns
Troubleshooting
API Key Not Found:
# Check API key is set
./scripts/check-api-key.sh
File Too Large:
- Use File API for files > 20MB
- Add the --use-file-api flag to the script
Vision Not Working:
- Ensure file is PDF format
- Other formats (TXT, HTML) don't support vision processing
Support
- API Documentation: https://ai.google.dev/gemini-api/docs/document-processing
- Get API Key: https://aistudio.google.com/apikey
- Model Info: https://ai.google.dev/gemini-api/docs/models/gemini
