
gemini-document-processing

Elios-FPT
Updated today

About

This skill is a guide to processing PDF documents with the Google Gemini API's native vision capabilities. It covers extracting text, images, diagrams, charts, and tables, and supports tasks such as structured data extraction, summarization, and document Q&A. Use it when you need to analyze complex documents and convert their content into structured formats.

Skill Documentation

Gemini Document Processing

Process and analyze PDF documents using Google Gemini's native vision capabilities. Extract structured information, summarize content, answer questions, and understand complex documents with text, images, diagrams, charts, and tables.

Core Capabilities

  • PDF Vision Processing: Native understanding of PDFs up to 1,000 pages (258 tokens/page)
  • Multimodal Analysis: Process text, images, diagrams, charts, and tables
  • Structured Extraction: Output to JSON with schema validation
  • Document Q&A: Answer questions based on document content
  • Summarization: Generate summaries preserving context
  • Format Conversion: Transcribe to HTML while preserving layout

When to Use This Skill

Use this skill when you need to:

  • Extract structured data from PDF documents (invoices, resumes, forms)
  • Summarize long documents or reports
  • Answer questions about PDF content
  • Analyze documents with complex layouts, charts, or diagrams
  • Convert PDFs to structured formats (JSON, HTML)
  • Process multiple documents in batch
  • Build document processing pipelines

Quick Setup

1. API Key Configuration

The skill supports both Google AI Studio and Vertex AI endpoints.

Option 1: Google AI Studio (Default)

The skill checks for GEMINI_API_KEY in this priority order:

  1. Process environment variable
  2. Project root .env
  3. .claude/.env
  4. .claude/skills/.env
  5. .env file in skill directory (.claude/skills/gemini-document-processing/.env)
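
The lookup order above can be sketched as follows. This is a minimal illustration, not the skill's actual loader: the function names `read_env_file` and `resolve_api_key` are hypothetical, and the parser assumes simple KEY=VALUE lines (the skill itself may rely on python-dotenv instead).

```python
import os
from pathlib import Path
from typing import Mapping, Optional

SKILL_DIR = ".claude/skills/gemini-document-processing"


def read_env_file(path: Path, name: str) -> Optional[str]:
    """Return the value of `name` from a simple KEY=VALUE file, or None."""
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip("'\"") or None
    return None


def resolve_api_key(
    project_root: Path, environ: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Check the process environment first, then each .env file in priority order."""
    if environ.get("GEMINI_API_KEY"):
        return environ["GEMINI_API_KEY"]
    candidates = [
        project_root / ".env",
        project_root / ".claude" / ".env",
        project_root / ".claude" / "skills" / ".env",
        project_root / SKILL_DIR / ".env",
    ]
    for candidate in candidates:
        value = read_env_file(candidate, "GEMINI_API_KEY")
        if value:
            return value
    return None
```

The first location that yields a non-empty value wins, so a key exported in the shell overrides every .env file.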

Get your API key: https://aistudio.google.com/apikey

Environment Variable (Recommended)

export GEMINI_API_KEY="your-api-key-here"

Or add it to a .env file (>> appends, so an existing .env is not overwritten):

echo "GEMINI_API_KEY=your-api-key-here" >> .env

Option 2: Vertex AI

To use Vertex AI instead:

# Enable Vertex AI
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional, defaults to us-central1

Or in .env file:

GEMINI_USE_VERTEX=true
VERTEX_PROJECT_ID=your-gcp-project-id
VERTEX_LOCATION=us-central1
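
One way these variables could map onto the SDK client is sketched below. The helper name `client_kwargs` is hypothetical, and treating "1" and "yes" as true is an assumption; the skill's own script may interpret GEMINI_USE_VERTEX differently. The returned keys (`vertexai`, `project`, `location`, `api_key`) are keyword arguments of `genai.Client` in the google-genai SDK.

```python
import os
from typing import Any, Dict, Mapping


def client_kwargs(environ: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Translate the environment variables above into genai.Client keyword arguments."""
    # Assumption: "1", "true", and "yes" (any case) all enable Vertex AI
    flag = environ.get("GEMINI_USE_VERTEX", "").strip().lower()
    if flag in {"1", "true", "yes"}:
        project = environ.get("VERTEX_PROJECT_ID")
        if not project:
            raise ValueError("VERTEX_PROJECT_ID is required when GEMINI_USE_VERTEX is true")
        return {
            "vertexai": True,
            "project": project,
            # Same default as documented above
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    api_key = environ.get("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is not set")
    return {"api_key": api_key}


# Usage (requires google-genai):
#   from google import genai
#   client = genai.Client(**client_kwargs())
```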

2. Install Dependencies

pip install google-genai python-dotenv

Common Use Cases

1. Extract Structured Data from PDF

# Use the provided script
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file invoice.pdf \
  --prompt "Extract invoice details as JSON" \
  --format json

2. Summarize Long Document

# Process and summarize
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file report.pdf \
  --prompt "Provide a concise executive summary"

3. Answer Questions About Document

# Q&A on document content
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file contract.pdf \
  --prompt "What are the key terms and conditions?"

4. Process with Python SDK

from google import genai

client = genai.Client()

# Read PDF
with open('document.pdf', 'rb') as f:
    pdf_data = f.read()

# Process document
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract key information from this document',
        genai.types.Part.from_bytes(
            data=pdf_data,
            mime_type='application/pdf'
        )
    ]
)

print(response.text)

5. Structured Output with JSON Schema

from pathlib import Path

from google import genai
from google.genai import types
from pydantic import BaseModel

# Schema the model's JSON response must follow
class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total: float
    vendor: str

client = genai.Client()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract invoice details',
        types.Part.from_bytes(
            # read_bytes() opens and closes the file, so no handle is left open
            data=Path('invoice.pdf').read_bytes(),
            mime_type='application/pdf'
        )
    ],
    config=types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=InvoiceData
    )
)

# Validate the returned JSON against the schema
invoice_data = InvoiceData.model_validate_json(response.text)

Key Constraints

  • Format: Only PDFs get vision processing (TXT, HTML, Markdown are text-only)
  • Size: Use inline encoding for requests under 20 MB; use the File API for anything larger
  • Pages: Max 1,000 pages per document
  • Storage: The File API keeps uploaded files for 48 hours only
  • Cost: 258 tokens per page (fixed, regardless of content density)
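
Because the per-page cost is fixed, the input tokens a PDF contributes can be estimated before sending anything. The sketch below uses only the two figures stated above; the function name is hypothetical, and prompt and output tokens come on top of this estimate.

```python
TOKENS_PER_PAGE = 258  # fixed per-page cost stated above
MAX_PAGES = 1_000      # page limit stated above


def estimate_page_tokens(page_count: int) -> int:
    """Estimate the input tokens contributed by a PDF's pages."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count > MAX_PAGES:
        raise ValueError(f"documents are limited to {MAX_PAGES} pages")
    return page_count * TOKENS_PER_PAGE


print(estimate_page_tokens(40))     # 10320
print(estimate_page_tokens(1_000))  # 258000
```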

Performance Tips

  1. Use Inline Encoding for PDFs < 20MB (simpler, single request)
  2. Use File API for larger files or repeated queries (enables context caching)
  3. Place Prompt After PDF for single-page documents
  4. Use Context Caching when querying same PDF multiple times
  5. Process in Parallel for multiple independent documents
  6. Use gemini-2.5-flash for best price/performance ratio
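
Tip 5 can be sketched with a thread pool, since each request mostly waits on the network. This is an illustration under assumptions: `process_many` is a hypothetical helper, and `process_one` stands for whatever function sends a single document to Gemini and returns its text.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Union

Result = Union[str, Exception]


def process_many(
    paths: Iterable[str],
    process_one: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, Result]:
    """Run process_one on each path concurrently; map each path to its result or error."""
    results: Dict[str, Result] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # Keep going: one failed document should not abort the batch
                results[path] = exc
    return results
```

Keeping `max_workers` small is a cautious default; the right value depends on the rate limits of the account in use.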

Decision Guide

PDF < 20MB?
├─ Yes → Use inline base64 encoding
└─ No  → Use File API

Need structured JSON output?
├─ Yes → Define response_schema with Pydantic
└─ No  → Get text response

Multiple queries on same PDF?
├─ Yes → Use File API + Context Caching
└─ No  → Inline encoding is sufficient
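
The three questions above can be combined into one small planning function. This is a sketch: `Plan` and `plan_request` are hypothetical names, and reading "20MB" as 20 × 1024 × 1024 bytes is an assumption.

```python
from dataclasses import dataclass

# Assumption: the 20MB threshold is interpreted as 20 * 1024 * 1024 bytes
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


@dataclass(frozen=True)
class Plan:
    upload: str       # "inline" or "file_api"
    use_schema: bool  # define a response_schema with Pydantic
    use_cache: bool   # use context caching


def plan_request(size_bytes: int, wants_json: bool, repeated_queries: bool) -> Plan:
    """Apply the decision guide: size and reuse pick the upload path, JSON picks the schema."""
    needs_file_api = size_bytes >= INLINE_LIMIT_BYTES or repeated_queries
    return Plan(
        upload="file_api" if needs_file_api else "inline",
        use_schema=wants_json,
        use_cache=repeated_queries,
    )


print(plan_request(2_000_000, wants_json=True, repeated_queries=False))
# Plan(upload='inline', use_schema=True, use_cache=False)
```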

Script Reference

The skill includes a ready-to-use processing script:

# Basic usage
python scripts/process-document.py --file document.pdf --prompt "Your prompt"

# With JSON output
python scripts/process-document.py --file document.pdf --prompt "Extract data" --format json

# With File API (for large files)
python scripts/process-document.py --file large-document.pdf --prompt "Summarize" --use-file-api

# Multiple prompts
python scripts/process-document.py --file document.pdf --prompt "Question 1" --prompt "Question 2"
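
The flags used above could be parsed along these lines. This is not the skill's actual script, only an argparse sketch consistent with the commands shown; the "text" default for --format and the help strings are assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="process-document.py",
        description="Process a PDF with Gemini (illustrative sketch)",
    )
    parser.add_argument("--file", required=True, help="Path to the PDF to process")
    # action="append" collects every --prompt occurrence into a list
    parser.add_argument("--prompt", action="append", required=True,
                        help="Prompt to run; repeat the flag for several prompts")
    # Assumption: plain text is the default output format
    parser.add_argument("--format", choices=["text", "json"], default="text",
                        help="Output format")
    parser.add_argument("--use-file-api", action="store_true",
                        help="Upload through the File API instead of inline encoding")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    return build_parser().parse_args(argv)


args = parse_args(["--file", "document.pdf", "--prompt", "Question 1", "--prompt", "Question 2"])
print(args.prompt)  # ['Question 1', 'Question 2']
```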

References

For comprehensive documentation, see:

  • references/gemini-document-processing-report.md - Complete API reference
  • references/quick-reference.md - Quick lookup guide
  • references/code-examples.md - Additional code patterns

Troubleshooting

API Key Not Found:

# Check API key is set
./scripts/check-api-key.sh

File Too Large:

  • Use File API for files > 20MB
  • Add --use-file-api flag to the script
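
With the Python SDK, the File API path looks roughly like the sketch below. The helper name `ask_large_pdf` is hypothetical; it takes the client as a parameter so the same function works with `genai.Client()` or a stand-in during testing. `client.files.upload` and `client.models.generate_content` are the google-genai calls it relies on.

```python
from typing import Any


def ask_large_pdf(client: Any, path: str, prompt: str,
                  model: str = "gemini-2.5-flash") -> str:
    """Upload `path` through the File API, then ask `prompt` about it."""
    # The file is stored server-side instead of being embedded in the request
    uploaded = client.files.upload(file=path)
    response = client.models.generate_content(
        model=model,
        contents=[uploaded, prompt],
    )
    return response.text


# Usage (requires google-genai and a configured API key):
#   from google import genai
#   print(ask_large_pdf(genai.Client(), "large-document.pdf", "Summarize"))
```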

Vision Not Working:

  • Ensure file is PDF format
  • Other formats (TXT, HTML) don't support vision processing
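
A quick local check can catch files that carry a .pdf extension but are not PDFs. The sketch below looks for the "%PDF-" header marker near the start of the file; the function name is hypothetical, and scanning the first 1024 bytes is a lenient heuristic, not a full validation.

```python
from pathlib import Path
from typing import Union


def looks_like_pdf(path: Union[str, Path]) -> bool:
    """Return True if the file starts with (or nearly starts with) a PDF header."""
    with open(path, "rb") as f:
        head = f.read(1024)
    # PDF files begin with a "%PDF-<version>" header line
    return b"%PDF-" in head
```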

Quick Install

/plugin add https://github.com/Elios-FPT/EliosCodePracticeService/tree/main/gemini-document-processing

Copy and paste this command into Claude Code to install the skill.

GitHub Repository

Elios-FPT/EliosCodePracticeService
Path: .claude/skills/gemini-document-processing
