gemini-audio
About
This skill enables developers to implement Google Gemini API's audio capabilities for both analysis and generation. It can transcribe, summarize, and analyze audio files up to 9.5 hours long, as well as generate natural speech from text with controllable TTS. Use it for processing podcasts, meetings, or any project requiring robust audio-to-text or text-to-speech functionality.
Skill Documentation
Gemini Audio API Skill
Transcribe, analyze, and understand audio, and generate natural speech, using Google's Gemini API. Supports up to 9.5 hours of audio per request in multiple formats.
When to Use This Skill
Use this skill when you need to:
- Transcribe audio files to text with timestamps
- Summarize audio content and extract key points
- Analyze speech, music, or environmental sounds
- Generate speech from text with controllable voice and style
- Process podcasts, interviews, meetings, or any audio content
- Understand non-speech audio (birdsong, sirens, music)
Prerequisites
API Key Setup
The skill supports both Google AI Studio and Vertex AI endpoints.
Option 1: Google AI Studio (Default)
The skill automatically detects your GEMINI_API_KEY in this order:
- Process environment: export GEMINI_API_KEY="your-key"
- Project root: .env
- .claude directory: .claude/.env
- .claude/skills directory: .claude/skills/.env
- Skill directory: .claude/skills/gemini-audio/.env
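The lookup order above can be sketched as follows. This is a minimal illustration rather than the skill's actual implementation: `find_api_key` and `read_env_file` are hypothetical names, and the `.env` parsing only handles simple `KEY=value` lines.

```python
import os
from pathlib import Path

# Search order from the list above, relative to the project root.
ENV_FILE_CANDIDATES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/gemini-audio/.env",
]


def read_env_file(path):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_api_key(project_root=".", environ=None):
    """Return the first GEMINI_API_KEY found in the documented order, or None."""
    environ = os.environ if environ is None else environ
    # 1. The process environment wins over any .env file.
    if environ.get("GEMINI_API_KEY"):
        return environ["GEMINI_API_KEY"]
    # 2. Fall back to the .env files, most general location first.
    for relative in ENV_FILE_CANDIDATES:
        candidate = Path(project_root) / relative
        if candidate.is_file():
            key = read_env_file(candidate).get("GEMINI_API_KEY")
            if key:
                return key
    return None
```

Calling `find_api_key()` from the project root returns the key or `None`, which makes it easy to fail early with a clear message before creating a client.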
Get your API key: Visit Google AI Studio
Create .env file with:
GEMINI_API_KEY=your_api_key_here
Option 2: Vertex AI
To use Vertex AI instead:
# Enable Vertex AI
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional, defaults to us-central1
Or in .env file:
GEMINI_USE_VERTEX=true
VERTEX_PROJECT_ID=your-gcp-project-id
VERTEX_LOCATION=us-central1
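One way to turn these variables into client settings is sketched below. The helper name `client_kwargs_from_env` is hypothetical, and treating only the literal value `true` as enabling Vertex AI is an assumption. The returned keys (`vertexai`, `project`, `location`, `api_key`) are keyword arguments of `genai.Client` in the google-genai SDK.

```python
import os


def client_kwargs_from_env(environ=None):
    """Build keyword arguments for genai.Client from the variables above."""
    environ = os.environ if environ is None else environ
    use_vertex = environ.get("GEMINI_USE_VERTEX", "").strip().lower() == "true"
    if use_vertex:
        project = environ.get("VERTEX_PROJECT_ID")
        if not project:
            raise ValueError(
                "VERTEX_PROJECT_ID is required when GEMINI_USE_VERTEX=true"
            )
        return {
            "vertexai": True,
            "project": project,
            # Falls back to the documented default region.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    # Default path: Google AI Studio with an API key.
    return {"api_key": environ.get("GEMINI_API_KEY")}
```

With the SDK installed, the result can be passed straight through: `client = genai.Client(**client_kwargs_from_env())`.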
Python Setup
Install required package:
pip install google-genai
Quick Start
Audio Analysis (Transcription, Summarization)
from google import genai
import os
# API key auto-detected from environment
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
# Upload audio file
myfile = client.files.upload(file='podcast.mp3')
# Transcribe
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Generate a transcript of the speech.', myfile]
)
print(response.text)
# Summarize
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize the key points in 5 bullets.', myfile]
)
print(response.text)
Using Helper Scripts
# Transcribe audio
python .claude/skills/gemini-audio/scripts/transcribe.py audio.mp3
# Summarize audio
python .claude/skills/gemini-audio/scripts/analyze.py audio.mp3 \
"Summarize key points"
# Analyze specific segment (timestamps in MM:SS format)
python .claude/skills/gemini-audio/scripts/analyze.py audio.mp3 \
"What is discussed from 02:30 to 05:15?"
# Generate speech
python .claude/skills/gemini-audio/scripts/generate-speech.py \
"Welcome to our podcast" \
--output welcome.wav
Audio Understanding Capabilities
Supported Formats
| Format | MIME Type | Best Use |
|---|---|---|
| WAV | audio/wav | Uncompressed, highest quality |
| MP3 | audio/mp3 | Compressed, widely compatible |
| AAC | audio/aac | Compressed, good quality |
| FLAC | audio/flac | Lossless compression |
| OGG Vorbis | audio/ogg | Open format |
| AIFF | audio/aiff | Apple format |
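The table above maps naturally onto a lookup from file extension to MIME type, which is useful when passing inline audio. The sketch below is illustrative: `mime_type_for` is a hypothetical helper, and the `.aif` alias for AIFF is an added assumption.

```python
from pathlib import Path

# Extension-to-MIME mapping taken from the table above.
# The ".aif" alias for AIFF is an assumption added for convenience.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aac": "audio/aac",
    ".flac": "audio/flac",
    ".ogg": "audio/ogg",
    ".aiff": "audio/aiff",
    ".aif": "audio/aiff",
}


def mime_type_for(path):
    """Return the MIME type for a supported audio file, else raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        return AUDIO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(
            f"Unsupported audio format: {suffix or '(none)'}"
        ) from None
```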
Audio Specifications
- Maximum length: 9.5 hours per request
- Multiple files: Unlimited count, combined max 9.5 hours
- Token rate: 32 tokens/second (1 minute = 1,920 tokens)
- Processing: Auto-downsampled to 16 Kbps mono
- File size limits:
  - Inline: 20 MB max total request
  - File API: 2 GB per file, 20 GB project quota
  - Retention: 48 hours auto-delete
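The token rate and length limit above reduce to simple arithmetic. The following sketch uses only the figures stated in this section; the function names are hypothetical.

```python
TOKENS_PER_SECOND = 32
MAX_AUDIO_SECONDS = int(9.5 * 3600)  # 9.5 hours = 34,200 seconds


def audio_tokens(duration_seconds):
    """Approximate input tokens for audio of the given duration."""
    return int(duration_seconds * TOKENS_PER_SECOND)


def fits_in_one_request(durations_seconds):
    """True if the combined duration of all files stays within 9.5 hours."""
    return sum(durations_seconds) <= MAX_AUDIO_SECONDS
```

For example, `audio_tokens(60)` gives the 1,920 tokens quoted for one minute of audio.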
Analysis Features
- Transcription: Full text with punctuation
- Timestamps: Reference segments (MM:SS format)
- Multi-speaker: Identify different speakers
- Non-speech: Analyze music, sounds, ambient audio
- Languages: Support for multiple languages
Speech Generation (TTS)
Available TTS Models
| Model | Quality | Speed | Cost/1M tokens |
|---|---|---|---|
| gemini-2.5-flash-native-audio-preview-09-2025 | High | Fast | $10 |
| gemini-2.5-pro TTS mode | Premium | Slower | $20 |
Controllable Voice Options
- Style: Professional, casual, narrative, conversational
- Pace: Slow, normal, fast
- Tone: Friendly, serious, enthusiastic
- Accent: Natural language control
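Because these options are controlled through natural language, they end up as part of the prompt text. The sketch below shows one possible way to assemble such a prompt; `build_tts_prompt` and its phrasing are hypothetical, and how closely a model follows the directions is not guaranteed.

```python
def build_tts_prompt(text, style=None, pace=None, tone=None, accent=None):
    """Prefix the text with plain-language delivery directions, if any."""
    directions = []
    if style:
        directions.append(f"{style} style")
    if pace:
        directions.append(f"{pace} pace")
    if tone:
        directions.append(f"{tone} tone")
    if accent:
        directions.append(f"{accent} accent")
    if not directions:
        return text
    return f"Read this aloud ({', '.join(directions)}): {text}"
```

The resulting string can be passed as `contents` in a speech generation request.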
TTS Example
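import wave
from google.genai import types
# Model name as listed in this document; confirm it against the current Gemini model list before use
response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents='Generate audio: Welcome to today\'s episode, in a warm, friendly tone.',
    config=types.GenerateContentConfig(response_modalities=['AUDIO'])
)
# Audio is returned as inline bytes; assumed here to be raw 24 kHz, 16-bit mono PCM
pcm = response.candidates[0].content.parts[0].inline_data.data
# Save audio output in a WAV container (channels, sample width, rate, frames, compression)
with wave.open('output.wav', 'wb') as f:
    f.setparams((1, 2, 24000, 0, 'NONE', 'not compressed'))
    f.writeframes(pcm)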
Input Methods
Method 1: File Upload (Recommended for >20MB)
# Upload and reuse
myfile = client.files.upload(file='large-audio.mp3')
# Use file multiple times
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe this', myfile]
)
response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize this', myfile]
)
Method 2: Inline Data (<20MB)
from google.genai import types
with open('small-audio.mp3', 'rb') as f:
    audio_bytes = f.read()
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this audio',
        types.Part.from_bytes(data=audio_bytes, mime_type='audio/mp3')
    ]
)
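Choosing between the two methods can be automated from the file size. The sketch below is a rough guide under stated assumptions: `choose_input_method` is a hypothetical helper, "MB" and "GB" are taken as binary units, and since the 20 MB inline limit covers the whole request (prompt included), files at or near the limit are sent through the File API.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # 20 MB total request size
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3    # 2 GB per file


def choose_input_method(size_bytes, reused=False):
    """Return "inline" or "upload" for an audio file of the given size."""
    if size_bytes > FILE_API_LIMIT_BYTES:
        raise ValueError("File exceeds the 2 GB File API limit")
    # Upload when the file is large or will be used in several requests.
    if reused or size_bytes >= INLINE_LIMIT_BYTES:
        return "upload"
    return "inline"
```

A typical call would be `choose_input_method(os.path.getsize('audio.mp3'))`.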
Common Use Cases
Transcription
python scripts/transcribe.py meeting.mp3 --include-timestamps
Summary with Key Points
python scripts/analyze.py interview.wav "Extract main topics and key quotes"
Speaker Identification
python scripts/analyze.py discussion.mp3 "Identify speakers and extract dialogue"
Segment Analysis
python scripts/analyze.py podcast.mp3 "Summarize content from 10:30 to 15:45"
Non-Speech Analysis
python scripts/analyze.py ambient.wav "Identify all sounds: voices, music, ambient"
Best Practices
File Management
- Use File API for files >20MB or repeated usage
- Files auto-delete after 48 hours
- Manage quota (20 GB project limit)
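Since uploaded files disappear after 48 hours, long-running jobs may want to record upload times and re-upload before a file expires. The following sketch assumes the caller stores the upload time itself; the function names and the 30-minute safety margin are illustrative choices.

```python
from datetime import datetime, timedelta, timezone

FILE_RETENTION = timedelta(hours=48)


def expires_at(uploaded_at):
    """Time at which an uploaded file is expected to be auto-deleted."""
    return uploaded_at + FILE_RETENTION


def needs_reupload(uploaded_at, now=None, safety_margin=timedelta(minutes=30)):
    """True once the file is within the safety margin of its expiry."""
    now = datetime.now(timezone.utc) if now is None else now
    return now >= expires_at(uploaded_at) - safety_margin
```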
Prompt Engineering
- Be specific: "Transcribe from 02:30 to 03:29"
- Use timestamps for segment analysis (MM:SS format)
- Combine tasks: "Transcribe and summarize"
- Provide context: "This is a medical interview"
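Segment prompts like the ones above are easy to generate from offsets in seconds. This is a small sketch with hypothetical helper names; it assumes minutes simply keep counting past 59 for long recordings, which may not match every timestamp convention.

```python
def format_timestamp(total_seconds):
    """Format an offset in seconds as MM:SS."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes:02d}:{seconds:02d}"


def segment_prompt(task, start_seconds, end_seconds, context=None):
    """Build a prompt such as "Transcribe from 02:30 to 03:29."."""
    if end_seconds <= start_seconds:
        raise ValueError("end_seconds must be greater than start_seconds")
    start = format_timestamp(start_seconds)
    end = format_timestamp(end_seconds)
    prompt = f"{task} from {start} to {end}."
    # Optional context sentence goes first, e.g. "This is a medical interview."
    if context:
        prompt = f"{context} {prompt}"
    return prompt
```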
Cost Optimization
- Use gemini-2.5-flash ($1/1M tokens) for most tasks
- Upgrade to gemini-2.5-pro ($3/1M tokens) for complex analysis
- Check token count: 1 min audio = 1,920 tokens
Error Handling
- Validate file format and size before upload
- Implement exponential backoff for rate limits
- Handle 48-hour file expiration
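Exponential backoff for rate limits can be sketched as a generic retry wrapper. Everything here is illustrative: `with_backoff` is a hypothetical helper, and deciding which exceptions count as rate-limit errors is left to the caller-supplied `is_retryable` function, since that depends on the SDK version in use.

```python
import random
import time


def with_backoff(call, is_retryable, max_attempts=5,
                 base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Run call(), retrying retryable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Give up on the last attempt or on non-retryable errors.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 50% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

A request would then be wrapped as `with_backoff(lambda: client.models.generate_content(...), is_retryable=my_check)`, where `my_check` is whatever test identifies a rate-limit error in your setup.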
Token Costs & Pricing
Audio Input (32 tokens/second):
- 1 minute = 1,920 tokens
- 1 hour = 115,200 tokens
- 9.5 hours = 1,094,400 tokens
Model Pricing:
- Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
- Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
- Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output
TTS Pricing:
- Flash TTS: $10/1M tokens
- Pro TTS: $20/1M tokens
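The input-side figures above combine into a quick cost estimate. The sketch below copies the prices from this section as-is (they may be out of date, so check current pricing before budgeting) and covers audio input tokens only, not prompt text or output tokens. The function name is hypothetical.

```python
TOKENS_PER_SECOND = 32

# Input prices in USD per 1M tokens, copied from the list above.
INPUT_PRICE_PER_MILLION = {
    "gemini-2.5-flash": 1.00,
    "gemini-2.5-pro": 3.00,
}


def estimate_input_cost(duration_seconds, model):
    """Estimated USD cost of the audio input tokens for one request."""
    tokens = duration_seconds * TOKENS_PER_SECOND
    return tokens / 1_000_000 * INPUT_PRICE_PER_MILLION[model]
```

Under these prices, a one-hour recording (115,200 tokens) costs roughly $0.12 to process with gemini-2.5-flash.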
Reference Documentation
For detailed information, see:
- references/api-reference.md - Complete API specifications
- references/code-examples.md - Comprehensive code examples
- references/tts-guide.md - Text-to-speech implementation guide
- references/best-practices.md - Advanced optimization strategies
Scripts Overview
All scripts support the automatic API key detection described under API Key Setup. The available scripts are:
- transcribe.py: Generate transcripts with optional timestamps
- analyze.py: General audio analysis with custom prompts
- generate-speech.py: Text-to-speech generation
- manage-files.py: Upload, list, and delete audio files
Run any script with --help for detailed usage.
Resources
Quick Install
/plugin add https://github.com/Elios-FPT/EliosCodePracticeService/tree/main/gemini-audio
Copy and paste this command into Claude Code to install the skill.
