acestep
About
The acestep skill enables AI-powered music generation and audio processing using ACE-Step 1.5 models. It handles background music, vocal tracks, covers, stem extraction, audio repainting, and continuation for video production workflows. Use it for music-related tasks such as soundtrack generation, jingle creation, or audio stem manipulation.
Quick Install
Claude Code
Recommended:
npx skills add digitalsamba/claude-code-video-toolkit -a claude-code
Alternatives:
/plugin add https://github.com/digitalsamba/claude-code-video-toolkit
git clone https://github.com/digitalsamba/claude-code-video-toolkit.git ~/.claude/skills/acestep
Copy and paste one of these commands to install the skill in Claude Code.
Skill Documentation
ACE-Step 1.5 Music Generation
Open-source music generation via tools/music_gen.py.
Cloud providers:
- acemusic (default): Official ACE-Step cloud API with XL Turbo (4B) model + 5Hz LM thinking mode. Free API key from acemusic.ai/api-key. No GPU required.
- modal: Self-hosted ACE-Step 2B Turbo on Modal. Requires MODAL_MUSIC_GEN_ENDPOINT_URL.
- runpod: Self-hosted ACE-Step 2B Turbo on RunPod. Requires RUNPOD_ACESTEP_ENDPOINT_ID.
Setup
# acemusic (recommended — free, best quality, no GPU)
echo "ACEMUSIC_API_KEY=your_key" >> .env
# Get key at https://acemusic.ai/api-key
# Self-hosted (optional fallback)
python tools/music_gen.py --setup # RunPod
modal deploy docker/modal-music-gen/app.py # Modal
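Before running a job it can help to see which providers are actually configured. The following is a minimal sketch that only checks the three environment variable names mentioned above; the function name configured_providers is hypothetical and not part of tools/music_gen.py, and it assumes the variables have already been loaded into the environment (for example from .env).

```python
import os

# Environment variable each provider needs, as listed in this document.
PROVIDER_ENV = {
    "acemusic": "ACEMUSIC_API_KEY",
    "modal": "MODAL_MUSIC_GEN_ENDPOINT_URL",
    "runpod": "RUNPOD_ACESTEP_ENDPOINT_ID",
}


def configured_providers(env=None):
    """Return provider names whose required variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV.items() if env.get(var, "").strip()]
```

Calling configured_providers() with no arguments inspects the current process environment; passing a dict makes the check easy to try out in isolation.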
Quick Reference
# Basic generation (uses acemusic XL Turbo by default)
python tools/music_gen.py --prompt "Upbeat tech corporate" --duration 60 --output bg.mp3
# Generate 4 variations, pick the best
python tools/music_gen.py --prompt "Calm ambient piano" --duration 30 --variations 4 --output ambient.mp3
# Fast mode (disable thinking)
python tools/music_gen.py --no-thinking --prompt "Quick draft" --duration 30 --output draft.mp3
# With musical control
python tools/music_gen.py --prompt "Calm ambient piano" --duration 30 --bpm 72 --key "D Major" --output ambient.mp3
# Scene presets (video production)
python tools/music_gen.py --preset corporate-bg --duration 60 --output bg.mp3
python tools/music_gen.py --preset tension --duration 20 --output problem.mp3
python tools/music_gen.py --preset cta --brand digital-samba --duration 15 --output cta.mp3
# Vocals with lyrics
python tools/music_gen.py --prompt "Indie pop jingle" --lyrics "[verse]\nBuild it better\nShip it faster" --duration 30 --output jingle.mp3
# Cover / style transfer
python tools/music_gen.py --cover --reference theme.mp3 --prompt "Jazz piano version" --duration 60 --output jazz_cover.mp3
# Repaint a weak section
python tools/music_gen.py --repaint --input track.mp3 --repaint-start 15 --repaint-end 25 --prompt "Guitar solo" --output fixed.mp3
# Continue from existing audio
python tools/music_gen.py --continuation --input track.mp3 --prompt "Continue with jazz piano" --output extended.mp3
# Stem extraction
python tools/music_gen.py --extract vocals --input mixed.mp3 --output vocals.mp3
# Fall back to self-hosted
python tools/music_gen.py --cloud modal --prompt "Background music" --duration 60 --output bg.mp3
Fixing "Samey" Output
If generated music sounds repetitive or lacks variety, try these in order:
- Use acemusic cloud (default): the XL Turbo 4B model is significantly more capable than the 2B model on Modal/RunPod
- Keep thinking mode on (default for acemusic): the 5Hz LM enriches sparse prompts into detailed musical descriptions
- Generate variations: --variations 4 generates 4 takes; pick the best
- Use stochastic inference: --infer-method sde adds randomness (same seed gives different results)
- Vary BPM and key across scenes: don't use the same preset for every scene
- Write sparser prompts: "Upbeat indie rock" gives the model more creative freedom than a hyper-detailed description
- Vary seeds: omit --seed to let each generation be unique
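Several of the tips above can be applied in one invocation. This is a rough sketch of building such a command from Python; variety_args is a hypothetical helper, and it assumes --variations and --infer-method can be combined in a single call, which this document does not state explicitly.

```python
def variety_args(prompt, duration, output, variations=4, stochastic=True):
    """Build a music_gen.py argument list that applies the variety tips."""
    args = [
        "python", "tools/music_gen.py",
        "--prompt", prompt,
        "--duration", str(duration),
        "--variations", str(variations),
    ]
    if stochastic:
        # sde is the stochastic inference method listed in this document.
        args += ["--infer-method", "sde"]
    # --seed is deliberately left out so each run draws its own seed.
    args += ["--output", output]
    return args
```

The resulting list can be handed to subprocess.run(args, check=True), which avoids any shell quoting of the prompt.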
Creating a Song (Step by Step)
1. Instrumental background track (simplest)
python tools/music_gen.py --prompt "Upbeat indie rock, driving drums, jangly guitar" --duration 60 --bpm 120 --key "G Major" --output track.mp3
2. Song with vocals and lyrics
Write lyrics in a temp file or pass inline. Use structure tags to control song sections.
# Write lyrics to a file first (recommended for longer songs)
cat > /tmp/lyrics.txt << 'LYRICS'
[Verse 1]
Walking through the morning light
Coffee in my hand feels right
Another day to build and dream
Nothing's ever what it seems
[Chorus - anthemic]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about
[Verse 2]
Screens are glowing late at night
Shipping code until it's right
The deadline's close but so are we
Almost there, just wait and see
[Chorus - bigger]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about
[Outro - fade]
(Moving forward...)
LYRICS
# Generate the song
python tools/music_gen.py \
--prompt "Upbeat indie rock anthem, male vocal, driving drums, electric guitar, studio polish" \
--lyrics "$(cat /tmp/lyrics.txt)" \
--duration 60 \
--bpm 128 \
--key "G Major" \
--output my_song.mp3
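The same call can be driven from Python, which sidesteps shell quoting for multi-line lyrics because the whole text travels as a single argument. This is only a sketch: song_command is a hypothetical wrapper, and it uses just the flags shown in the shell example above.

```python
from pathlib import Path


def song_command(lyrics_path, prompt, duration, bpm, key, output):
    """Read lyrics from a file and build the music_gen.py argument list."""
    lyrics = Path(lyrics_path).read_text(encoding="utf-8")
    return [
        "python", "tools/music_gen.py",
        "--prompt", prompt,
        "--lyrics", lyrics,  # newlines are preserved inside one argument
        "--duration", str(duration),
        "--bpm", str(bpm),
        "--key", key,
        "--output", output,
    ]
```

To run it, pass the list to subprocess.run(..., check=True) from the toolkit's root directory.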
3. Repaint a weak section
If the chorus sounds weak, regenerate just that section:
python tools/music_gen.py --repaint --input my_song.mp3 --repaint-start 20 --repaint-end 35 --prompt "Powerful anthemic chorus, big drums" --output fixed.mp3
4. Continue/extend a track
python tools/music_gen.py --continuation --input my_song.mp3 --prompt "Continue with gentle acoustic outro" --output extended.mp3
Key tips for good results
- Caption = overall style (genre, instruments, mood, production quality)
- Lyrics = temporal structure (verse/chorus flow, vocal delivery)
- UPPERCASE in lyrics = high vocal intensity
- Parentheses = background vocals: "We rise (together)"
- Keep 6-10 syllables per line for natural rhythm
- Don't describe the melody in the caption — describe the sound and feeling
- Use --seed to lock randomness when iterating on prompt/lyrics
Controlling vocal gender
The model doesn't reliably follow "female vocal" or "male vocal" on its own. Use both of these together:
- In the prompt: Be explicit — "solo female singer, alto voice" or "female vocalist only, breathy intimate voice". Adding an artist reference helps (e.g., "Brandi Carlile style").
- In the lyrics: Add [female vocal] tags before each section:
[female vocal]
[Verse 1]
Walking through the morning light...
[female vocal]
[Chorus - anthemic]
WE KEEP MOVING FORWARD...
Just saying "female vocal" in the prompt alone is often ignored. The combination of prompt + lyrics tags is what works.
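Adding the tag by hand before every section is easy to forget in longer songs. The sketch below inserts it automatically; tag_sections and the list of recognised section names are illustrative assumptions, not something the toolkit provides.

```python
import re

# A section header is a line consisting of one bracketed tag, e.g. "[Verse 1]".
SECTION = re.compile(r"^\[(intro|verse|chorus|bridge|outro)\b[^\]]*\]$", re.IGNORECASE)


def tag_sections(lyrics, tag="[female vocal]"):
    """Insert `tag` on its own line before every section header."""
    out = []
    for line in lyrics.splitlines():
        # Skip the insert if the previous line is already the tag.
        if SECTION.match(line.strip()) and (not out or out[-1] != tag):
            out.append(tag)
        out.append(line)
    return "\n".join(out)
```

Because already-tagged headers are skipped, running the function twice on the same lyrics leaves them unchanged.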
Duets and vocal trading
For duets with male/female vocals trading verses, use both the prompt and per-section lyrics tags:
- Prompt: "duet, male and female vocals trading verses, warm harmonies on chorus"
- Lyrics: Tag each section with who sings it:
[Verse 1 - male vocal, storytelling]
First verse lyrics here...
[Chorus - male and female duet, harmonies]
Chorus lyrics here...
[Verse 2 - female vocal, wry]
Second verse lyrics here...
[Bridge - male vocal, spoken]
Spoken bridge...
[Bridge - female vocal, sung]
Sung response...
This reliably produces vocal trading between sections and harmonies on shared parts.
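When the singer assignment lives in a script or spreadsheet, the tagged lyrics can be assembled programmatically. A minimal sketch, assuming sections arrive as (name, direction, text) tuples; format_duet is a hypothetical helper.

```python
def format_duet(sections):
    """Turn (name, direction, text) tuples into tagged lyrics."""
    blocks = []
    for name, direction, text in sections:
        # Sections without a direction get a plain header such as "[Outro]".
        header = f"[{name} - {direction}]" if direction else f"[{name}]"
        blocks.append(f"{header}\n{text.strip()}")
    return "\n".join(blocks)
```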
Scene Presets
| Preset | BPM | Key | Use Case |
|---|---|---|---|
| corporate-bg | 110 | C Major | Professional background, presentations |
| upbeat-tech | 128 | G Major | Product launches, tech demos |
| ambient | 72 | D Major | Overview slides, reflective content |
| dramatic | 90 | D Minor | Reveals, announcements |
| tension | 85 | A Minor | Problem statements, challenges |
| hopeful | 120 | C Major | Solution reveals, resolutions |
| cta | 135 | E Major | Call to action, closing energy |
| lofi | 85 | F Major | Screen recordings, coding demos |
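For scripting, the preset table can be transcribed into a lookup so that a preset's BPM and key serve as a starting point to vary. This is a transcription of the table, not the tool's internal preset definition; whether passing --bpm and --key reproduces a preset exactly is an assumption, since presets may also carry prompt text.

```python
# (bpm, key) per preset, copied from the Scene Presets table.
PRESETS = {
    "corporate-bg": (110, "C Major"),
    "upbeat-tech": (128, "G Major"),
    "ambient": (72, "D Major"),
    "dramatic": (90, "D Minor"),
    "tension": (85, "A Minor"),
    "hopeful": (120, "C Major"),
    "cta": (135, "E Major"),
    "lofi": (85, "F Major"),
}


def explicit_flags(preset):
    """Return the --bpm/--key flags matching a preset row."""
    bpm, key = PRESETS[preset]
    return ["--bpm", str(bpm), "--key", key]
```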
Task Types
text2music (default)
Generate music from text prompt + optional lyrics.
cover
Style transfer from reference audio. Control blend with --cover-strength (0.0-1.0):
- 0.2 — Loose style inspiration (more creative freedom)
- 0.5 — Balanced style transfer
- 0.7 — Close to original structure (default)
- 1.0 — Maximum fidelity to source
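To hear how the blend changes, one option is to render the same cover at each of the strengths listed above and compare. The sketch below only builds the commands; cover_sweep and the output naming scheme are hypothetical.

```python
def cover_sweep(reference, prompt, duration, strengths=(0.2, 0.5, 0.7, 1.0)):
    """Return one music_gen.py argument list per cover strength."""
    commands = []
    for strength in strengths:
        if not 0.0 <= strength <= 1.0:
            raise ValueError(f"cover strength {strength} is outside 0.0-1.0")
        # Hypothetical naming: cover_020.mp3, cover_050.mp3, ...
        output = f"cover_{round(strength * 100):03d}.mp3"
        commands.append([
            "python", "tools/music_gen.py",
            "--cover", "--reference", reference,
            "--cover-strength", str(strength),
            "--prompt", prompt,
            "--duration", str(duration),
            "--output", output,
        ])
    return commands
```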
extract
Stem separation — isolate individual tracks from mixed audio.
Tracks: vocals, drums, bass, guitar, piano, keyboard, strings, brass, woodwinds, other
repainting (acemusic only)
Regenerate a specific time segment within existing audio while preserving the rest.
python tools/music_gen.py --repaint --input track.mp3 --repaint-start 15 --repaint-end 25 --prompt "Guitar solo" --output fixed.mp3
continuation (acemusic only)
Extend existing audio by continuing from where it ends.
python tools/music_gen.py --continuation --input track.mp3 --prompt "Continue with jazz piano" --output extended.mp3
Prompt Engineering
Caption Writing — Layer Dimensions
Write captions by layering multiple descriptive dimensions rather than single-word descriptions.
Dimensions to include:
- Genre/Style: pop, rock, jazz, electronic, lo-fi, synthwave, orchestral
- Emotion/Mood: melancholic, euphoric, dreamy, nostalgic, intimate, tense
- Instruments: acoustic guitar, synth pads, 808 drums, strings, brass, piano
- Timbre: warm, crisp, airy, punchy, lush, polished, raw
- Era: "80s synth-pop", "modern indie", "classical romantic"
- Production: lo-fi, studio-polished, live recording, cinematic
- Vocal: breathy, powerful, falsetto, raspy, spoken word (or "instrumental")
Good: "Slow melancholic piano ballad with intimate female vocal, warm strings building to powerful chorus, studio-polished production"
Bad: "Sad song"
Key Principles
- Specificity over vagueness: describe instruments, mood, production style
- Avoid contradictions: don't request "classical strings" and "hardcore metal" simultaneously
- Repetition reinforces priority: repeat important elements for emphasis
- Sparse captions = more creative freedom: detailed captions constrain the model
- Use metadata params for BPM/key: don't write "120 BPM" in the caption, use --bpm 120
Lyrics Formatting
Structure tags (use in lyrics, not caption):
[Intro]
[Verse]
[Chorus]
[Bridge]
[Outro]
[Instrumental]
[Guitar Solo]
[Build]
[Drop]
[Breakdown]
Vocal control (prefix lines or sections):
[raspy vocal]
[whispered]
[falsetto]
[powerful belting]
[harmonies]
[ad-lib]
Energy indicators:
- UPPERCASE = high intensity ("WE RISE ABOVE")
- Parentheses = background vocals ("We rise (together)")
- Keep 6-10 syllables per line within sections for natural rhythm
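The syllable guideline can be checked roughly before generating. The sketch below estimates syllables by counting vowel groups, which is a crude heuristic for English and will miscount some words; both function names are hypothetical. Short backing-vocal lines in parentheses will also be flagged, so treat the result as a hint rather than a rule.

```python
import re


def rough_syllables(line):
    """Estimate syllables by counting vowel groups in each word."""
    total = 0
    for word in re.findall(r"[a-z']+", line.lower()):
        groups = len(re.findall(r"[aeiouy]+", word))
        # Treat a trailing consonant + "e" as silent, except "-le" endings.
        if groups > 1 and re.search(r"[^aeiouy]e$", word) and not word.endswith("le"):
            groups -= 1
        total += max(groups, 1)
    return total


def lint_lyrics(lyrics, low=6, high=10):
    """Return (line_number, count, text) for sung lines outside low..high."""
    flagged = []
    for number, line in enumerate(lyrics.splitlines(), start=1):
        text = line.strip()
        if not text or text.startswith("["):
            continue  # skip blank lines and bracketed tags
        count = rough_syllables(text)
        if not low <= count <= high:
            flagged.append((number, count, text))
    return flagged
```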
Video Production Integration
Music for Scene Types
| Scene | Preset | Duration | Notes |
|---|---|---|---|
| Title | dramatic or ambient | 3-5s | Short, mood-setting |
| Problem | tension | 10-15s | Dark, unsettling |
| Solution | hopeful | 10-15s | Relief, optimism |
| Demo | lofi or corporate-bg | 30-120s | Non-distracting, matches demo length |
| Stats | upbeat-tech | 8-12s | Building credibility |
| CTA | cta | 5-10s | Maximum energy, punchy |
| Credits | ambient | 5-10s | Gentle fade-out |
Timing Workflow
- Plan scene durations first (from voiceover script)
- Generate music to match: --duration <scene_seconds>
- Music duration is precise (within 0.1s of requested)
- For background music spanning multiple scenes: generate one long track
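The workflow above maps naturally onto a small planning script: one command per scene, or a single total when one long track is preferred. This is a sketch under the assumption that scenes are given as (name, preset, seconds) tuples; scene_commands, total_seconds, and the output directory layout are hypothetical.

```python
def scene_commands(scenes, out_dir="music"):
    """Build one music_gen.py argument list per (name, preset, seconds)."""
    commands = []
    for name, preset, seconds in scenes:
        commands.append([
            "python", "tools/music_gen.py",
            "--preset", preset,
            "--duration", str(seconds),
            "--output", f"{out_dir}/{name}.mp3",
        ])
    return commands


def total_seconds(scenes):
    """Duration to request when one long track spans all scenes."""
    return sum(seconds for _, _, seconds in scenes)
```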
Combining with Voiceover
Background music should be mixed at 10-20% volume in Remotion:
<Audio src={staticFile('voiceover.mp3')} volume={1} />
<Audio src={staticFile('bg-music.mp3')} volume={0.15} />
For music under narration: use instrumental presets (corporate-bg, ambient, lofi).
For music-forward scenes (title, CTA): can use higher volume or vocal tracks.
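For readers who think in decibels, the 10-20% range can be converted as follows. This assumes the volume value acts as a linear amplitude multiplier, which is an assumption about the player rather than something stated here; gain_to_db is a hypothetical helper.

```python
import math


def gain_to_db(volume):
    """Convert a linear gain multiplier (volume > 0) to decibels."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return 20 * math.log10(volume)
```

Under that assumption, 0.10 corresponds to -20 dB, 0.15 to roughly -16.5 dB, and 0.20 to roughly -14 dB relative to the voiceover at full volume.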
Brand Consistency
Use --brand <name> to load hints from brands/<name>/brand.json.
Use --cover --reference brand_theme.mp3 to create variations of a brand's sonic identity.
For consistent sound across a project: fix the seed (--seed 42) and vary only duration/prompt.
Advanced Parameters
| Flag | Default | Description |
|---|---|---|
| --thinking | on (acemusic) | 5Hz LM enriches prompts and generates audio codes |
| --no-thinking | - | Faster generation, skip LM reasoning |
| --variations N | 1 | Generate N variations (1-8, acemusic only) |
| --guidance-scale | 7.0 | Prompt adherence (1.0-15.0) |
| --infer-method | ode | ode (deterministic) or sde (stochastic, more variety) |
| --seed | random | Lock randomness for reproducibility |
Technical Details
- acemusic cloud: XL Turbo 4B DiT + 4B LM, best quality, ~5-15s per generation
- Modal/RunPod: Standard Turbo 2B DiT, no LM, ~2-3s per generation
- Output: 48kHz MP3/WAV/FLAC
- Duration range: 10-600 seconds
- BPM range: 30-300
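The numeric ranges quoted in this document can be checked before a request is sent. The sketch below collects them in one place; check_params is a hypothetical helper, and how tools/music_gen.py itself reacts to out-of-range values is not covered here.

```python
# Inclusive ranges as stated in this document.
LIMITS = {
    "duration": (10, 600),         # seconds
    "bpm": (30, 300),
    "guidance_scale": (1.0, 15.0),
    "variations": (1, 8),
}


def check_params(**params):
    """Return a list of problems; an empty list means everything is in range."""
    problems = []
    for name, value in params.items():
        if name not in LIMITS:
            continue  # parameters without a documented range are ignored
        low, high = LIMITS[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    return problems
```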
When NOT to use ACE-Step
- Voice cloning: use Qwen3-TTS or ElevenLabs instead
- Sound effects: use ElevenLabs SFX (tools/sfx.py)
- Speech/narration: use voiceover tools, not music gen
- Stem extraction from video: extract audio first with FFmpeg, then use --extract
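The two-step route for video sources can be sketched as a pair of commands: FFmpeg drops the video stream, then the extract task runs on the resulting audio. video_stem_commands and the intermediate file name are hypothetical; the stem names come from the extract task list in this document, and -vn is FFmpeg's option for disabling video output.

```python
# Stem names listed under the extract task in this document.
STEMS = ("vocals", "drums", "bass", "guitar", "piano", "keyboard",
         "strings", "brass", "woodwinds", "other")


def video_stem_commands(video, stem="vocals", audio="audio.mp3", output=None):
    """Return [ffmpeg_cmd, extract_cmd] for pulling one stem out of a video."""
    if stem not in STEMS:
        raise ValueError(f"unknown stem: {stem}")
    output = output or f"{stem}.mp3"
    # Step 1: keep only the audio stream of the video file.
    ffmpeg_cmd = ["ffmpeg", "-i", video, "-vn", audio]
    # Step 2: run stem extraction on the extracted audio.
    extract_cmd = ["python", "tools/music_gen.py",
                   "--extract", stem, "--input", audio, "--output", output]
    return [ffmpeg_cmd, extract_cmd]
```

Each list can be passed to subprocess.run(..., check=True) in order, so the second step only runs if FFmpeg succeeds.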
