Stable Audio 3 can take a one-sentence prompt and return a full 6-minute track in under 10 seconds on an H200, or a few seconds on an M4 MacBook Pro. The hard part is what comes next: turning that raw stereo render into something you can actually release. This walkthrough takes a single text prompt through prompt design, generation, stem separation, DAW arrangement, and basic mastering, finishing with a track ready for distribution. Total active time is roughly 45 minutes on a laptop. Total cost is zero if you use the open-weights Medium model, or under $1 if you generate via a hosted Space.
What You Need
- Stable Audio 3 Medium (2B parameters, open weights, Stability AI Community License) or access to a hosted demo
- stable-audio-tools repo (MIT) for local inference
- A DAW: Ableton Live, FL Studio, Logic Pro, or the free Audacity for basic editing
- A stem-separation tool (Demucs, LALAL.AI, or RipX) for splitting the rendered stereo into drums, bass, vocals, and instruments
- A mastering chain. Free options include the LANDR web service or the open-source matchering library. Paid options include iZotope Ozone
- Hardware: an Apple Silicon Mac, a recent NVIDIA GPU with 12 GB or more of VRAM, or any machine with internet access for the hosted route
The Workflow
Step 1: Design the prompt
Stable Audio 3 responds best to prompts that specify genre, tempo, instrumentation, mood, and reference era in a single sentence. The model was trained on 1,278,902 audio files split between licensed AudioSparx recordings and CC-licensed Freesound clips, so it has broad coverage of both musical genres and sound effects. Lean into descriptors the model has likely seen many times.
A working template: "[Tempo in BPM] [genre] track with [lead instrument], [supporting instrumentation], [percussion description], [mood], [era or production reference]." Example: "105 BPM downtempo electronic track with detuned analog synth lead, deep sub bass, dusty boom-bap drums, melancholic mood, late-90s trip-hop production."
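The template can be kept as a small data structure so that every take uses the same field order. This is a minimal sketch in plain Python; `PromptSpec` and its field names are hypothetical and not part of any Stable Audio tooling.

```python
from dataclasses import dataclass


@dataclass
class PromptSpec:
    """One field per slot in the prompt template (names are illustrative)."""
    bpm: int
    genre: str
    lead: str
    supporting: str
    percussion: str
    mood: str
    reference: str

    def render(self) -> str:
        # Same order as the template: tempo, genre, lead, support, drums, mood, era.
        return (
            f"{self.bpm} BPM {self.genre} track with {self.lead}, "
            f"{self.supporting}, {self.percussion}, {self.mood}, {self.reference}."
        )


spec = PromptSpec(
    bpm=105,
    genre="downtempo electronic",
    lead="detuned analog synth lead",
    supporting="deep sub bass",
    percussion="dusty boom-bap drums",
    mood="melancholic mood",
    reference="late-90s trip-hop production",
)
prompt = spec.render()
print(prompt)
```

Changing one field at a time between takes makes it easier to hear which descriptor the model is actually responding to.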
Avoid lyrics. Stable Audio 3 generates instrumental music and sound effects, not sung vocals. If you want vocals, layer them in your DAW from a separate vocal generation model or your own recording.
Step 2: Generate at maximum length
Set the generation length to the model maximum. Stable Audio 3 supports variable-length outputs from a few seconds up to six minutes and twenty seconds per render. Generating long gives you raw material to arrange. Shorter renders are fine for one-shots and ad music but they limit your editing options.
Inference settings to start with: 8 steps, classifier-free guidance scale around 4 to 6, seed left random. The model uses an adversarial post-training optimization that converges in eight steps. Pushing to 16 or 24 steps rarely improves quality on the Medium variant and roughly doubles or triples wall-clock time.
Generate three to five candidates per prompt. The output is a 44.1 kHz stereo WAV. Save each take with the seed embedded in the filename so you can reproduce favorites.
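One way to keep takes reproducible is to pick the seeds yourself and write them into the filename before rendering. The sketch below only plans seeds and filenames; it does not call the model, and the helper names (`slugify`, `take_path`, `plan_takes`) and the naming scheme are assumptions made for illustration.

```python
import random
import re
from pathlib import Path


def slugify(text: str, max_len: int = 40) -> str:
    """Lowercase the prompt and collapse anything non-alphanumeric to '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug[:max_len].rstrip("-")


def take_path(out_dir: str, prompt: str, seed: int,
              steps: int = 8, cfg: float = 5.0) -> Path:
    """Build a filename that records the settings needed to redo the take."""
    name = f"{slugify(prompt)}_steps{steps}_cfg{cfg:g}_seed{seed}.wav"
    return Path(out_dir) / name


def plan_takes(prompt: str, count: int = 4, out_dir: str = "takes", rng=None):
    """Return (seed, path) pairs; pass each seed to your inference script."""
    rng = rng or random.Random()
    seeds = [rng.randrange(2**32) for _ in range(count)]
    return [(seed, take_path(out_dir, prompt, seed)) for seed in seeds]


for seed, path in plan_takes("105 BPM downtempo electronic track", count=3):
    print(seed, path)
```

Drawing the seed up front rather than letting the script randomize it means the value in the filename is always the one that produced the audio.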
Step 3: Split into stems
The raw render is a single stereo file. To arrange and mix, you need stems, so run the take through a stem separator. Demucs v4 is open source and splits a track into drums, bass, vocals, and other in roughly real time on a recent GPU. LALAL.AI and RipX are paid alternatives that may separate more cleanly.
For instrumental tracks Stable Audio 3 produces, the "vocals" stem will usually be empty or contain melodic instrument bleed. That is normal. You will work with the drums, bass, and other (harmonic content) stems in the next steps.
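The separation step can be scripted around the Demucs command-line tool. This is a sketch that assumes `demucs` is installed and on your PATH; the helper names are hypothetical, and the output layout (`<out>/<model>/<track name>/<stem>.wav`) reflects how Demucs v4 typically writes its results, so check it against your installed version.

```python
import subprocess
from pathlib import Path

STEMS = ("drums", "bass", "other", "vocals")


def demucs_command(track: str, out_dir: str = "separated",
                   model: str = "htdemucs") -> list[str]:
    """Build the CLI call: -n selects the model, -o the output folder."""
    return ["demucs", "-n", model, "-o", out_dir, track]


def stem_paths(track: str, out_dir: str = "separated",
               model: str = "htdemucs") -> dict[str, Path]:
    """Where each stem is expected to land after separation."""
    base = Path(out_dir) / model / Path(track).stem
    return {name: base / f"{name}.wav" for name in STEMS}


def separate(track: str, out_dir: str = "separated",
             model: str = "htdemucs") -> dict[str, Path]:
    """Run Demucs and return the expected stem paths (raises on failure)."""
    subprocess.run(demucs_command(track, out_dir, model), check=True)
    return stem_paths(track, out_dir, model)


print(demucs_command("takes/take_seed42.wav"))
```

Passing `model="htdemucs_ft"` to `separate` switches to the fine-tuned variant without changing anything else in the script.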
Step 4: Arrange in your DAW
Import the three stems on separate tracks. Tempo-map the project to the BPM you prompted for. If the model drifted off your target tempo, use your DAW's beat-detection (Ableton's Warp, FL's Stretch, Logic's Smart Tempo) to lock everything to a grid.
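If the detected tempo is off, the correction is a simple ratio. The sketch below shows the arithmetic a time-stretch performs, assuming a constant tempo across the render; the function names are illustrative, and the detected BPM is whatever your DAW reports.

```python
def stretch_ratio(detected_bpm: float, target_bpm: float) -> float:
    """Speed factor that moves detected_bpm onto target_bpm (>1 means faster)."""
    return target_bpm / detected_bpm


def stretched_duration(seconds: float, detected_bpm: float,
                       target_bpm: float) -> float:
    """Length of the audio after stretching to the target tempo."""
    return seconds / stretch_ratio(detected_bpm, target_bpm)


ratio = stretch_ratio(detected_bpm=103.0, target_bpm=105.0)
print(f"speed up by {100 * (ratio - 1):.2f}%")
print(f"380 s render becomes {stretched_duration(380, 103.0, 105.0):.1f} s")
```

Small corrections of a few percent are usually inaudible; larger ones are a sign that regenerating with a different seed may be cheaper than stretching.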
From here the workflow is conventional production. Cut the song into intro, verse, chorus, bridge, and outro by chopping the rendered audio into regions and rearranging. Drop in your own one-shots, layered drums, or a separately generated vocal to lift sections. Stable Audio 3 also supports inpainting through its semantic-acoustic autoencoder, so if a four-bar section feels weak you can regenerate just that window without redrawing the rest.
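Both chopping regions and choosing an inpainting window come down to converting bars into seconds. This sketch assumes 4/4 time and a constant tempo; it only computes the time window and does not show the model's inpainting interface, whose parameters are not covered here.

```python
def bar_seconds(bpm: float, beats_per_bar: int = 4) -> float:
    """Duration of one bar in seconds."""
    return beats_per_bar * 60.0 / bpm


def region(bpm: float, start_bar: int, n_bars: int,
           beats_per_bar: int = 4) -> tuple[float, float]:
    """(start, end) in seconds for n_bars beginning at 1-indexed start_bar."""
    length = bar_seconds(bpm, beats_per_bar)
    start = (start_bar - 1) * length
    return start, start + n_bars * length


# A weak four-bar section starting at bar 17 of a 105 BPM track.
start, end = region(bpm=105, start_bar=17, n_bars=4)
print(f"regenerate {start:.2f} s to {end:.2f} s")
```

The same function gives cut points for the intro, verse, and chorus regions once you have decided how many bars each should run.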
This is where the workflow stops feeling like prompt engineering and starts feeling like production. The model gave you a 6-minute sketch. The arrangement decisions are yours.
Step 5: Mix and master
Run each stem through subtractive EQ to clean overlap. The drum stem usually wants a high-pass at 30 to 40 Hz to remove rumble. The bass stem benefits from a low-pass around 5 kHz to keep it out of the upper-mids. Compress the drum bus to glue the kit, around a 4:1 ratio with 6 dB of gain reduction.
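The compressor numbers imply a threshold setting. As a worked example, assuming an idealized hard-knee compressor with no make-up gain, 6 dB of gain reduction at 4:1 means the drum peaks sit about 8 dB above the threshold. The function names are illustrative.

```python
def gain_reduction_db(over_db: float, ratio: float) -> float:
    """Gain reduction for a signal over_db above threshold (hard knee)."""
    if over_db <= 0:
        return 0.0  # below threshold the compressor does nothing
    # Output rises only over_db / ratio above threshold; the rest is removed.
    return over_db * (1.0 - 1.0 / ratio)


def overshoot_for_reduction(reduction_db: float, ratio: float) -> float:
    """How far above threshold peaks must be to get the given reduction."""
    return reduction_db / (1.0 - 1.0 / ratio)


over = overshoot_for_reduction(reduction_db=6.0, ratio=4.0)
print(f"set the threshold about {over:.1f} dB below the drum peaks")
```

In practice you lower the threshold until the meter shows the reduction you want; the arithmetic is mainly useful as a sanity check on what the meter is telling you.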
For mastering, the cheapest path is the open-source matchering library, which matches your track's loudness and tonal balance to a reference. For broadcast or streaming targets, route through iZotope Ozone or a similar full-featured mastering suite to hit -14 LUFS integrated, the loudness level most streaming platforms normalize to. Export a 24-bit WAV master plus a 16-bit 44.1 kHz file for distribution.
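The matchering route can be scripted in a few lines. This sketch assumes `pip install matchering` and a reference track of your choosing; the wrapper functions and the output naming are hypothetical, while `mg.process`, `mg.pcm16`, and `mg.pcm24` are the calls shown in the library's own README.

```python
from pathlib import Path


def master_paths(take: str, out_dir: str = "masters") -> dict[str, Path]:
    """Output names for the 24-bit master and the 16-bit distribution file."""
    stem = Path(take).stem
    return {
        "pcm24": Path(out_dir) / f"{stem}_master_24bit.wav",
        "pcm16": Path(out_dir) / f"{stem}_master_16bit.wav",
    }


def master(take: str, reference: str, out_dir: str = "masters") -> dict[str, Path]:
    """Match the take to the reference and write both bit depths."""
    import matchering as mg  # third-party; imported lazily so the helpers load without it

    paths = master_paths(take, out_dir)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    mg.process(
        target=take,
        reference=reference,
        results=[mg.pcm24(str(paths["pcm24"])), mg.pcm16(str(paths["pcm16"]))],
    )
    return paths


print(master_paths("arrangement.wav"))
```

Because matchering follows the reference rather than a numeric loudness target, measure the result with a LUFS meter before uploading if a specific integrated level matters.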
Troubleshooting
Model ignores the genre tag. Try the era or production reference instead. "Late-90s trip-hop production" is a more specific signal than "trip-hop" alone, and it points to material the model has likely seen with consistent acoustic features.

Output is too short. Check the inference config. The Medium model defaults to a 47-second window in some forks. The full 6-minute, 20-second cap requires explicitly setting the length parameter in your inference script or Gradio UI.
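Some inference scripts express length in samples rather than seconds, so it helps to know the numbers. This sketch only does the conversion at the 44.1 kHz output rate; whether your fork takes seconds or samples, and what the setting is called, is something to check in its config.

```python
SAMPLE_RATE = 44_100          # output sample rate in Hz
MAX_SECONDS = 6 * 60 + 20     # the 6-minute, 20-second cap


def seconds_to_samples(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration to a whole number of samples per channel."""
    return int(round(seconds * sample_rate))


print("full length:", MAX_SECONDS, "s =", seconds_to_samples(MAX_SECONDS), "samples")
print("47 s window:", seconds_to_samples(47), "samples")
```

If a render stops near the 47-second mark, the default window is still in effect and the length setting was not picked up.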
Drums are muddy after stem separation. Demucs occasionally bleeds kick energy into the "other" stem. Re-run with the htdemucs_ft fine-tuned model, or accept the bleed and high-pass the other stem instead.
Commercial use blocked. The Stability AI Community License covers research and non-commercial use of Medium and Small weights. For commercial release, get a license at stability.ai/license. Open-source projects under matching license terms can use Medium and Small directly.
What to Try Next
Layer two Stable Audio 3 renders at the same tempo and key for fuller arrangements. Use the inpainting capability to regenerate just the bridge while keeping the verse and chorus intact. Generate sound effects with the Small SFX variant and weave them into transitions. Read our coverage of the Stable Audio 3 open-weights launch for the model architecture details, and check the 15,000-sample free pack generated entirely with Stable Audio 3 for production-ready drum hits and instrument one-shots you can drop into this workflow.
FAQ
Can Stable Audio 3 generate vocals?
No. Stable Audio 3 generates instrumental music and sound effects only. The training data and conditioning are oriented toward music and audio rather than singing. For vocals, layer in a separate vocal model or your own recording during the DAW arrangement step.
What hardware do I need to run Stable Audio 3 Medium locally?
A MacBook Pro with an M4 chip generates in a few seconds. NVIDIA GPUs from the RTX 4090 and up handle inference comfortably. Older 12 GB GPUs work but will be slower. The Medium model is 2B parameters stored as F32 tensors, so the weights alone take roughly 8 GB at full precision, or about 4 GB when loaded in half precision. Leave headroom on top of that for activations.
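The memory estimate is parameter count times bytes per parameter. The sketch below covers weights only; activations, the autoencoder, and framework overhead come on top and are not estimated here.

```python
def weights_gb(params: float, bytes_per_param: int) -> float:
    """Memory for model weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


PARAMS = 2e9  # Medium model, 2B parameters

print(f"F32 weights:  {weights_gb(PARAMS, 4):.1f} GB")
print(f"F16 weights:  {weights_gb(PARAMS, 2):.1f} GB")
```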
Is the output safe to release commercially?
The Medium and Small models ship under the Stability AI Community License. Personal and non-commercial release is allowed. Commercial release requires a separate license from stability.ai/license. The Large variant is enterprise-only from launch. Always confirm the current license terms before distribution.
How does Stable Audio 3 compare to Suno or Udio for producers?
Suno and Udio focus on full-song generation with sung lyrics, optimized for finished tracks rather than producer raw material. Stable Audio 3 is instrumental-only with open weights and inpainting, which makes it a better fit for producers who want to own the stems and arrange themselves. For a finished song with vocals, Suno or Udio remains the faster path.
Can I run this workflow without a paid DAW?
Yes. Audacity handles the editing steps for free. The free tier of BandLab covers multitrack arrangement and basic mixing in the browser. Demucs and matchering are both open source. The complete workflow can run on free software end to end, as long as you accept a less polished editing experience than Ableton or FL Studio.