Sony AI has open-sourced Woosh, a sound effects foundation model that generates audio from text prompts and video input. The release includes inference code, model weights, and distilled versions for running on consumer hardware.
What Happened
Sony AI published Woosh on April 2 via arXiv, along with open-source inference code and pre-trained weights. The model is built as a complete audio generation pipeline: a high-quality audio encoder/decoder, a text-audio alignment model for conditioning, and two generation modes for text-to-audio and video-to-audio synthesis.
The distilled variants reduce computational requirements while maintaining generation quality, making deployment feasible on machines without data center hardware.
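The pipeline described above can be pictured as three stages: encode the conditioning text, generate a latent sequence, and decode that sequence into a waveform. The sketch below illustrates only this structure; every component here is a placeholder stub, and the names, shapes, and upsampling factor are assumptions, not Woosh's actual API or architecture.

```python
import numpy as np

# Illustrative latent audio-generation pipeline: text conditioning ->
# latent frames -> waveform. All components are stand-in stubs; the real
# Woosh modules and dimensions will differ.

LATENT_DIM = 64            # hypothetical latent width
SAMPLE_RATE = 16_000       # hypothetical output sample rate
SAMPLES_PER_LATENT = 320   # hypothetical decoder upsampling factor

def embed_text(prompt: str) -> np.ndarray:
    """Stand-in for the text-audio alignment model's text encoder."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal(LATENT_DIM)

def generate_latents(cond: np.ndarray, n_frames: int) -> np.ndarray:
    """Stand-in for the text-to-audio generator conditioned on the text."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_frames, LATENT_DIM)) + cond

def decode_audio(latents: np.ndarray) -> np.ndarray:
    """Stand-in for the latent decoder: latent frames -> audio samples."""
    n_samples = latents.shape[0] * SAMPLES_PER_LATENT
    t = np.linspace(0.0, 1.0, n_samples, endpoint=False)
    return 0.1 * np.sin(2 * np.pi * 440.0 * t)  # dummy waveform

def text_to_audio(prompt: str, seconds: float) -> np.ndarray:
    n_frames = int(seconds * SAMPLE_RATE / SAMPLES_PER_LATENT)
    latents = generate_latents(embed_text(prompt), n_frames)
    return decode_audio(latents)

wave = text_to_audio("glass shattering on concrete", seconds=2.0)
print(wave.shape)  # (32000,): 2 s at 16 kHz
```

The point of the stub is the data flow, not the math: distillation typically replaces the generator stage with a cheaper model while keeping the encoder/decoder interface fixed, which is why the lightweight variants can drop in on consumer hardware.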
Why It Matters
Sound effects have lagged other audio categories in open-source AI support; most creators still rely on sample libraries or paid services. Woosh changes that by offering both text-to-audio (describe the sound you want) and video-to-audio (let the model watch your clip and generate matching effects) in a single open pipeline.
The video-to-audio capability is particularly useful for video editors and game developers who need synchronized sound design. Instead of manually layering effects, you feed the model a video clip and it generates contextually appropriate audio.
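Synchronized sound design ultimately means the generated audio must cover the clip exactly. The helper below is generic timing arithmetic for any video-to-audio workflow, not part of Woosh; its name and defaults are illustrative.

```python
def samples_for_clip(n_frames: int, fps: float, sample_rate: int = 48_000) -> int:
    """Number of audio samples needed to exactly cover a video clip.

    Hypothetical helper: clip duration in seconds is n_frames / fps,
    and the audio length is that duration times the sample rate.
    """
    duration_s = n_frames / fps
    return round(duration_s * sample_rate)

# A 120-frame clip at 24 fps is 5 s, so it needs 240,000 samples at 48 kHz.
print(samples_for_clip(120, fps=24))  # 240000
```

In practice a video-to-audio model would also condition on the frames themselves, but length-matching like this is the minimum contract between the generated track and the clip.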
Sony AI's evaluation shows Woosh performs competitively with or better than existing open alternatives like StableAudio-Open and TangoFlux across its benchmark suite.
Key Details
- Publisher: Sony AI
- Components: Audio encoder/decoder, text-audio alignment model, text-to-audio generator, video-to-audio generator
- Distilled models: Lightweight variants included for resource-constrained environments
- Benchmarks: Competitive with or better than StableAudio-Open and TangoFlux
- License: Open source (inference code and model weights available)
- Use cases: Sound design, video post-production, game audio, content creation
What to Do Next
Check the Woosh paper for architecture details and benchmark comparisons. The inference code and weights are available through the project's GitHub repository linked in the paper. If you work in video production or game development, the video-to-audio mode is worth testing against your current sound design workflow.