Baidu's ERNIE Team has open-sourced NAVA, a 6.3-billion-parameter joint audio-video generation model that produces 720p video with synchronized dual-channel audio up to one minute long. The release includes full inference code under Apache 2.0 and a public checkpoint on Hugging Face, landing May 29 with the companion paper "Native Audio-Visual Alignment for Generation."
What this enables
If you have an 8-GPU box with at least 42 GB of VRAM per card, you can clone the repo, download the checkpoint, and generate end-to-end video with native audio in a single pass. Standard mode runs at roughly one second per step on 8 H100s; T5 offload mode lowers per-GPU VRAM use to 48 GB at the same speed, and group offload mode brings it down to 42 GB at the cost of roughly 3.5 seconds per step. With the included Gradio demo and batch scripts, you can prototype dialogue scenes, scored cinematics, or timbre-controlled narration without wiring up a separate TTS or Foley pipeline.
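To make the memory-versus-speed trade-off concrete, here is a minimal Python sketch that encodes only the figures quoted above. The mode labels (`standard`, `t5_offload`, `group_offload`), the helper names, and the 50-step count are illustrative assumptions, not flags or defaults from the NAVA repo. Standard mode's VRAM requirement is not given here, so the helper never selects it.

```python
# Per-GPU VRAM and per-step latency as quoted in the article (8x H100).
# The mode labels are hypothetical names, not actual CLI options.
# Standard mode's VRAM figure is not stated, so it is left as None.
MODES = {
    "standard": {"vram_gb": None, "sec_per_step": 1.0},
    "t5_offload": {"vram_gb": 48, "sec_per_step": 1.0},
    "group_offload": {"vram_gb": 42, "sec_per_step": 3.5},
}


def pick_mode(vram_gb_per_gpu):
    """Return the fastest mode whose quoted VRAM figure fits, or None."""
    fitting = [
        (spec["sec_per_step"], name)
        for name, spec in MODES.items()
        if spec["vram_gb"] is not None and spec["vram_gb"] <= vram_gb_per_gpu
    ]
    return min(fitting)[1] if fitting else None


def estimated_seconds(mode, num_steps):
    """Rough denoising time: quoted seconds per step times step count."""
    return MODES[mode]["sec_per_step"] * num_steps


if __name__ == "__main__":
    num_steps = 50  # hypothetical step count, chosen only for illustration
    for vram in (80, 44, 40):
        mode = pick_mode(vram)
        if mode is None:
            print(f"{vram} GB per GPU: below the quoted 42 GB minimum")
        else:
            secs = estimated_seconds(mode, num_steps)
            print(f"{vram} GB per GPU: {mode}, about {secs:.0f} s for {num_steps} steps")
```

At the assumed 50 steps, the quoted rates work out to about 50 seconds of denoising with T5 offload and about 175 seconds with group offload, which is the cost of saving the extra 6 GB per card.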
Why it matters
Most open joint audio-video work has split into two camps: dual-tower designs that align audio and video after the fact, or fully unified tri-modal stacks that mix text, audio, and video in one shared space. The former weakens fine-grained co-evolution; the latter couples high-level semantics with low-level synchronization. NAVA's "Align-then-Fuse MMDiT" architecture establishes audio-video correspondence in a dedicated interaction space first, then uses external context to condition joint denoising. According to the Verse-Bench and Seed-TTS results reported in the paper, the 6.3B-parameter model delivers better video quality and tighter A/V sync than larger unified baselines.
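The ordering described above (align audio and video first, then bring in external context) can be illustrated with a toy NumPy sketch. This is not NAVA's implementation: the function names, the single-head attention without learned projections, the residual connections, and the token counts are all assumptions made purely to show the order of operations.

```python
import numpy as np


def attention(q, k, v):
    """Plain scaled dot-product attention (single head, no learned weights)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def align_then_fuse(video, audio, context):
    """Toy two-stage block: A/V tokens interact first, then read context."""
    n_video = video.shape[0]
    # Align: audio and video tokens attend to each other only,
    # standing in for the dedicated interaction space.
    av = np.concatenate([video, audio], axis=0)
    av = av + attention(av, av, av)
    # Fuse: the aligned tokens are then conditioned on external context
    # (for example, prompt embeddings) via cross-attention.
    av = av + attention(av, context, context)
    return av[:n_video], av[n_video:]


rng = np.random.default_rng(0)
video = rng.normal(size=(6, 16))    # 6 hypothetical video tokens
audio = rng.normal(size=(4, 16))    # 4 hypothetical audio tokens
context = rng.normal(size=(3, 16))  # 3 hypothetical context tokens
v_out, a_out = align_then_fuse(video, audio, context)
print(v_out.shape, a_out.shape)
```

The point of the sketch is the contrast with the two camps: a dual-tower design would skip the shared first step, while a fully unified stack would concatenate the context tokens into that first step instead of applying them afterward.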
Key details
NAVA outputs landscape, portrait, and square video up to one minute, with dual-channel audio, multi-timbre speech control via "Timbre-in-Context Conditioning," and language-described camera control. Sample outputs on the project page include conversational dialogue with on-screen lip sync, ambient scene audio, and a reference-timbre demo where the same speech span is regenerated in different voices. The training pipeline relies on high-quality dense captions, so the repo ships with an integrated LLM-based prompt rewriter that significantly improves output quality for short user prompts.
What to do next
Pull the repo, run scripts/inference.sh with the default config, and start with the included demo prompts before swapping in your own. If you don't have 8 H100s sitting around, the Hugging Face papers page is the cleanest way to track community-contributed quantized checkpoints and reduced-step samplers, both of which the open-source ecosystem typically ships within days of an Apache-licensed audio-video release.