ByteDance released the inference code and weights for Bernini-R, an open-weights video generation and editing model, on June 1, 2026. The release ships under Apache 2.0 and covers text-to-image, image-to-image, text-to-video, and several video editing modes in a single stack.
Bernini is a unified framework that pairs a semantic planner, built on a multimodal language model, with a diffusion transformer renderer. ByteDance claims that pairwise human evaluations put it in the first tier on video editing tasks, alongside closed commercial models.
Try It: Local Video Editing on a Hopper Box
If you already have a Hopper-class GPU (H100, H200, or H800), Bernini-R can stand in for editing-focused workflows that previously required calling the Runway, Pika, or Kling APIs. Clone the Bernini source repository from GitHub, set up PyTorch 2.5.1 with CUDA 12.4, and run the reference-guided editing pipeline on a clip plus a still reference image to retarget motion, swap subjects, or extend scenes. Multi-GPU setups (8x H100 in ByteDance's examples) use Ulysses sequence parallelism to reach higher resolutions. Non-Hopper CUDA GPUs fall back to FlashAttention-2 or PyTorch SDPA, so a single A100 or RTX 4090 can run smaller test renders before you commit to a Hopper rental.
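The fallback order described above can be sketched as a small helper. This is an illustration only, not code from the Bernini repository: the function names and the returned labels are hypothetical, and the module names `flash_attn_interface` (FlashAttention-3) and `flash_attn` (FlashAttention-2) are assumptions about how those packages are installed on your machine. The one hard fact it relies on is that Hopper GPUs report CUDA compute capability 9.0, while the A100 reports 8.0 and the RTX 4090 reports 8.9.

```python
from importlib.util import find_spec


def pick_attention_backend(capability, has_fa3, has_fa2):
    """Return a label for the attention kernel to try first.

    capability: (major, minor) CUDA compute capability, e.g. (9, 0) for H100.
    has_fa3 / has_fa2: whether the FlashAttention-3 / -2 packages are importable.
    The returned strings are hypothetical labels, not Bernini config values.
    """
    major, _minor = capability
    if major == 9 and has_fa3:
        return "flash_attn_3"  # Hopper with the FA3 kernels installed
    if major >= 8 and has_fa2:
        return "flash_attn_2"  # Ampere or newer with FA2 installed
    return "sdpa"  # PyTorch's built-in scaled_dot_product_attention


def detect_backend():
    """Probe the local machine; returns None without PyTorch or a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return pick_attention_backend(
        torch.cuda.get_device_capability(),
        # Assumed module names; adjust to match your installation.
        has_fa3=find_spec("flash_attn_interface") is not None,
        has_fa2=find_spec("flash_attn") is not None,
    )
```

Calling `detect_backend()` on a rental box before downloading weights is a cheap way to confirm which path you would actually exercise.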
Why It Matters
Open-weights video editing has lagged badly behind image and text generation. Until recently the only credible openly licensed options for video-to-video work were Wan2.2 and Cosmos, both geared more toward generation than precision editing. Bernini-R targets editing directly, and the companion paper "Bernini: Latent Semantic Planning for Video Diffusion" on arXiv describes a Segment-Aware 3D Rotary Positional Embedding scheme purpose-built for the editing case. For creators priced out of Runway Aleph or stuck on the Adobe Premiere Generative Extend wait list, a self-hostable editing backbone changes the math on shot fixes, B-roll generation, and motion retargeting.
Key Details
Bernini-R reuses the VAE and UMT5 text encoder from Wan2.2-T2V-A14B and uses Qwen2.5-VL-7B-Instruct as the semantic planner. The renderer is a diffusion transformer. The recommended inference stack is FlashAttention-3 on Hopper with Python 3.11.2 and CUDA Toolkit 12.4. The release includes both the model weights and the inference code; training code and dataset details are not part of this drop.
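A quick preflight check can catch version drift before a long weight download. The sketch below is not part of the Bernini release: the function names and the `RECOMMENDED` table are hypothetical, filled in with the versions quoted in this article (Python 3.11.2, PyTorch 2.5.1, CUDA 12.4). It compares strictly, so treat a mismatch as a prompt to check the repository's own requirements rather than as a hard failure.

```python
import platform
import re

# Versions as quoted in this article; confirm against the repository README.
RECOMMENDED = {"python": "3.11.2", "torch": "2.5.1", "cuda": "12.4"}


def _release(version, parts):
    """Leading numeric components, e.g. '2.5.1+cu124' -> (2, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(p) for p in match.group(0).split(".")[:parts])


def stack_mismatches(found):
    """Return human-readable notes for each component that differs."""
    problems = []
    for name, want in RECOMMENDED.items():
        have = found.get(name)
        if have is None:
            problems.append(f"{name}: not found (article lists {want})")
            continue
        parts = len(want.split("."))
        if _release(have, parts) != _release(want, parts):
            problems.append(f"{name}: found {have}, article lists {want}")
    return problems


def collect_versions():
    """Gather local versions; the torch entries are skipped if it is missing."""
    found = {"python": platform.python_version()}
    try:
        import torch
        found["torch"] = torch.__version__
        found["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        pass
    return found
```

Running `stack_mismatches(collect_versions())` returns an empty list when the local environment matches the quoted versions.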
What to Do Next
If video editing is core to your workflow, clone the GitHub repo this week and run the reference-to-video example against a 5-second test clip to benchmark quality against your current API spend. If you do not own Hopper hardware, watch the ComfyUI and Diffusers communities for quantized GGUF or FP8 ports, which historically land within a week or two of a major ByteDance release. Editors evaluating NVIDIA Cosmos open models should add Bernini-R to the same comparison shortlist.