ByteDance Open-Sources Bernini-R Video Edit Model

ByteDance released the inference code and weights for Bernini-R, an open-weights video generation and editing model, on June 1, 2026. The release ships under Apache 2.0 and covers text-to-image, image-to-image, text-to-video, and several video editing modes in a single stack.

Bernini is a unified framework that pairs a semantic planner, built on a multimodal language model, with a diffusion transformer renderer. ByteDance claims first-tier performance against closed commercial models in pairwise human evaluations on video editing tasks.

Try It: Local Video Editing on a Hopper Box

If you already have a Hopper-class GPU (H100, H200, or H800), Bernini-R is a clean drop-in for editing-focused workflows that previously required calling the Runway, Pika, or Kling APIs. Clone the Bernini source repository from GitHub, set up PyTorch 2.5.1 with CUDA 12.4, and run the reference-guided editing pipeline on a clip plus a still reference to retarget motion, swap subjects, or extend scenes. Multi-GPU setups (8x H100 in ByteDance's examples) use Ulysses sequence parallelism to reach higher resolutions. Non-Hopper CUDA GPUs fall back to FlashAttention-2 or PyTorch SDPA, so a single A100 or 4090 can run smaller test renders before you commit to a Hopper rental.
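The fallback order described above can be sketched as a small decision function. This is an illustration only, not code from the Bernini repository: the function names and the returned labels ("flash-attn-3", "flash-attn-2", "sdpa") are hypothetical, and the sketch assumes that Hopper corresponds to CUDA compute capability 9.x and that FlashAttention-2 needs Ampere (8.x) or newer.

```python
from typing import Optional, Tuple


def pick_attention_backend(
    compute_capability: Tuple[int, int],
    has_flash_attn_3: bool = False,
    has_flash_attn_2: bool = False,
) -> str:
    """Return a label for the attention backend to try first.

    The labels are hypothetical names for this sketch, not flags
    accepted by the Bernini inference code.
    """
    major, _minor = compute_capability
    # Assumption: Hopper GPUs (H100, H200, H800) report capability 9.x.
    if major == 9 and has_flash_attn_3:
        return "flash-attn-3"
    # Assumption: FlashAttention-2 requires Ampere (8.x) or newer.
    if major >= 8 and has_flash_attn_2:
        return "flash-attn-2"
    # PyTorch's built-in scaled dot-product attention is the last resort.
    return "sdpa"


def detect_compute_capability() -> Optional[Tuple[int, int]]:
    """Read the capability of GPU 0, or None without PyTorch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    capability = detect_compute_capability()
    if capability is None:
        print("No CUDA device visible to PyTorch")
    else:
        print(capability, pick_attention_backend(capability))
```

On an A100 (capability 8.0) or a 4090 (capability 8.9) with FlashAttention-2 installed, the function picks the FlashAttention-2 path; with neither FlashAttention build present it falls through to SDPA. Check the repository's README for how the real pipeline selects its backend.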

Why It Matters

Open-weights video editing has lagged badly behind image and text generation. Until recently the only credible Apache 2.0 options for video-to-video work were Wan2.2 and Cosmos, both biased more toward generation than precision editing. Bernini-R targets editing directly, and the companion paper "Bernini: Latent Semantic Planning for Video Diffusion" on arXiv describes a Segment-Aware 3D Rotary Positional Embedding scheme purpose-built for the editing case. For creators priced out of Runway Aleph or stuck on the Adobe Premiere Generative Extend wait list, a self-hostable editing backbone changes the math on shot fixes, B-roll generation, and motion retargeting.

Key Details

Bernini-R builds on the Wan2.2-T2V-A14B components for the VAE and UMT5 text encoding, and uses Qwen2.5-VL-7B-Instruct as the semantic planner. The renderer is a diffusion transformer. The recommended inference stack is FlashAttention-3 on Hopper with Python 3.11.2 and CUDA Toolkit 12.4. The release includes both the model weights and the inference code; training code and dataset details are not part of this drop.
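Before downloading the weights, it can help to confirm that a local environment matches the versions named in this article. The following is a minimal sketch, not an official requirements check: the `EXPECTED` table and both function names are hypothetical, the PyTorch version comes from the setup step earlier in the article, and matching Python at the 3.11 minor level rather than the exact 3.11.2 patch release is an assumption.

```python
import platform
from typing import List, Optional, Tuple

# Versions named in the article; matching Python only at the
# minor level (3.11.x) is an assumption of this sketch.
EXPECTED = {"python": "3.11", "torch": "2.5.1", "cuda": "12.4"}


def check_stack(
    python: str, torch: Optional[str], cuda: Optional[str]
) -> List[str]:
    """Return human-readable mismatches; an empty list means all match."""
    problems = []
    if not python.startswith(EXPECTED["python"] + "."):
        problems.append(f"Python {python}, expected {EXPECTED['python']}.x")
    if torch is None:
        problems.append("PyTorch is not installed")
    # PyTorch builds often carry a local suffix such as "+cu124".
    elif torch.split("+")[0] != EXPECTED["torch"]:
        problems.append(f"PyTorch {torch}, expected {EXPECTED['torch']}")
    if cuda is None:
        problems.append("No CUDA build of PyTorch detected")
    elif cuda != EXPECTED["cuda"]:
        problems.append(f"CUDA {cuda}, expected {EXPECTED['cuda']}")
    return problems


def current_stack() -> Tuple[str, Optional[str], Optional[str]]:
    """Collect installed versions; PyTorch fields are None if it is absent."""
    try:
        import torch
    except ImportError:
        return platform.python_version(), None, None
    # torch.version.cuda is None on CPU-only builds.
    return platform.python_version(), torch.__version__, torch.version.cuda


if __name__ == "__main__":
    issues = check_stack(*current_stack())
    print("\n".join(issues) if issues else "Environment matches the article")
```

A mismatch is not necessarily fatal; treat the output as a prompt to check the repository's own requirements file rather than as a hard gate.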

What to Do Next

If video editing is core to your workflow, clone the GitHub repo this week and run the reference-to-video example against a 5-second test clip to benchmark quality against your current API spend. If you do not own Hopper hardware, watch the ComfyUI and Diffusers communities for quantized GGUF or FP8 ports, which historically land within a week or two of a major ByteDance release. Editors evaluating NVIDIA Cosmos open models should add Bernini-R to the same comparison shortlist.
