Netflix has released VOID (Video Object and Interaction Deletion), its first public AI model. The open-source tool removes objects from video while preserving physically plausible interactions, solving a problem that existing video inpainting methods handle poorly.
What Happened
Netflix published the VOID model on HuggingFace along with an accompanying research paper and full source code on GitHub. The model is built on CogVideoX-Fun-V1.5-5b and fine-tuned for video inpainting with a novel quadmask conditioning system.
VOID processes video at 384x672 resolution and handles up to 197 frames. It uses a two-pass system: Pass 1 runs base inpainting, while the optional Pass 2 adds optical flow-warped noise for better temporal consistency on longer clips.
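Flow-warped noise presumably means carrying each frame's noise sample along the estimated motion field so that corresponding pixels are denoised from correlated noise. The paper's exact procedure isn't reproduced here; a toy sketch of warping a noise map by a flow field, using nearest-neighbor backward warping, might look like:

```python
import numpy as np

def warp_noise(noise, flow):
    """Warp a 2-D noise map by a per-pixel flow field (dy, dx).

    Nearest-neighbor backward warping: each output pixel samples the
    noise value the flow says it came from; out-of-bounds samples are
    clamped to the border. Real pipelines use subpixel flow and proper
    resampling - this is only an illustration of the idea.
    """
    h, w = noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys - flow[..., 0].round().astype(int), 0, h - 1)
    src_x = np.clip(xs - flow[..., 1].round().astype(int), 0, w - 1)
    return noise[src_y, src_x]

rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 8))
flow = np.zeros((8, 8, 2))
flow[..., 1] = 1.0  # uniform one-pixel shift to the right
warped = warp_noise(noise, flow)
```

Reusing warped noise across frames keeps the denoiser's per-pixel inputs consistent under motion, which is why it helps temporal stability on longer clips.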
Why It Matters
Current video object removal tools can erase objects and fix appearance artifacts such as shadows and reflections. But when the removed object was physically interacting with other elements, for example a hand holding an item or one object pushing another, existing models produce implausible results. VOID addresses this directly.
The model uses a quadmask that encodes four distinct regions: the object to remove, overlap zones, affected regions where physics will change (objects that should fall or shift), and the background to keep. A vision-language model identifies these regions automatically during inference.
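The precise quadmask format isn't specified in this summary; a minimal sketch of combining the four regions into a single label map, assuming integer labels and an object-over-overlap-over-affected precedence (both assumptions, not the documented encoding), might look like:

```python
import numpy as np

# Hypothetical label values; the actual VOID encoding is not specified here.
BACKGROUND, OBJECT, OVERLAP, AFFECTED = 0, 1, 2, 3

def build_quadmask(object_mask, overlap_mask, affected_mask):
    """Combine three boolean masks into one uint8 label map.

    Later assignments take precedence, so object wins over overlap,
    which wins over affected; pixels in no mask stay background.
    """
    quadmask = np.full(object_mask.shape, BACKGROUND, dtype=np.uint8)
    quadmask[affected_mask] = AFFECTED
    quadmask[overlap_mask] = OVERLAP
    quadmask[object_mask] = OBJECT
    return quadmask

h, w = 4, 4
obj = np.zeros((h, w), dtype=bool); obj[0, 0] = True
ovl = np.zeros((h, w), dtype=bool); ovl[1, 1] = True
aff = np.zeros((h, w), dtype=bool); aff[2, 2] = True
quadmask = build_quadmask(obj, ovl, aff)
```

In the actual pipeline these regions come from the vision-language model at inference time rather than being drawn by hand.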
Key Details
- Architecture: 3D Transformer based on CogVideoX-Fun-V1.5-5b-InP (5 billion parameters)
- Training data: Paired counterfactual videos from HUMOTO (human-object interactions via Blender physics simulation) and Kubric (object-only interactions)
- Infrastructure: Trained on 8x A100 80GB GPUs with DeepSpeed ZeRO Stage 2
- Precision: BF16 with FP8 quantization support
- License: open source, with both weights and code publicly available
An interactive Gradio demo is available on HuggingFace Spaces for testing without local setup.
What to Do Next
Video editors and VFX artists can try the VOID project page demos to evaluate the model against their workflows. The full pipeline requires a GPU with at least 40GB VRAM for inference, though FP8 quantization can reduce memory requirements. Studios already using CogVideoX-based pipelines can integrate VOID directly.
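A back-of-envelope check of why FP8 helps: for a 5-billion-parameter model, weights alone account for roughly 10 GB at BF16 (2 bytes per parameter) versus roughly 5 GB at FP8 (1 byte per parameter). Activations, latents, and the VAE add substantially more, which is why the full-precision pipeline still wants 40GB-class GPUs.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int) -> float:
    """Approximate weight-only memory in GB (decimal, weights only -
    activations and intermediate latents are not included)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

bf16_gb = weight_memory_gb(5, 2)  # BF16: 2 bytes/param -> 10.0 GB
fp8_gb = weight_memory_gb(5, 1)   # FP8:  1 byte/param  ->  5.0 GB
```

This is only the floor for model weights; actual peak usage depends on resolution, frame count, and attention implementation.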