Three significant open-source models arrived in ComfyUI on May 14, 2026: VOID for video object deletion, BiRefNet for high-precision image segmentation, and Gemma 4 for multimodal reasoning. All three are free, locally runnable, and available immediately through official Comfy-Org packages on Hugging Face.

These are not minor additions. VOID removes objects from video along with the physical traces they leave behind: shadows, reflections, and motion effects that conventional inpainting ignores. BiRefNet is already the segmentation backbone behind dozens of popular ComfyUI background removal nodes, and the native integration standardizes its installation. Gemma 4 brings Google DeepMind's multimodal model directly into ComfyUI workflows for the first time, enabling text, image, and audio reasoning inside your node graph without leaving the application.

ComfyUI workflow showing VOID video inpainting, BiRefNet segmentation, and Gemma 4 nodes

VOID: Netflix's Video Object Deletion Model

VOID (Video Object and Interaction Deletion) was developed at Netflix Research to address one of the hardest problems in video editing: removing an object without leaving behind the physical evidence it existed. Standard inpainting fills the erased region, but a walking person leaves a shadow on the floor, a reflection in a window, and motion blur on nearby objects. VOID accounts for all of these.

The model uses a two-pass diffusion architecture built around a quadmask system. The quadmask is a greyscale image with four distinct values that tell VOID exactly what to do with each region:

  • Remove zone: pixels belonging to the object being deleted
  • Overlap zone: regions where the removed subject intersects with other elements
  • Physics zone: areas affected by shadows, reflections, or motion caused by the removed object
  • Untouched zone: everything that should remain as-is
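
As a rough illustration of how four zones can live in one greyscale image, the sketch below merges two binary masks (the object itself, and the area it affects) into a single four-level mask. The grey values, the function name, and the rule that "overlap" means both masks are set are assumptions made for this example; check the VOID model card for the encoding the model actually expects.

```python
# Grey levels are illustrative assumptions, not values confirmed by this article.
REMOVE, OVERLAP, PHYSICS, UNTOUCHED = 0, 63, 127, 255


def build_quadmask(object_mask, affected_mask):
    """Merge two same-sized binary masks (rows of 0/1) into one four-level mask."""
    if len(object_mask) != len(affected_mask):
        raise ValueError("masks must have the same height")
    quad = []
    for obj_row, aff_row in zip(object_mask, affected_mask):
        if len(obj_row) != len(aff_row):
            raise ValueError("masks must have the same width")
        row = []
        for obj, aff in zip(obj_row, aff_row):
            if obj and aff:
                row.append(OVERLAP)    # object pixel that also touches an affected area
            elif obj:
                row.append(REMOVE)     # object pixel only
            elif aff:
                row.append(PHYSICS)    # shadow, reflection, or motion caused by the object
            else:
                row.append(UNTOUCHED)  # leave as-is
        quad.append(row)
    return quad


# Tiny 2x4 example: the object covers the left half, its shadow the middle columns.
object_mask = [[1, 1, 0, 0], [1, 1, 0, 0]]
affected_mask = [[0, 1, 1, 0], [0, 1, 1, 0]]
quadmask = build_quadmask(object_mask, affected_mask)
```

In a real workflow the two input masks would come from a segmentation node rather than hand-written lists, and the result would be saved as a greyscale frame per video frame.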

Pass 1 handles the initial region inpainting using a diffusion model. Pass 2 adds optical flow refinement using the RAFT algorithm to maintain temporal consistency across frames. Because the second pass ties each frame to its neighbors, the filled region stays stable over time instead of flickering, which frame-by-frame inpainting cannot guarantee.

The official VOID model package on Hugging Face includes five files: void_pass1.safetensors and void_pass2.safetensors for the diffusion passes, the RAFT optical flow model, a T5 text encoder, and a CogVideoX VAE. The full file structure is documented in the repository. For mask creation, pair VOID with SAM3 or any ComfyUI segmentation node to generate the quadmask from your video input.

VOID quadmask zones showing remove, overlap, physics, and untouched regions on a video frame

BiRefNet: High-Precision Image and Background Segmentation

BiRefNet (Bilateral Reference Network) is described by Comfy-Org as one of the most widely used image segmentation backbones in open-source ecosystems. It already powers popular ComfyUI background removal nodes including ComfyUI-RMBG. The native integration provides a standardized installation path that does not require a separate custom node.

BiRefNet handles three distinct segmentation tasks in a single lightweight model:

  • Dichotomous image segmentation: clean foreground-background separation on complex scenes with precise edge boundaries
  • Salient object detection: identifies and isolates visually prominent objects from their surroundings
  • Camouflaged object detection: finds subjects that naturally blend into their backgrounds, such as animals in foliage or patterned subjects against similar textures

The model preserves fine edge detail including hair strands, fur texture, semi-transparent materials, and thin structures that confuse most segmentation approaches. This makes it practical for product photography cleanup, portrait compositing, green-screen replacement, and object isolation before feeding into image generators.
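
For compositing and green-screen replacement, a segmentation result is typically used as an alpha matte. The sketch below shows the standard blend, out = foreground × alpha + background × (1 − alpha), in plain Python. It assumes the matte holds values from 0.0 to 1.0 and that images are nested lists of RGB tuples; the function names are hypothetical and do not describe the output format of ComfyUI's BiRefNet nodes.

```python
def composite_pixel(fg, bg, alpha):
    """Blend one RGB pixel: alpha 1.0 keeps the foreground, 0.0 shows the background."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    return tuple(round(f * alpha + b * (1.0 - alpha)) for f, b in zip(fg, bg))


def composite(fg_img, bg_img, matte):
    """Blend two same-sized images (rows of RGB tuples) using a per-pixel matte."""
    return [
        [composite_pixel(f, b, a) for f, b, a in zip(fg_row, bg_row, matte_row)]
        for fg_row, bg_row, matte_row in zip(fg_img, bg_img, matte)
    ]


# A soft edge pixel (alpha 0.5) mixes subject and new background evenly.
edge = composite_pixel((200, 100, 50), (0, 0, 0), 0.5)
```

Fractional matte values are what keep hair strands and semi-transparent edges from looking cut out; a hard 0/1 mask would drop that detail.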

The BiRefNet source repository on GitHub contains the original research and model documentation. For ComfyUI, download birefnet.safetensors from the ZhengPeng7 Hugging Face repository and place it in ComfyUI/models/background_removal/. The model will appear in ComfyUI's background removal and segmentation nodes automatically after restart.

Gemma 4: Multimodal Reasoning Inside ComfyUI Workflows

Gemma 4 is Google DeepMind's multimodal open-weights model, now available natively in ComfyUI through Comfy-Org's packaging. The Comfy-Org/gemma-4 Hugging Face repository provides instruction-tuned variants formatted as safetensors files in BF16 and FP8 precision for direct use in ComfyUI.

Gemma 4 accepts text, image, audio, and video inputs and generates text output. It supports context windows from 128K to 256K tokens, which makes it practical for analyzing long creative briefs, multi-page style guides, or batch image description workflows. The model includes a configurable thinking mode that enables step-by-step reasoning for complex classification or analysis tasks.

According to the Gemma 4 announcement on Hugging Face, four model sizes are available:

  • E2B (approximately 2 billion effective parameters): designed for edge devices, runs on 6 to 8 GB VRAM
  • E4B (approximately 4 billion effective parameters): better reasoning accuracy, fits in 10 to 12 GB VRAM
  • 26B A4B (Mixture of Experts): high-throughput inference with mixture-of-experts architecture
  • 31B (dense model): maximum accuracy for dedicated inference machines
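
To get a feel for why the larger variants need dedicated hardware, here is a back-of-the-envelope estimate of weight memory alone. BF16 stores 2 bytes per parameter and FP8 stores 1. The helper name is hypothetical, and the figure excludes activations, the KV cache, and framework overhead, so real usage will be higher. The E2B and E4B numbers are effective parameter counts, so their file sizes may not follow this arithmetic directly.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gb(params_billions, precision):
    """Approximate weight size in decimal gigabytes (billions of bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


dense_31b_bf16 = weight_gb(31, "bf16")  # about 62 GB of weights
dense_31b_fp8 = weight_gb(31, "fp8")    # about 31 GB of weights
```

Halving the bytes per parameter halves the weight footprint, which is the main reason to pick an FP8 file when memory is tight.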

All variants are released under Apache 2.0, which permits commercial use, modification, and redistribution, subject to the license's attribution and notice requirements.

Gemma 4 E4B model processing image and text inputs inside a ComfyUI node graph

Step-by-Step: Installing All Three Models in ComfyUI

All three models follow the standard ComfyUI model placement convention described in the ComfyUI GitHub repository. None require custom nodes for the core functionality.

Installing VOID:

  1. Go to Comfy-Org/void-model on Hugging Face and download all five model files.
  2. Place void_pass1.safetensors and void_pass2.safetensors in ComfyUI/models/diffusion_models/
  3. Place raft_large_C_T_SKHT_V2-ff5fadd5.safetensors in ComfyUI/models/optical_flow/
  4. Place t5xxl_fp16.safetensors in ComfyUI/models/text_encoders/ and cogvideox_vae.safetensors in ComfyUI/models/vae/
  5. Import the VOID workflow template (.json) from the Comfy-Org repository and use SAM3 or another segmentation model to generate your quadmask from a video clip.

Installing BiRefNet:

  1. Download birefnet.safetensors from the ZhengPeng7/BiRefNet repository on Hugging Face.
  2. Place it in ComfyUI/models/background_removal/
  3. Restart ComfyUI. The model will appear automatically in background removal and segmentation node dropdowns.

Installing Gemma 4:

  1. Go to Comfy-Org/gemma-4 on Hugging Face and choose your variant. For most users with a single GPU, start with gemma4_e2b_it_bf16.safetensors (2B) for speed or gemma4_e4b_it_bf16.safetensors (4B) for better reasoning.
  2. Place the file in ComfyUI/models/text_encoders/
  3. Use the Gemma 4 node in ComfyUI to wire text prompts and image inputs into your workflow for captioning, analysis, or conditional generation.
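
After copying files by hand, a quick check that each one landed in the expected folder can save a restart. The script below is a minimal sketch: the function name and the example layout are illustrative, the layout covers only some of the files named in the steps above, and the ComfyUI root path is an assumption you should adjust to your install.

```python
from pathlib import Path


def missing_models(comfy_root, layout):
    """Return the expected model paths that do not exist under comfy_root/models."""
    root = Path(comfy_root)
    missing = []
    for subdir, filenames in layout.items():
        for name in filenames:
            path = root / "models" / subdir / name
            if not path.is_file():
                missing.append(path)
    return missing


# Illustrative subset of the files listed in the steps above.
EXAMPLE_LAYOUT = {
    "diffusion_models": ["void_pass1.safetensors", "void_pass2.safetensors"],
    "background_removal": ["birefnet.safetensors"],
    "text_encoders": ["gemma4_e4b_it_bf16.safetensors"],
}

if __name__ == "__main__":
    # "ComfyUI" is a placeholder for the path to your own installation.
    for path in missing_models("ComfyUI", EXAMPLE_LAYOUT):
        print("missing:", path)
```

An empty result means every listed file was found; anything printed is a file to re-download or move.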

What These Models Enable for Creators

Combined, VOID, BiRefNet, and Gemma 4 fill capability gaps that previously required separate paid services or complex workarounds.

VOID enables VFX-quality object removal for short-form video editors who previously needed Adobe Firefly Video, DaVinci Resolve's Magic Mask, or manual rotoscoping to achieve physically correct erasure. Running locally means no upload limits, no per-minute charges, and no file size caps. The physics-aware inpainting is particularly useful for removing branded products, logos, or unwanted background figures from footage.

BiRefNet's native integration eliminates the need to track and install a separate custom node just for clean segmentation. For creators already using ComfyUI for compositing workflows, background removal is now one fewer external dependency. The model's ability to detect camouflaged objects also opens up creative use cases beyond standard background removal, including style transfer on specific regions and targeted upscaling on isolated subjects.

Gemma 4 opens up prompt engineering and metadata generation directly inside the workflow. Use the E2B model to auto-generate captions for image batches, describe reference images before passing them to Flux or SDXL generators, or analyze style guides to extract relevant keywords for prompt construction. Context windows of 128K to 256K tokens make it practical for analyzing long creative documents that would overflow many other local models.

For the previous round of ComfyUI additions, including native Claude API, Grok, and OpenAI image generation nodes, see the ComfyUI 0.21.1 update coverage.

Frequently Asked Questions

What GPU does VOID require?

VOID works best on a GPU with at least 12 GB VRAM for the two-pass diffusion pipeline plus the RAFT optical flow model. CPU fallback is supported but will be significantly slower for video tasks longer than a few seconds. Check the Comfy-Org/void-model repository for minimum hardware benchmarks as the community documents them.

Can BiRefNet replace custom nodes like ComfyUI-RMBG?

BiRefNet is the underlying model in many custom background removal nodes including ComfyUI-RMBG. The native integration adds a standardized installation path that does not require installing a separate custom node. If you are already happy with your existing node, you can continue using it. The native path is simpler for new installations.

Which Gemma 4 variant should I start with?

Start with E2B (2B) if you have 8 GB VRAM or less, or if you need fast inference for batch workflows. The E4B (4B) model offers meaningfully better reasoning on complex image description and analysis tasks and fits on a 12 GB card. The 31B dense model is for dedicated inference machines where accuracy outweighs speed.
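
The guidance above can be written as a small helper. This is only a sketch of the rule of thumb in this article: the function name is hypothetical, the 10 GB cutoff comes from the VRAM ranges quoted earlier, and the 26B A4B and 31B variants are left out because no VRAM figures are given for them here.

```python
def suggest_gemma_variant(vram_gb):
    """Suggest a starting Gemma 4 variant from available VRAM in gigabytes."""
    if vram_gb < 10:
        # The article quotes 6 to 8 GB for E2B; below 6 GB it may not fit at all.
        return "E2B"
    # The article quotes 10 to 12 GB for E4B; larger cards can still start here.
    return "E4B"


choice_8gb = suggest_gemma_variant(8)
choice_12gb = suggest_gemma_variant(12)
```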

Are workflow templates included?

Yes. Comfy-Org's Hugging Face repositories for all three models include downloadable workflow template .json files that can be imported directly into ComfyUI from File > Load Workflow. These provide tested starting points for each model's core use case.
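
Before importing a downloaded template, it can help to see which node types it uses, so you know whether your install has them. The sketch below counts node types in a workflow file. It assumes the two common ComfyUI JSON layouts (the UI export with a "nodes" list, and the API format keyed by node id with "class_type"); the function names are hypothetical, and the sample data is made up for illustration.

```python
import json
from collections import Counter


def node_types(workflow):
    """Count node types in a ComfyUI workflow dict (UI export or API format)."""
    if isinstance(workflow.get("nodes"), list):
        # UI export: {"nodes": [{"id": 1, "type": "LoadImage", ...}, ...]}
        names = [node.get("type", "?") for node in workflow["nodes"]]
    else:
        # API format: {"1": {"class_type": "LoadImage", "inputs": {...}}, ...}
        names = [
            node["class_type"]
            for node in workflow.values()
            if isinstance(node, dict) and "class_type" in node
        ]
    return Counter(names)


def load_node_types(path):
    """Read a workflow .json file and count its node types."""
    with open(path, encoding="utf-8") as fh:
        return node_types(json.load(fh))


# Made-up sample in the UI export layout.
sample = {"nodes": [{"id": 1, "type": "LoadImage"},
                    {"id": 2, "type": "SaveImage"},
                    {"id": 3, "type": "LoadImage"}]}
counts = node_types(sample)
```

Any node type in the result that your ComfyUI build does not list is a sign the template expects a newer version or an extra package.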

Can Gemma 4 generate images inside ComfyUI?

No. Gemma 4 is a text-output model. Inside ComfyUI, it functions as a reasoning and analysis node that takes text and images as input and produces text output. Use it upstream of your image generation nodes to improve prompt quality, not as a replacement for Flux, SDXL, or other diffusion models.

Is there a cost to use these models?

All three models are free and open-source. VOID and Gemma 4 use Apache 2.0 licenses. BiRefNet's license terms are documented in the ZhengPeng7/BiRefNet repository on GitHub. Verify the specific license for your use case before deploying in a commercial product.