ByteDance Research has released Lance, a 3B-parameter unified multimodal model that handles image generation, video generation, image editing, video editing, and visual question answering in a single framework released under Apache 2.0. The model card and accompanying technical paper went live this week. Lance is one of the few unified models at this size that can both create and edit across image and video without bolting on separate adapters or pipelines.

What this enables for creators

If you run open-weights image and video models from Hugging Face locally or on a single GPU, Lance gives you one model where you used to need three. Generate an image from text, edit that image with a follow-up prompt, then turn the edited frame into a short clip without leaving the model. Pair it with SANA-WM-style pipelines for longer clips, or swap it in as the text-to-image stage of your ComfyUI graph. Because generation and editing share one set of parameters, results should stay more consistent across generate and edit calls than they do when you chain two specialised models.
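The generate, edit, animate chain can be pictured as three calls against one model object. The sketch below is illustrative only: the `UnifiedModel` interface, its method names, and the `RecordingStub` class are hypothetical and are not Lance's actual inference API, which this article does not describe. It shows the shape of the workflow, with a stub standing in for the real model so the code runs on its own.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


class UnifiedModel(Protocol):
    """Hypothetical interface for a unified model; NOT Lance's real API."""

    def generate_image(self, prompt: str) -> str: ...
    def edit_image(self, image: str, instruction: str) -> str: ...
    def image_to_video(self, image: str, prompt: str) -> str: ...


@dataclass
class RecordingStub:
    """Stand-in model that records calls and returns fake artifact names."""

    calls: List[str] = field(default_factory=list)

    def generate_image(self, prompt: str) -> str:
        self.calls.append("generate_image")
        return "image_v0"

    def edit_image(self, image: str, instruction: str) -> str:
        self.calls.append("edit_image")
        return f"{image}_edited"

    def image_to_video(self, image: str, prompt: str) -> str:
        self.calls.append("image_to_video")
        return f"{image}_clip"


def generate_edit_animate(
    model: UnifiedModel, prompt: str, edits: List[str], motion: str
) -> str:
    """Run text-to-image, apply each edit in order, then animate the result."""
    image = model.generate_image(prompt)
    for instruction in edits:
        # Each edit consumes the previous output, so changes accumulate.
        image = model.edit_image(image, instruction)
    return model.image_to_video(image, motion)


if __name__ == "__main__":
    stub = RecordingStub()
    clip = generate_edit_animate(
        stub,
        prompt="a lighthouse at dusk",
        edits=["add fog", "make the lamp glow orange"],
        motion="slow pan left",
    )
    print(clip, stub.calls)
```

In a real setup you would replace the stub with a thin wrapper around whatever inference entry points the released code exposes. The point of the single `model` argument is that no intermediate result has to be handed to a second or third model.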

Why it matters

Most open-weights image generators stop at text-to-image. Most open-weights video generators stop at text-to-video. Unifying generation, editing, and understanding into one 3B model shrinks the dependency footprint and reduces the prompt drift you get from passing outputs between separate model calls. Apache 2.0 licensing means commercial use is on the table without negotiating a separate licence, which is often the deciding factor for freelance and small-studio creators. ByteDance shipping this at the 3B scale is also a hardware-friendliness signal: this is meant to be served, not just demonstrated.

Key details

Lance builds on Qwen2.5-VL-3B-Instruct and is trained with a staged multi-task recipe within a budget of 128 A100 GPUs, modest by frontier standards. Benchmarks reported on the model card: DPG-Bench 84.67 overall and GenEval 0.90 overall for image generation, GEdit-Bench 7.30 out of 10 for image editing, and VBench 85.11 total for video generation, a score the team claims leads unified models at the 3B scale. The model uses 3B active parameters and is published under Apache 2.0. ByteDance has also been pushing multimodal work elsewhere, including with the NVIDIA Cosmos Predict 2.5 ecosystem and other open-weights releases over the past month.

What to do next

Pull the weights from huggingface.co/bytedance-research/Lance, run the included inference samples for image and video generation, then test the edit path: generate, edit, and re-edit the same scene to see how identity holds across calls. Before deciding whether to make it your default, compare its output against your current open-weights image stack on a small held-out prompt set, using DPG-Bench-style and GenEval-style checks. The unified framing is the part that matters for production workflows, not the headline benchmark numbers.
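One way to structure the held-out comparison is a small harness that scores both stacks on the same prompts and reports the mean score and the per-prompt win rate. The sketch below is an outline under stated assumptions, not part of the Lance release: `compare_stacks`, the generator callables, and `score_fn` are all hypothetical names, and `score_fn` stands in for whatever metric you trust (a GenEval-style checker, a human rating, or similar).

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical types: a generator maps a prompt to some output object (for
# example an image path), and a scorer maps (prompt, output) to a float.
Generator = Callable[[str], object]
Scorer = Callable[[str, object], float]


def compare_stacks(
    prompts: List[str],
    candidate: Generator,
    baseline: Generator,
    score_fn: Scorer,
) -> Dict[str, float]:
    """Score two generators on the same prompts and summarise the result."""
    if not prompts:
        raise ValueError("need at least one held-out prompt")

    cand_scores = [score_fn(p, candidate(p)) for p in prompts]
    base_scores = [score_fn(p, baseline(p)) for p in prompts]

    # A "win" is a prompt where the candidate scores strictly higher.
    wins = sum(c > b for c, b in zip(cand_scores, base_scores))
    return {
        "candidate_mean": mean(cand_scores),
        "baseline_mean": mean(base_scores),
        "candidate_win_rate": wins / len(prompts),
    }


if __name__ == "__main__":
    # Toy stand-ins so the harness runs without any model installed.
    prompts = ["a red cube on a blue sphere", "two cats on a sofa"]
    fake_candidate = lambda p: p.upper()
    fake_baseline = lambda p: p[:10]
    # Toy scorer: output length relative to prompt length.
    fake_score = lambda p, out: len(str(out)) / len(p)
    print(compare_stacks(prompts, fake_candidate, fake_baseline, fake_score))
```

Reporting the win rate alongside the mean is a deliberate choice: on a small prompt set, a single outlier can move the mean a lot, while the win rate shows whether the candidate is better on most prompts or only on a few.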