Google released Gemma 4 12B on Wednesday, a new unified multimodal open-weights model that drops the dedicated image and audio encoders Gemma 4 used at launch. The 12B variant projects raw image patches and audio waveforms straight into the model's embedding space through lightweight linear layers, putting native vision, video, and speech understanding inside a single 12-billion-parameter checkpoint that fits on a laptop.

The model is live on Hugging Face in both pre-trained and instruction-tuned variants, with the encoder-free architecture documented as the headline change versus the original Gemma 4 E2B, E4B, 26B-A4B, and 31B sizes shipped on April 2.
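To make the encoder-free idea concrete, here is a minimal sketch of projecting raw image patches into an embedding space with a single linear layer. It is not Google's implementation: the patch size, embedding width, random weights, and the helper names `patchify` and `linear_project` are all hypothetical, chosen only to keep the example small and runnable with the standard library.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch row by row, pixel by pixel, channel by channel.
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def linear_project(vectors, weight, bias):
    """Apply y = xW + b to each vector; weight has shape in_dim x out_dim."""
    return [
        [
            sum(x * w_row[j] for x, w_row in zip(vec, weight)) + bias[j]
            for j in range(len(bias))
        ]
        for vec in vectors
    ]


# Hypothetical toy sizes (assumptions, not the model's real dimensions).
PATCH, CHANNELS, EMBED_DIM = 4, 3, 8
rng = random.Random(0)
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(8)] for _ in range(8)]

in_dim = PATCH * PATCH * CHANNELS  # 48 raw values per patch
weight = [[rng.gauss(0, 0.02) for _ in range(EMBED_DIM)] for _ in range(in_dim)]
bias = [0.0] * EMBED_DIM

# Each patch becomes one embedding vector, with no separate vision tower.
tokens = linear_project(patchify(image, PATCH), weight, bias)
print(len(tokens), len(tokens[0]))
```

The point of the sketch is the shape of the pipeline: raw pixels go through one matrix multiply and come out as token-sized vectors, where an encoder-based design would first run them through a separate vision network.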

What This Enables

Local video and audio Q&A without a separate captioner. The 12B model accepts frame sequences covering up to 60 seconds of video and audio clips of up to 30 seconds, handling ASR or speech-to-translated-text in one pass, alongside its 256K-token context and 140+ language coverage. For a creator workflow, that means dropping a podcast clip, a reference image, and a long prompt into one inference call on an M-series Mac or an RTX 4090 box and getting back transcript, translation, and visual reasoning together. Function calling and a configurable thinking mode are built in, so the same checkpoint can drive agent loops that previously required stitching together Whisper, a vision model, and a text LLM. Spec details and a quickstart are in the Gemma core docs.
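The duration limits above matter in practice, since a typical podcast clip runs well past 30 seconds. The sketch below checks clips against the quoted limits and splits an over-long clip into windows that fit. The `MediaClip` type and both helper functions are hypothetical, and chunking is an assumption about how you might work within the limits, not documented model behavior.

```python
from dataclasses import dataclass

MAX_VIDEO_SECONDS = 60.0  # limit quoted in the article
MAX_AUDIO_SECONDS = 30.0  # limit quoted in the article
LIMITS = {"video": MAX_VIDEO_SECONDS, "audio": MAX_AUDIO_SECONDS}


@dataclass
class MediaClip:
    kind: str       # "video" or "audio"
    seconds: float  # clip duration


def over_limit(clips):
    """Return the clips that exceed the quoted per-modality duration limit."""
    return [clip for clip in clips if clip.seconds > LIMITS[clip.kind]]


def split_windows(seconds, limit):
    """Return (start, end) windows covering a clip in chunks no longer than limit."""
    windows, start = [], 0.0
    while start < seconds:
        end = min(start + limit, seconds)
        windows.append((start, end))
        start = end
    return windows


clips = [MediaClip("video", 45.0), MediaClip("audio", 75.0)]
for clip in over_limit(clips):
    # Only the 75-second audio clip is over its limit here.
    print(clip.kind, split_windows(clip.seconds, LIMITS[clip.kind]))
```

Splitting on fixed boundaries can cut a word in half, so a real pipeline would more likely split on silence or overlap adjacent windows.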

Why It Matters

Two things shift here. First, the encoder-free design cuts the parameter overhead that separate vision and audio towers add on smaller Gemma 4 sizes, which is why Google can land full multimodal capability in 11.95B params instead of a 20B-plus stack. Second, the 12B slot was the missing rung between the 4B-class E4B (fast but limited in reasoning) and the 26B/31B sizes (too heavy for most laptops even when quantized). At BF16, the weights alone need roughly 24GB of VRAM (11.95B parameters at 2 bytes each), before any KV cache or activation memory. 4-bit quantization cuts the weights to about 6GB, which keeps the model under 8GB. That makes Gemma 4 12B the first single-model option for the "local agent that sees, hears, and reasons" use case Google has been pitching since I/O.
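The memory figures follow from simple arithmetic on the quoted parameter count, sketched below. The function covers weights only. It ignores KV cache, activations, and the per-block scale metadata that real 4-bit formats carry, so treat the results as lower bounds.

```python
PARAMS = 11.95e9  # parameter count quoted in the article


def weight_gb(params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = weight_gb(PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_gb(PARAMS, 4)   # half a byte per parameter
print(f"BF16: {bf16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

On a 24GB card such as the RTX 4090, the BF16 weights by themselves would use nearly all the memory, so a quantized build is the realistic choice for long-context or multimodal prompts.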

Key Details

The 12B model has 48 layers, a 262K vocabulary, and the same 256K context window as the rest of the Gemma 4 line. Google states that multimodal inputs include text, images at variable aspect ratios, video frame sequences, and audio. Output is text only. The license is the standard Gemma terms (commercial use allowed, with a small number of distribution restrictions), unchanged from the April 2 release. The Register's coverage of the original Gemma 4 launch explains the licensing posture if you have not seen it. Per the model card, Ollama, LM Studio, and llama.cpp integrations land on the same day as the Hugging Face drop.

What to Do Next

Pull the model in LM Studio or Ollama tonight and rerun whatever local multimodal workflow you have wired up against Llama 3.2 Vision, Qwen2-VL, or Gemma 4 E4B. The 12B should outperform the E4B on every multimodal benchmark Google ships and approach the 26B on reasoning while staying within reach of consumer hardware. Our prior guides on Gemma 4 2B local tool calling and the Framedex local video indexing pipeline drop straight into the 12B with no orchestration changes, since the API surface is identical.