StepFun pushed Step 3.7 Flash to Hugging Face on May 28, 2026, an open-weights 201B-parameter Mixture-of-Experts vision-language model with native image and video input, a 256K context window, and same-day quantized variants (FP8, GGUF, NVFP4) for local deployment. NVIDIA published a launch-day developer guide covering deployment on SGLang, TensorRT-LLM, and vLLM, plus a hosted endpoint on build.nvidia.com.

Try it: run multimodal extraction without a paid API

The fastest path for a creator workflow is the build.nvidia.com hosted endpoint, which exposes Step 3.7 Flash as a free prototype API behind a NVIDIA developer login. Send a long PDF, a screenshot stack, or a short video clip alongside a prompt like "extract every product spec, then summarize the table at the end" and the model returns structured JSON in a single call. For longer workflows that would exceed the hosted rate limit, swap to the FP8 weights running locally in vLLM, which fits on a single 80GB H100 thanks to the 11B-active sparse MoE design. The 256K context means an entire shoot's worth of frame stills, or a full operator's manual, lands in one request without chunking.

Why it matters

Open-weights multimodal at this size and license is the part of the stack that has been moving slowest. Most creators with a vision task have been routing to Gemini, Claude, or GPT for screenshot understanding, video summarization, and document extraction because the open alternatives (Qwen2.5-VL, InternVL3) topped out below 80B and had narrow context windows. Step 3.7 Flash matches the proprietary models on context length while staying inspectable, fine-tunable, and free to run on your own hardware. The NVFP4 build at 104B in particular is built to run on a single NVIDIA DGX Spark workstation, which is the form factor most independent studios already own.

Key details

Step 3.7 Flash uses a sparse MoE architecture that activates roughly 11B parameters per token, the same efficiency trick StepFun shipped with Step 3.5 Flash in March. The 3.7 release upgrades the vision tower to handle native video input (not just sampled frames) and extends context to 256K tokens, large enough to fit a feature-length screenplay plus reference images in a single prompt. NVIDIA's blog frames the target workload as enterprise document intelligence, but the multimodal endpoint handles the same prompts a creator would send to a captioning, OCR, or video-summary tool. Quantized variants are available in GGUF for llama.cpp users and NVFP4 for NIM container deployments, with FP8 covering the gap for standard inference servers.

What to do next

Decide whether you need cloud or local. Cloud-only creators should bookmark the build.nvidia.com endpoint and run a side-by-side test against whichever paid VLM they currently use for screenshot or PDF extraction. Anyone with an H100 or DGX Spark on hand should pull the FP8 or NVFP4 weights and benchmark on their own data, since the model card admits the NVFP4 quantization is the first community attempt and quality on long videos is still being characterized.