DeepSeek released DeepSeek-V4 Preview on April 24 with two MIT-licensed open models: V4-Pro at 1.6T total parameters (49B active) and V4-Flash at 284B total (13B active). Both ship with a default 1M token context window and are already live on the DeepSeek API, Hugging Face, and the company's own chat interface. It is the most aggressive open-weights release in 2026 to date, coming 18 hours after OpenAI's GPT-5.5 launch and just days after Anthropic reset Claude quotas for paying tiers.

Background

DeepSeek's release cadence has shortened every quarter since V2 in mid-2024. V3 arrived in December 2024 and shocked the field with a sub-$6M training run that matched GPT-4-class performance; R1 followed in January 2025 and was the first open-weights reasoning model to compete with o1 on math and code. V4 Preview is the natural next beat: the same Chinese-lab open-weights thesis, longer context, sparser activation, and a jump from 671B (V3) to 1.6T total parameters on the flagship. The company skipped a V3.x series and went straight to a V4 roughly two and a half times larger that is still cheaper to serve, thanks to Mixture-of-Experts routing that keeps only 49B parameters active per forward pass.

The timing is deliberate. In the 72 hours leading up to the drop, OpenAI shipped GPT-Image-2 and rolled it out to ChatGPT Plus and Pro, Google pushed Veo upscaling on Vertex AI, and Adobe detailed Firefly Foundry at its Summit. The closed frontier is moving fast. DeepSeek's reply is that open weights plus MIT licensing plus aggressive pricing can keep the open-source frontier within striking distance without waiting on Meta's Llama cadence.

Deep Analysis

The architecture story: MoE, sparse attention, FP4 training

Diagram contrasting DeepSeek V4-Pro's 1.6T total parameters with only 49B active per token

V4's headline number is 1.6T parameters, but the operational number is 49B active per token. That roughly 3% activation ratio is extreme even by MoE standards: Mixtral 8x7B activates about a quarter of its parameters, Qwen3.6's large MoE variant activates roughly 7%, and Grok-1 sat around 12%. DeepSeek is betting that routing precision (getting the right experts to fire for the right token) now matters more than raw activation width. The technical report points to two changes that make the ratio viable: DeepSeek Sparse Attention (DSA), which cuts attention compute at long context, and token-wise compression of the KV cache so that 1M tokens fit without exploding GPU memory.
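The activation arithmetic comes down to top-k gating: a small router scores every expert, and only the k highest-scoring experts run for a given token. A minimal sketch of that mechanism (the expert count, k, and gating details here are illustrative, not DeepSeek's actual router, which adds load balancing and shared experts):

```python
import numpy as np

def topk_route(gate_logits, k):
    """Pick the top-k experts for one token and softmax their gate weights.

    A minimal sketch of MoE gating; real routers add load-balancing losses,
    capacity limits, and shared always-on experts.
    """
    idx = np.argsort(gate_logits)[-k:]                     # k largest gate scores
    w = np.exp(gate_logits[idx] - gate_logits[idx].max())  # numerically stable softmax
    return idx, w / w.sum()

rng = np.random.default_rng(0)
n_experts, k = 256, 8
idx, weights = topk_route(rng.normal(size=n_experts), k)

# Only k of n_experts expert MLPs execute for this token.
print(f"active: {k}/{n_experts} = {k / n_experts:.1%}")  # active: 8/256 = 3.1%
```

The ratio of executed to total experts is what drives the 49B-of-1.6T arithmetic: total parameters set memory cost, but only the routed slice sets per-token compute.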

The training recipe is the other half of the story. DeepSeek used Muon-based optimization and FP4 quantization-aware training across roughly 32 trillion tokens. Muon is a recent optimizer that orthogonalizes momentum updates via Newton-Schulz iterations; several labs have adopted it for its loss-curve stability on very large batches. FP4 at training time (not just inference) is the first public confirmation that a lab has made it work end-to-end on a model this size. If the benchmarks hold under independent reproduction, DeepSeek has set a new reference point for training-cost efficiency at frontier scale.
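The core trick in quantization-aware training is fake quantization: the forward pass rounds full-precision master weights onto the low-bit grid so the network learns to tolerate the rounding, while the backward pass treats the rounding as the identity. A toy sketch, assuming a symmetric E2M1-style FP4 magnitude grid and per-tensor scaling (real recipes, DeepSeek's included, use finer-grained block scales):

```python
import numpy as np

# Positive magnitudes representable in an E2M1-style FP4 format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x):
    """Forward-pass fake quantization onto a signed FP4-like grid.

    QAT keeps full-precision master weights; the backward pass treats this
    rounding as the identity (straight-through estimator). Per-tensor
    scaling here is a simplification of real block-scaled recipes.
    """
    scale = np.abs(x).max() / FP4_GRID[-1]
    mag = np.abs(x) / scale                    # map magnitudes into [0, 6]
    snapped = FP4_GRID[np.argmin(np.abs(mag[..., None] - FP4_GRID), axis=-1)]
    return np.sign(x) * snapped * scale

w = np.random.default_rng(1).normal(size=(4, 4))
wq = fake_quant_fp4(w)
print("max abs rounding error:", float(np.abs(w - wq).max()))
```

The widest gap on the grid (between 4 and 6) bounds the worst-case rounding error at one sixth of the tensor's largest magnitude, which is exactly the kind of error the training loop has to absorb.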

1M context as a default, not an add-on

Visualization of the 1M token context window as a tall stack of paper wrapped with a marker labeled 1M

Long context has been a paid-tier feature across the closed frontier: Claude's 1M window is locked to Claude Code and Enterprise, Gemini's 2M sits behind Vertex AI quota. Open models have been climbing towards it: Qwen3.6-27B reaches 262K native with 1M via YaRN extrapolation, Kimi K2.6 ships 128K. V4 is the first open-weights model where 1M is the default ship configuration and where the published tech report documents how the compression survives needle-in-a-haystack at depth.

For creators, that matters because long-context-as-default changes the workflow unit. A scriptwriter can feed an entire feature-length screenplay plus reference notes plus three prior drafts into a single V4 call without chunking. A researcher can paste a whole arXiv paper library and ask for comparative synthesis. A video editor using LLM-assisted edit decision lists can include full transcripts of a 2-hour interview plus the editor's notes and pull pacing suggestions across the whole conversation. The constraint shifts from "can my context window hold this?" to "is the retrieval inside the context accurate enough to trust?"
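Whether a bundle like that fits is a quick back-of-envelope check. A sketch using the common four-characters-per-token heuristic (the function and document sizes are illustrative; use the model's real tokenizer before trusting a number near the limit):

```python
def estimate_tokens(docs, chars_per_token=4):
    """Rough token count for a batch of documents.

    Uses the common ~4-characters-per-token English heuristic; swap in the
    model's real tokenizer before trusting a number close to the limit.
    """
    return sum(len(d) for d in docs) // chars_per_token

# Illustrative sizes: a feature screenplay, reference notes, three drafts.
bundle = ["x" * 250_000, "x" * 40_000] + ["x" * 240_000] * 3
total = estimate_tokens(bundle)
print(total, total < 1_000_000)  # 252500 True -- comfortably inside 1M
```

Even that five-document stack lands around a quarter of the window, which is why the bottleneck moves from fitting the context to trusting retrieval inside it.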

MIT license plus open weights equals deployment flexibility

MIT license badge indicating permissive open-weights deployment for DeepSeek V4

The licensing is the quiet story. V4-Pro and V4-Flash are both released under MIT, the most permissive license still taken seriously by corporate legal teams. Llama's community license restricts use above 700M monthly active users and requires attribution; Gemma's terms are similarly restrictive; even Qwen's Apache-2.0 releases carry a notice requirement. MIT strips those constraints: a studio can fine-tune V4-Flash on a proprietary brand-voice corpus and ship the result inside a commercial product without downstream reporting or derivative-naming obligations.

Combined with weights on Hugging Face (V4-Pro, V4-Flash), that means creators have three deployment paths the closed frontier cannot match: run locally on rented H100s for full data sovereignty, fine-tune on a private corpus for brand or IP alignment, or host behind your own API for multi-tenant clients. None of those paths require DeepSeek's permission or involve per-seat licensing. For studios under NDA or working with unreleased IP, that is not a nice-to-have, it is the whole pitch.

Comparison chart: input/output API prices per million tokens across V4-Flash, V4-Pro, Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro
V4-Flash sits roughly 100x below Claude Opus 4.7 on input pricing at comparable quality for agentic tasks.

Pricing disruption at the API layer

Bar-chart comparison showing V4-Flash input pricing at $0.14 per million tokens versus Claude Opus 4.7 at $15 and GPT-5.5 at $10

V4-Flash starts at $0.14 per million input tokens and $0.28 per million output. V4-Pro lands at $1.74 and $3.48. For reference, Claude Opus 4.7 is $15 input and $75 output; GPT-5.5 sits at $10 and $40; Gemini 3.1 Pro at $1.25 and $10. Against Claude Opus, V4-Flash is roughly 107x cheaper on input and 268x cheaper on output; even V4-Pro is about 9x and 22x cheaper. For creators running agentic pipelines where long contexts and high output volumes are unavoidable (automated screenwriter rewrites, shot-list generation across a full storyboard, batch voice-script writing), the cost delta turns "this workflow is only viable at scale for Hollywood-budget clients" into "this workflow is viable on the freelance tier."
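The deltas are easiest to feel as per-run costs. A small calculator over the listed prices (the workload sizes are illustrative):

```python
PRICES = {  # USD per million tokens (input, output), as listed above
    "v4-flash": (0.14, 0.28),
    "v4-pro": (1.74, 3.48),
    "claude-opus-4.7": (15.00, 75.00),
    "gpt-5.5": (10.00, 40.00),
    "gemini-3.1-pro": (1.25, 10.00),
}

def run_cost(model, input_tokens, output_tokens):
    """USD cost of a single run at the listed API prices."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# A hypothetical screenplay-rewrite run: 400K tokens in, 60K tokens out.
for m in ("v4-flash", "v4-pro", "claude-opus-4.7"):
    print(f"{m}: ${run_cost(m, 400_000, 60_000):.2f}")
# v4-flash: $0.07, v4-pro: $0.90, claude-opus-4.7: $10.50
```

At those numbers a hundred-run batch job goes from a four-figure line item to pocket change, which is the pricing story in one loop.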

The drop-in compatibility layer is the other unlock. V4 exposes both the OpenAI ChatCompletions surface and the Anthropic Messages API, so any tool that already speaks to GPT or Claude can swap to V4 with a base_url change and a model name swap. Claude Code, OpenClaw, and Cursor all work out of the box. That removes the usual porting cost that has historically insulated closed-API incumbents from open-weights price pressure.
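In practice the port is a two-field change on an existing request. A minimal sketch (the base URL shown is DeepSeek's documented OpenAI-compatible endpoint; confirm it and the model name against current docs before shipping):

```python
def port_to_v4(request: dict) -> dict:
    """Repoint an OpenAI-style ChatCompletions request at V4-Flash.

    Only two fields change; messages, tools, and sampling parameters pass
    through untouched, which is the point of the compatibility layer.
    """
    return {
        **request,
        "base_url": "https://api.deepseek.com",  # confirm against current docs
        "model": "deepseek-v4-flash",
    }

gpt_request = {
    "base_url": "https://api.openai.com/v1",
    "model": "gpt-5.5",
    "messages": [{"role": "user", "content": "Draft a shot list for act one."}],
    "temperature": 0.7,
}
print(port_to_v4(gpt_request)["model"])  # deepseek-v4-flash
```

The same two-line diff applies whether the caller is a hand-rolled script or a tool like Cursor pointed at a custom endpoint.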

Impact on Creators

For independent creators, V4-Flash is the immediate winner: 1M context, a thinking mode, sub-$0.30-per-million output tokens, and MIT-licensed weights you can host yourself. The practical use cases are writing assistants that can hold an entire novel in context, storyboard generators that reason about a full script, and agent harnesses for research that cost tens of cents per run instead of tens of dollars.

For small studios and indie production houses, V4-Pro is the more interesting option. Fine-tuning a 1.6T model is out of reach for most, but the open weights enable LoRA adapters on top of V4-Pro that capture brand voice, house style, or proprietary character bibles without handing that IP to a third party's training pipeline. Several VFX shops and animation houses we've spoken with over the last six months have cited licensing terms as the primary blocker to adopting Claude or GPT in production; MIT-licensed V4-Pro removes that blocker cleanly.

For commercial tool builders (the Cursors, the Windsurfs, the Runway research teams), V4-Flash changes the unit economics of "pro tier with generous limits." A coding tool that burns 100M input tokens per power user per month pays about $1,500 at Claude Opus rates and $14 at V4-Flash rates. Sustainable free tiers and aggressive Pro-plan headroom suddenly become pricing strategy, not a loss-leader gamble.

Key Takeaways

  • Open-source frontier is closer than it looks. V4-Pro trails only Gemini 3.1 Pro on world knowledge benchmarks and claims open-model SOTA on agentic coding. The gap to closed frontier is measurable but no longer decisive for non-reasoning tasks.
  • 1M context is now a commodity at the top tier. Expect every serious open-weights release in the next six months to ship at 1M minimum or be treated as behind-the-curve.
  • MIT licensing removes the legal-department blocker. Studios that could not deploy Llama or Gemma internally for license reasons can deploy V4 without escalation.
  • V4-Flash is the immediate workhorse. 13B active parameters keep per-token compute low, though the full 284B weights still need to fit in (quantized) memory for local inference. The combination of cost, context, and permissive license makes it the default for AI writing assistants and long-context agents.
  • The deepseek-chat and deepseek-reasoner endpoints retire on July 24. Anyone still on those model names needs to migrate to deepseek-v4-flash or deepseek-v4-pro before then.

What to Watch

Three signals to track over the next 30 days. First, independent reproduction: Together AI, Fireworks, and SambaNova typically publish inference benchmarks within a week of a major open-weights drop. Their numbers will confirm whether DSA and FP4 claims hold on neutral hardware. Second, fine-tuning recipes: the first LoRA-on-V4-Pro adapters to hit Hugging Face will signal how approachable the model is for studios without a research team. Third, the closed frontier's response: Anthropic and OpenAI have both historically cut prices when open-weights competition closed the quality gap. Watch for quota increases or Opus/Pro-tier price cuts in the next two billing cycles.

If you're evaluating open-weights options alongside V4, Qwen3.6-27B remains the dense-model benchmark for coding, and Moonshot's Kimi K2.6 is worth running on your own workloads for comparison. For ongoing coverage of the closed-frontier side, our AI landscape deep dives are the fastest way to stay current.