DeepSeek made its 75% V4-Pro price cut permanent on May 23, 2026, ending a temporary discount that was set to expire May 31. The flagship reasoning model now bills $0.435 per million input tokens at a cache miss, $0.003625 per million on a cache hit, and $0.87 per million output tokens. The cache hit rate is one tenth of what it was at V4-Pro's April 26 launch, and the new sticker is locked at one quarter of the original $1.74 / $3.48 input/output. For the first time, a top-five reasoning model on public benchmarks sells under $1 per million output tokens, with no expiry attached. The DeepSeek pricing page reflects the change live.
Background
V4-Pro shipped on April 26 as DeepSeek's reasoning-tuned flagship variant, positioned above the smaller V4-Flash for coding, math, and agent workloads. Original launch pricing was $1.74 per million input tokens at a cache miss and $3.48 per million output, with a $0.03625 per million cache hit tier. Within four weeks DeepSeek rolled out a temporary 75% promotion that dropped the headline numbers to the May 23 figures, framed at the time as a window to drive trial. Saturday's announcement, per The Next Web, removes the May 31 expiry and presents the new floor as a long-term competitive posture. AI Weekly notes this is the lab's most aggressive long-term commitment to cost leadership since the V4 launch.
Why It Matters
The price-quality curve in May 2026 now has a permanent outlier at the cheap end. V4-Pro tops the Artificial Analysis leaderboard on several reasoning benchmarks, sitting next to Claude Opus 4.7, Gemini 3.1 Pro, GPT-5.5, and the freshly launched Qwen3.7-Max. Those four list at multi-dollar-per-million output, with Claude Opus 4.7 at $15 per million output and Gemini 3.1 Pro at $15 per million output. V4-Pro at $0.87 per million output is roughly 17 times cheaper than its closest benchmark peers, and the gap is no longer a quarterly promotion to be waited out.
For creators running unattended pipelines (script writing, code generation, voice agent reasoning, retrieval-grounded research) the variable is total monthly spend on a fixed workflow. A workload that runs $5,000 a month on Claude Opus 4.7 lands closer to $290 on V4-Pro at the same volume, assuming similar output token counts. That is no longer a price-shop decision; it is a budgeting decision that changes which workflows are profitable to automate at all.
Deep Analysis
Cache-Hit Pricing Becomes the Dominant Cost Lever
The bigger structural move is not the cache miss number, which already moved during the promotion. It is the cache hit tier at $0.003625 per million tokens, one tenth of the original launch rate and effectively a rounding error against the rest of the bill. The vast majority of production agent calls reuse a long system prompt, a tool catalog, and either pinned context or retrieved chunks that repeat across calls. When 80%+ of the input is a cache hit, input cost on V4-Pro drops below $0.01 per million effective tokens. The remaining bill is essentially the output number ($0.87 per million), with everything else rounding to zero.
This changes how creator pipelines should be architected. The historical rule was to keep system prompts short to control cost. The new rule is the opposite: load the system prompt with everything the model needs to be accurate (style guide, banned phrases, format examples, prior turn summaries), because input cost barely moves once you cross the cache threshold. The same is true for retrieved chunks that get reused across a batch run, like a fixed reference document or a stable taxonomy.
Where V4-Pro Lands vs Claude, GPT, Gemini, Qwen
The relevant peer set for V4-Pro is the frontier reasoning tier, not chat-tuned mid-range models. Claude Opus 4.7 lists at $15 input / $75 output per million tokens on the Anthropic pricing page (Anthropic also publishes a 90% prompt cache discount for repeated prefixes, which brings input down but not output). Gemini 3.1 Pro lists at $1.25 input / $10 output per million for prompts up to 200K tokens on the Google AI pricing page, climbing for longer context. GPT-5.5 and Qwen3.7-Max land in similar ranges based on public API listings. None of the closed-flagship peers cross under $5 per million output, and none publish a cache hit tier under $0.05.
The recent Qwen3.7-Max launch from Alibaba positioned itself on hallucination floor and 35-hour autonomous runs, with pricing aimed at the same enterprise reasoning workload. V4-Pro counters that play not by matching the agent-endurance claim but by undercutting the per-token economics by 5x or more, so any team building agent loops where they pay for thousands of unsupervised steps now has a defensible reason to standardize on DeepSeek for the bulk of calls and escalate to a more expensive model only where the workload demands it.
The Cache-Optimized Workload Profile
Not every workload benefits equally. The V4-Pro economics are dominated by cache-hit ratio. Jobs with strong prompt reuse capture most of the win. Examples include retrieval pipelines where the system prompt and tool catalog are stable across calls, content rewriters that batch hundreds of documents through the same instructions, scheduled scrapers that classify scraped pages through a fixed prompt, persistent character chat where the personality and lore are pinned across a session, and code review agents where the rubric is the same on every file.
One-shot creative tasks (a single long-form essay, an unusual creative brief, a novel research question) do not get the cache-hit benefit because the leading prefix changes every call. Those workloads still benefit from the $0.435 cache miss tier, which is also lowest in class, but the gap to peers narrows from 17x to roughly 3x. The strategic implication is to audit your workload mix and route the high-reuse traffic to V4-Pro first while keeping the long-tail creative work on whichever vendor produces the best single-shot output.
What "Permanent" Signals About the Margin War
Temporary discounts are tactical. Permanent reprice resets the floor that competitors have to defend against. The signal from DeepSeek's Saturday announcement is that the lab does not believe the closed-flagship vendors can match this price without taking a margin hit they are not prepared to swallow. That is consistent with the cost structure of Chinese open-weights labs that have spent the last 18 months optimizing inference economics on domestic GPUs, including Huawei's Ascend line. The full margin picture remains opaque, but the public posture is that $0.87 per million output is sustainable.
Historical pattern says the rest of the category trends down within a quarter. The closed-flagship vendors are unlikely to match $0.87, but they will move. The most likely shape is a new "Pro Lite" or "Mini Reasoning" tier from the major vendors over the next two quarters that sits between current Flash/Haiku-tier pricing and Pro/Opus pricing, aimed at recapturing the volume that V4-Pro will pull. Antigravity-style first-party developer tools, like the Antigravity CLI push toward Gemini 3.5 Flash as the default agent model, point to the same direction: smaller flagships at lower prices, optimized for agent-loop economics rather than chat.
Impact on Creators
The practical takeaway is to reroute the highest-volume agent loop in your stack (RAG, content rewriter, persistent character chat, batch script) through V4-Pro for a 24-hour test window and compare to your current bill. The DeepSeek API is OpenAI-compatible, so for most apps the swap is a base_url change and a key swap. The largest savings show on workloads where the same system prompt or retrieved context appears in 80%+ of calls. For solo creators running newsletters, research assistants, content pipelines, or always-on personality bots, V4-Pro at the new floor likely cuts the monthly bill by 80% or more without a quality regression on coding and reasoning benchmarks, where V4-Pro already outperforms most peers.
For creators who care about brand alignment, voice, or specific aesthetic outputs (creative writing, scriptwriting, dialogue, style-matched copy), single-shot quality still differs across labs. The recommendation is hybrid: route the structured, repeatable, agent-style calls to V4-Pro, and keep the creative one-shot calls on whichever vendor matches your voice. The cache-hit math punishes single-shot diversity, so the routing decision matters more than it did at the old price floor.
Key Takeaways
- Permanent, not promo. The May 31 expiry is removed; V4-Pro pricing is locked at $0.435/$0.003625/$0.87 per 1M tokens for input miss/hit/output.
- Cache hits dominate the economics. Output cost is the bottleneck. Input is effectively free when prompt reuse crosses ~80%.
- 17x cheaper than peers on output. No closed-flagship reasoning model lists output under $5 per million.
- The swap is mechanical. OpenAI-compatible API means a base_url change and key swap for most stacks.
- Hybrid routing wins. Send repeatable, high-reuse traffic to V4-Pro and keep voice-sensitive single-shot work on your current vendor.
What to Watch
Three things will tell whether this is the new market structure or a short-lived outlier. First, watch for a Claude Opus 4.7 or Gemini 3.1 Pro repricing announcement within the next 60 days. A 30-50% cut on output pricing from either lab would signal that the closed-flagship tier is being forced to defend share. Second, watch for a "Mini Reasoning" tier from OpenAI, Anthropic, or Google over the next quarter that targets V4-Pro's economics directly without sacrificing the flagship line. The recently launched Gemini 3.5 Flash and the rumored Claude Sonnet 5 are the most likely vehicles. Third, watch the cache-hit usage data DeepSeek publishes. If the lab quietly raises the cache hit rate after the trial window, that is the tell that $0.003625 was a customer-acquisition number rather than a structural price. The next two quarters will determine whether the V4-Pro floor holds or whether it bends back upward once the migration wave finishes.