A Miami-based startup called Subquadratic launched a frontier-scale language model on May 5 with a 12 million token context window, an OpenAI-compatible API, and benchmark numbers that, if independently confirmed, would reset the long-context pricing floor by roughly two orders of magnitude. The company calls the model SubQ, and it ships behind early access alongside a coding-agent CLI named SubQ Code that plugs into Claude Code, Codex, and Cursor.

The headline claim from the official launch post: at the 128K context length used by the RULER benchmark, SubQ scored 95% accuracy for $8 in compute, while Anthropic's Claude Opus scored 94% for roughly $2,600. That is a 300x cost reduction for a one-point accuracy gap, on a public, NVIDIA-built benchmark. Subquadratic also raised $29 million at a $500 million valuation, with checks from former SoftBank Vision Fund partner Javier Villamizar and Tinder co-founder Justin Mateen.

What SubQ Ships Today

Tall charcoal block with an orange disc cap visualizing a 12 million token context window.
12 million token working context.

Three products are live in early access as of May 5:

  • SubQ API: a 12M-token-context model served behind OpenAI-compatible endpoints, with streaming and tool-use support. Subquadratic markets it as a drop-in for use cases that today require expensive Claude or Gemini calls to chew through entire repositories or pipeline state.
  • SubQ Code: a CLI that hooks into Claude Code, Codex, and Cursor as a long-context layer. The pitch is that when those agents would otherwise burn turns mapping a codebase, SubQ Code captures the full repository context once and answers from it. Subquadratic claims a 25% reduction in coding-agent bills and 10x faster exploration.
  • Search: a free search product that runs over uploaded corpora at 12M tokens.

The architecture is what Subquadratic calls Subquadratic Sparse Attention (SSA). Standard transformer attention scales quadratically with context length, so a 10x longer prompt costs roughly 100x more compute. SSA, per the company's technical writeup, is structured to scale linearly. A full technical report is listed as "coming soon" rather than published, which matters for the skepticism point below.

For frame of reference, the current frontier ceilings sit at 200K tokens for Claude Opus, around 256K for GPT-5.5 Instant, and 2M tokens for Gemini 2.5 Pro. SubQ's claim, if it holds, is roughly 6x the largest commercially available context, served at a small fraction of the cost.

The Numbers Subquadratic Claims

Side-by-side cylinders, the right one short and orange, illustrating 300x lower cost.
300x lower cost than Transformer baselines at the same context length.

The launch announcement frames performance in three buckets, all on benchmarks that creators or coding agents actually hit in production. The comparison numbers reported by the company:

Context lengthSubQ vs frontier (claimed)What this would mean
128K tokens (RULER)95% accuracy at $8 vs Claude Opus 94% at $2,600~300x cost reduction at near-identical retrieval quality
1M tokens50x faster, 50x cheaper than leading frontier modelsLong-context queries (entire codebases, hour-long transcripts) become routine instead of premium
12M tokens~1,000x lower compute than equivalent quadratic-attention runsSingle-call analysis of ~120 books or a year of meeting transcripts

The RULER benchmark itself is real and well-respected. It is a NVIDIA-authored, COLM 2024 paper (arXiv 2404.06654) that goes beyond the easily-gamed needle-in-a-haystack test to include multi-hop tracing and aggregation tasks. The RULER team's prior finding was that most models that claim 32K+ contexts fail to maintain quality past that length. A 95% RULER 128K score, if reproducible, is genuinely competitive with the top of the long-context leaderboard, and the cost gap is what makes the claim notable. The full RULER repository is open-source for anyone wanting to reproduce.

What This Unlocks for Creators

Smooth gently curving charcoal ramp with an orange line tracing sub-quadratic scaling.
Sub-quadratic scaling. Cost rises slowly as context grows, then plateaus.

The 12M context number sounds like an engineering trophy. The practical version, if the API behaves the way Subquadratic says, is more concrete:

  • Whole-codebase coding agents. Today, Claude Code or Cursor either summarize repositories into compressed chunks or pay quadratic costs for every full-context turn. SubQ Code's pitch is to skip that compromise and feed the agent the entire repo as raw context. For a creator building tools or shipping side projects, that means coding agents that genuinely understand cross-file consequences instead of guessing from a sample. See our 2026 AI coding tools guide for current incumbents.
  • Hour-long video transcripts in one shot. A 60-minute YouTube transcript runs around 8K-12K tokens. A series of 50 such episodes fits inside SubQ's window with room to spare. For YouTubers, podcasters, or course creators, that means analytical workflows ("find every product the host endorsed in the last year, with timestamps") become a single API call rather than a chunked retrieval pipeline.
  • Long-form scriptwriting and continuity. Screenwriters and game-narrative writers currently rely on RAG to keep a model consistent with prior episodes or character bibles. A 12M context fits an entire shot script library, every revision history, and the show bible inline. Continuity drift becomes a tractable problem rather than an inevitable one.
  • Research synthesis at corpus scale. Reading 120 books in a single prompt is the toy example. The realistic version: drop in a year of arXiv papers from a subfield, ask for the timeline of a contested claim, get a literature-review-quality answer. The cost basis ($8 at 128K, scaling near-linearly) makes this cheap enough to iterate on.

The catch is that none of this works if the claims do not hold up under independent stress tests. Which is the next section.

The Skepticism Around the Claims

Three caveats every creator should keep in mind before architecting around SubQ:

No third-party verification yet. As of the May 5 launch, no independent group has reproduced the RULER 128K numbers or the 1,000x compute claim. The numbers are credible enough to take seriously but not yet credible enough to bet a production pipeline on.

The full technical paper is unpublished. The SSA writeup on Subquadratic's site is a marketing-flavored sketch, not a peer-reviewable description of the architecture. The full technical report is listed as "coming soon." Sparse-attention papers have a long history of impressive paper numbers that fail to translate when models scale. Linear-attention research from 2020-2024 hit this exact wall repeatedly.

Early access gating. The API is not openly available. Pricing is not published. The "request early access" friction means the first wave of independent benchmarks will come from a hand-picked set of partners, not the open community that broke prior long-context claims. Watch for benchmarks from outlets that did not get early access before treating the cost numbers as load-bearing.

The 12M-token-but-only-research-config detail. Multiple launch coverage outlets note the 12M context is "in research configuration," with the production API capacity still being clarified. If the production tier serves a smaller window, the 1,000x compute claim shrinks to whatever the practical ceiling actually is.

What to Try This Week

The action item for creators today is small and specific: request early access on the Subquadratic site and queue up a workload that the current generation of long-context models handles badly. Three concrete tests that will tell you whether SubQ holds up:

  1. Drop a single 200-300K-token codebase into SubQ Code (or the API) and ask cross-file questions Claude Code currently muddles. The win condition is a correct answer on the first turn.
  2. Embed a 4-hour podcast transcript in a single API call and ask for a structured timeline of guest claims with timestamps. The win condition is timestamps within 30 seconds of ground truth.
  3. Run RULER 128K against your own dataset (the benchmark is on NVIDIA's GitHub) and compare SubQ's accuracy and dollar cost head-to-head against your current incumbent. The win condition is reproducing the claimed 300x cost gap, not the accuracy gap.

If two of three tests land, SubQ is the cheapest route to a creative workflow that today gets gated by context cost. If they do not, the launch is one more in a long line of ambitious sparse-attention announcements that do not survive contact with production.

Frequently asked questions

Is SubQ open source or closed?

Closed. SubQ is served as a hosted API behind early access. The architecture writeup is published in marketing form, the full technical report is listed as "coming soon," and the model weights are not available. Treat it the same way you treat Claude or GPT-5.5: an opaque API with claimed properties.

Will SubQ replace Claude or GPT-5.5 for short-context work?

Almost certainly not. The headline numbers all live at long context. At 4K or 32K tokens, Claude Opus and GPT-5.5 already produce excellent answers at low cost, and there is no obvious reason a sparse-attention architecture would beat them on short prompts. SubQ's wedge is the long-context regime: agents handling whole codebases, transcripts, or document corpora.

Does SubQ Code work alongside Claude Code or replace it?

Alongside. Per the Subquadratic product page, SubQ Code is a CLI that plugs into existing coding agents (Claude Code, Codex, Cursor) and serves as a long-context layer. Expensive cross-repo turns get redirected through SubQ; the rest of the agent's behavior stays on the incumbent. The claim is a 25% reduction in coding-agent bills and 10x faster repo exploration, not a wholesale replacement.

How does SubQ compare to Gemini's 2M context?

Gemini 2.5 Pro is the only currently shipping API in the multi-million-token range, and its 2M ceiling is six times smaller than SubQ's claimed 12M. Per-token pricing is the unknown until SubQ publishes a rate card. If Subquadratic lands the 50x cost reduction at 1M tokens it is claiming, Gemini's long-context economics get harder to defend; if it does not, Gemini remains the practical default.

Is the $29M raise enough for a frontier-scale lab?

Not on its own. $29M at a $500M valuation buys a year or two of compute for a small frontier-scale lab, not the multi-year runway labs like Anthropic and OpenAI operate on. Watch for a Series A within 12 months. The architecture's actual capital efficiency is the load-bearing question: if SSA truly costs a fraction of quadratic attention to train, the funding gap matters less than it does for a standard transformer lab.