MiniMax M3 launched on June 1, 2026, becoming the first open-weights model to combine frontier-level coding performance, a 1-million-token context window, and native multimodal inputs that include both images and video. Priced at $0.60 per million input tokens, it enters the market at 8x to 12x lower cost than Claude Opus or GPT-5.5 while posting a 59.0% score on SWE-Bench Pro, narrowly beating GPT-5.5 (58.6%) on that benchmark.
What Happened

MiniMax, the Chinese AI lab behind the Hailuo video generation tools, announced M3 via API on May 31 with the full launch rolling out June 1. The model is built on a new attention mechanism called MiniMax Sparse Attention (MSA), which the company says solves the core bottleneck that has kept other labs from making million-token context practical at inference time.
The weights themselves are not yet downloadable. MiniMax stated they will release the model weights and a technical report within 10 days of launch, making this an open-weights release in the same category as Llama 3.1 and Mistral Large. The commercial license includes use-case conditions, so check the terms before building a product on top of it.
Access right now is through the MiniMax platform API at platform.minimax.io, a new code editor at code.minimax.io, and through OpenRouter where the model went live on June 1.
MiniMax Sparse Attention: The Architecture That Makes 1M Context Usable
Standard transformer attention scales quadratically with context length, meaning doubling context roughly quadruples compute. At 1 million tokens that math becomes prohibitive for real-time use. MSA addresses this by replacing full attention with a KV-block selection mechanism that evaluates only the most relevant context blocks for each token, cutting per-token compute by approximately 90% at long context.
The performance numbers versus M2 at million-token context are significant:
- Prefill speed: 9.7x faster
- Decoding speed: 15.6x faster
- Per-token compute: roughly 1/10th of M2
This matters because it makes sustained long-context use economically viable. You can submit an entire 400-page technical document, a full codebase, or hours of video transcript and get coherent analysis without waiting 45 seconds for the first token. For agentic workflows that loop over large corpora, the speed improvement changes what is feasible.
Benchmark Performance vs GPT-5.5, Gemini 3.1, and Claude Opus

MiniMax released a full benchmark table at launch. The results show M3 is competitive at the frontier level for coding and agentic tasks, with mixed results depending on which benchmark you weigh.
| Benchmark | MiniMax M3 | GPT-5.5 | Claude Opus 4.8 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro (coding) | 59.0% | 58.6% | 69.2% | 54.2% |
| Terminal-Bench 2.1 | 66.0% | 72.1% | 74.2% | 70.0% |
| BrowseComp (autonomous browsing) | 83.5% | -- | -- | -- |
| SWE-fficiency | 34.8% | -- | -- | -- |
M3 edges GPT-5.5 on coding but trails Claude Opus 4.8 by roughly 10 percentage points. On Terminal-Bench it ranks below all three closed models. Where it genuinely stands out is BrowseComp, a benchmark measuring autonomous web research tasks, where it scores 83.5% against no direct published competitor score.
In a real-world test, MiniMax ran M3 on the task of reproducing an ICLR 2025 Outstanding Paper without human help. The model ran for nearly 12 hours, produced 18 commits, and generated 23 experimental figures before submitting the completed work. That kind of sustained autonomous performance over a long session puts M3 in the same class as frontier agentic models regardless of where individual benchmark scores fall.
What This Enables for Creative AI Workflows
Most discussions of M3 focus on coding. The more interesting angle for creative producers is what the model does with long-context multimodal input at low cost.
M3 accepts text, images, and video as input. That means you can submit a full-length video transcript alongside the raw frames, ask the model to identify the 10 best moments for a highlight reel, and get back a structured response with timestamps and rationale. At $0.30 per million tokens during the launch promotion, processing a 2-hour video with 512K tokens of combined context costs under $0.16.
For asset-heavy projects, the 1M context window means you can drop an entire project directory, all style guide documents, and a library of reference images into a single prompt session and get consistent creative direction across the full scope of the work. Models like Liquid AI LFM2.5 brought efficient on-device inference for focused tasks, but M3 targets the opposite end: massive multimodal context for complex, long-session workflows that require keeping the full picture in view.
The computer-use capability also means M3 can operate desktop software directly. Combined with a 1M context window and video input, you can build an agent that watches screen recordings of your workflow and suggests or applies optimizations without requiring you to write structured prompts for each step.
Pricing and Access

The launch pricing through OpenRouter runs a 50% discount for the first 7 days:
- Launch promo: $0.30/M input, $1.20/M output
- Standard: $0.60/M input, $2.40/M output
- Cache reads: $0.12/M tokens
At standard pricing, a 500K input / 100K output task costs $0.54. The same call on Claude Opus at $5.00/M input and $25.00/M output would run $3.00 for input alone. The cost difference compounds fast at scale.
Direct API access goes through platform.minimax.io. Priority API access is available by emailing api@minimax.io. The model weights, once released, will require inference engines that implement MSA natively. Standard vLLM and llama.cpp will need MSA support added before local self-hosted runs are practical.
What to Do Next
If you work with long documents, large codebases, or video production pipelines, M3 is worth testing immediately while the 50% launch discount is active. Start with a context-heavy task you already know how to benchmark: submit your largest document collection or reference library and compare output quality against your current workflow model.
Get access at platform.minimax.io or through OpenRouter if you already have API routing set up. Watch the MiniMax GitHub and Hugging Face organization for the weight release, expected within 10 days of June 1.
If you have a commercial use case, read the license before integrating. The weights are open but the terms may restrict certain deployment types.
Frequently Asked Questions
Is MiniMax M3 truly open weights?
Yes, with a delay. The API launched June 1, 2026, but model weights and the technical report will be released within 10 days. Once available they will be downloadable like Llama or Mistral weights, subject to the commercial license terms.
What modalities does M3 support?
MiniMax M3 accepts text, images, and video as inputs. It outputs text only. The native multimodal support means you do not need a separate vision model or preprocessing step for image or video analysis tasks.
How does the 1 million token context work in practice?
M3 guarantees a minimum of 512K tokens and scales to 1M via the MiniMax Sparse Attention mechanism. MSA cuts per-token compute by roughly 90% at long contexts compared to standard attention, making 1M-token calls feasible without prohibitive latency or cost.
How does M3 compare to Claude Opus for coding?
M3 scores 59.0% on SWE-Bench Pro versus Claude Opus 4.8 at 69.2%. Opus leads by about 10 points on that benchmark. On autonomous browsing (BrowseComp), M3 scores 83.5% while comparable Opus numbers have not been published. M3 costs 8 to 12 times less per token.
Can M3 use computers autonomously?
Yes. Computer-use capability is listed as a core feature, allowing the model to operate desktop software directly without requiring a separate tool-calling layer. This is part of the agentic suite alongside terminal access and autonomous browsing.
When will self-hosted local inference be available?
Weights arrive within 10 days of the June 1, 2026 launch. Running them locally will require inference engine support for MiniMax Sparse Attention. Standard vLLM and llama.cpp do not currently support MSA, so local inference will follow after the inference ecosystem adds MSA implementations.