Researchers from CMU, Princeton, Together AI, and Cartesia AI published Mamba-3, a state-space model architecture that matches transformer performance while running inference roughly 7x faster on long sequences. Presented at ICLR 2026 and detailed on the Together AI blog, the architecture introduces three core improvements that close the quality gap with transformers while maintaining the efficiency advantages of state-space models.
What Happened
Mamba-3 builds on the earlier Mamba and Mamba-2 architectures with three technical advances. First, exponential-trapezoidal discretization improves how the model converts continuous dynamics into discrete steps, resulting in better sequence modeling accuracy. Second, complex-valued state updates allow the model to represent richer patterns in data using complex numbers rather than real-valued states alone. Third, a MIMO (multiple-input, multiple-output) formulation enables the model to process and generate multiple signal streams simultaneously.
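Discretization is the step that turns the continuous state equation x'(t) = a·x(t) + b·u(t) into a discrete recurrence the model can run token by token. As a rough intuition for why the integration rule matters, here is a toy scalar sketch contrasting plain zero-order-hold (exponential) discretization with a trapezoidal input rule; this illustrates the general numerical idea only, and the function names and scalar setup are assumptions, not the Mamba-3 implementation:

```python
import math

def zoh_step(x, u, a, b, dt):
    # Zero-order hold: the state transition uses the exact exponential
    # exp(a*dt), but the input u is assumed constant over the step.
    ad = math.exp(a * dt)
    bd = (ad - 1.0) / a * b  # exact ZOH input term for scalar a != 0
    return ad * x + bd * u

def trapezoidal_step(x, u_prev, u_curr, a, b, dt):
    # Bilinear (trapezoidal) rule: averages the input over the step,
    # giving second-order accuracy in the input instead of first-order.
    ad = (1.0 + a * dt / 2.0) / (1.0 - a * dt / 2.0)
    bd = dt * b / (1.0 - a * dt / 2.0)
    return ad * x + bd * (u_prev + u_curr) / 2.0
```

The trapezoidal rule's higher-order treatment of the input is the kind of accuracy gain the authors attribute to their exponential-trapezoidal scheme, which combines an exponential state transition with a trapezoidal input term.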
At the 1.5 billion parameter scale, Mamba-3 outperforms its predecessor Mamba-2 and matches Meta's Llama-3.2-1B across standard language benchmarks. The speed difference is dramatic: the Mamba-3 SISO (single-input, single-output) variant completes a benchmark sequence (n=16,384) in 140.61 seconds compared to 976.50 seconds for Llama-3.2-1B, a roughly 7x speedup (976.50 / 140.61 ≈ 6.9). That gap comes from the fundamental architectural difference between state-space models and transformers: transformers scale quadratically with sequence length, while SSMs scale linearly.
The full implementation is available on GitHub under the state-spaces organization, continuing the open-source tradition of the Mamba project.
Why It Matters
Transformers have dominated AI model architecture since 2017, but their quadratic scaling with sequence length creates real cost and latency problems as context windows grow longer. Every doubling of context length quadruples the compute required. State-space models like Mamba-3 avoid this entirely, scaling linearly instead.
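The scaling difference can be made concrete with a back-of-the-envelope FLOP count. The formulas below are standard asymptotic estimates with constants omitted, not figures from the paper:

```python
def attention_flops(n, d):
    # Self-attention: forming the n x n score matrix (QK^T) and the
    # weighted sum over values each cost on the order of n^2 * d.
    return 2 * n * n * d

def ssm_flops(n, d):
    # A linear state-space recurrence touches each token once,
    # so cost grows linearly with sequence length.
    return 2 * n * d

# The attention/SSM cost ratio grows with n itself, so doubling the
# context doubles the SSM's advantage.
for n in (4096, 8192, 16384):
    print(n, attention_flops(n, 64) // ssm_flops(n, 64))
```

Under this estimate, doubling n quadruples attention cost but only doubles SSM cost, which is exactly the gap that widens as context windows grow.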
Until now, the tradeoff was clear: SSMs were faster but less capable. Mamba-3 narrows that gap to the point where the performance difference may no longer justify the cost difference for many applications. For creative AI workflows that process long documents, extended conversations, or multi-step generation pipelines, 7x faster inference translates directly into lower costs and shorter wait times.
This is part of a broader wave of open-source AI innovation in March 2026 that is expanding the options available beyond the dominant transformer paradigm. If SSMs continue closing the quality gap, the economic pressure on transformer-based services could accelerate adoption of hybrid or pure SSM architectures.
Key Details
- Architecture: State-space model with exponential-trapezoidal discretization, complex-valued states, MIMO formulation
- Scale tested: 1.5 billion parameters
- Speed: 140.61s vs 976.50s for Llama-3.2-1B (roughly 7x faster at n=16,384)
- Quality: Matches Llama-3.2-1B, outperforms Mamba-2
- License: Open source on GitHub
- Published: ICLR 2026, arXiv March 16
What to Do Next
If you work with long-context applications or high-volume inference pipelines, clone the Mamba-3 repository and benchmark it against your current transformer setup on your specific workload. The 7x speed advantage is measured on a specific benchmark configuration, so your results will vary depending on sequence length and task type. For teams spending heavily on inference compute, even a partial speedup at comparable quality could meaningfully reduce costs.
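A minimal timing harness for such a comparison might look like the following. Here `generate_fn` is a placeholder for whatever generation call your stack exposes; the harness is a generic sketch, not part of the Mamba-3 repository:

```python
import time

def benchmark(generate_fn, prompt, n_tokens, warmup=1, runs=3):
    """Return the best wall-clock time (seconds) over several runs.

    generate_fn(prompt, n_tokens) is a placeholder for your model's
    generation call; warmup runs absorb compilation and cache effects.
    """
    for _ in range(warmup):
        generate_fn(prompt, n_tokens)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(prompt, n_tokens)
        times.append(time.perf_counter() - start)
    return min(times)
```

Run it at several sequence lengths drawn from your real workload rather than a single fixed n; the crossover point where the SSM's linear scaling pays off depends on your model, batch size, and hardware.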