ICML 2026: AI Models Could Run on 97% Less Memory

A paper accepted at ICML 2026 finds that transformer models do not need all three of their standard attention projections to perform well. The research, published in Proceedings of Machine Learning Research, shows that sharing the key and value projections cuts KV cache memory by 50% at a cost of 3.1% in perplexity. Combined with existing efficiency techniques, the reduction reaches 96.9%.

What Happened

Transformer models use three projection matrices in their attention mechanism: query (Q), key (K), and value (V). Queries and keys together determine what the model attends to, while values carry the content that gets passed along. Researchers Ali Kayyam, Anusha Madan Gopal, and M Anthony Lewis systematically tested three sharing constraints across synthetic tasks, vision benchmarks, and language modeling to determine whether merging projections hurts performance.

The answer, across 300M- and 1.2B-parameter language models trained on 10 billion tokens: largely no. Q-K=V sharing (using the same matrix for key and value) achieves a 50% KV cache reduction with 3.1% perplexity degradation. When stacked with grouped-query attention (GQA-4), the reduction rises to 87.5%. With multi-query attention (MQA), it reaches 96.9%.
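The three percentages are consistent with simple multiplication: sharing halves the cache, and reducing the number of key/value heads shrinks it further. Here is a minimal sketch of that arithmetic. It rests on two assumptions that the article does not state: a 16-head baseline (the head count at which MQA yields the reported 96.9%) and GQA-4 cutting the number of key/value heads by a factor of four. The function name is illustrative, not taken from the paper.

```python
from fractions import Fraction


def kv_cache_reduction(num_heads: int, num_kv_heads: int, share_kv: bool) -> Fraction:
    """Fraction of the baseline KV cache that is saved.

    Baseline: multi-head attention caching a separate K and V tensor per head.
    """
    tensors_per_head = 1 if share_kv else 2  # K=V sharing caches one tensor, not two
    baseline = 2 * num_heads
    reduced = tensors_per_head * num_kv_heads
    return 1 - Fraction(reduced, baseline)


NUM_HEADS = 16  # assumption: not given in the article

sharing_only = kv_cache_reduction(NUM_HEADS, NUM_HEADS, share_kv=True)
with_gqa4 = kv_cache_reduction(NUM_HEADS, NUM_HEADS // 4, share_kv=True)
with_mqa = kv_cache_reduction(NUM_HEADS, 1, share_kv=True)

for label, value in [
    ("K=V sharing", sharing_only),
    ("K=V sharing + GQA-4", with_gqa4),
    ("K=V sharing + MQA", with_mqa),
]:
    print(f"{label}: {float(value):.1%}")
```

Under these assumptions the sketch reproduces 50%, 87.5%, and 96.9%. With a different head count, the MQA figure would change while the other two would not.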

Why It Matters for Creators

The KV cache is one of the most significant memory consumers in transformer inference. It commonly takes over 30% of GPU memory during deployment and limits how long an input can be, how many requests can run in parallel, and what hardware a model can run on.
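To see why the cache grows so quickly, it helps to estimate its size. The sketch below uses the standard back-of-the-envelope formula: one cached key and one cached value vector per layer, per head, per token. The model dimensions are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes per element, e.g. fp16
    share_kv: bool = False,
) -> int:
    """Approximate KV cache size in bytes for one forward pass configuration."""
    tensors = 1 if share_kv else 2  # separate K and V, or one shared tensor
    return (
        tensors * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )


# Hypothetical configuration (assumption, for illustration only).
config = dict(num_layers=24, num_kv_heads=16, head_dim=128, seq_len=8192)

baseline = kv_cache_bytes(**config)
shared = kv_cache_bytes(**config, share_kv=True)

GIB = 1024 ** 3
print(f"separate K and V: {baseline / GIB:.2f} GiB")
print(f"shared K=V:       {shared / GIB:.2f} GiB")
```

Because the size is linear in sequence length and batch size, halving the per-token cost means the same memory budget holds roughly twice the context or twice as many parallel requests.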

Reducing the KV cache translates directly into more context per request, faster generation, lower cost per API call, and larger models running on consumer hardware. For creators using local AI tools for image generation, video transcription, or long-form writing assistance, this research points toward AI models that are both more capable and more accessible.

The paper confirms that projection sharing complements existing methods rather than replacing them. Model developers can stack these gains, meaning real-world tools built on future transformer architectures stand to benefit significantly.

Key Details

Primary technique: Q-K=V sharing (same projection matrix for key and value)
Accuracy cost: 3.1% perplexity degradation on language modeling
Maximum memory gain: 96.9% KV cache reduction when combined with MQA
Tested on: Synthetic tasks, vision (MNIST, CIFAR, TinyImageNet), and language modeling at 300M and 1.2B scale
Code: Open-sourced through the Brainchip-Inc GitHub organization
Venue: ICML 2026 (PMLR Vol. 306)
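
The primary technique listed above can be illustrated with a small single-head example. This is a minimal sketch of the general idea in plain Python, not the authors' implementation: the function names are hypothetical, there is no causal mask, and a real model would use a tensor library. The point is that one projection produces a single tensor that serves as both keys and values, so only that tensor needs to be cached.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def shared_kv_attention(x, w_q, w_kv):
    """Single-head attention where keys and values share one projection.

    Returns the attention output and the single tensor that would be cached.
    """
    q = matmul(x, w_q)
    kv = matmul(x, w_kv)  # used as keys AND as values
    scale = math.sqrt(len(kv[0]))
    scores = [
        [sum(qi * ki for qi, ki in zip(q_row, k_row)) / scale for k_row in kv]
        for q_row in q
    ]
    weights = [softmax(r) for r in scores]
    return matmul(weights, kv), kv


# Three tokens with two features each; identity projections keep the demo readable.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, cache = shared_kv_attention(x, identity, identity)

cached_values = len(cache) * len(cache[0])
print(cached_values)  # 6 numbers cached, versus 12 with separate K and V
```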

What to Do Next

No workflow change is needed today. This is architectural research that will influence future model releases and fine-tuning toolkits. Watch for this technique to appear in open-source frameworks like llama.cpp and the Hugging Face transformers library over the coming months. Models trained with projection sharing will be faster to run locally, use less VRAM, and handle longer contexts, all without requiring different hardware on your end. The code is already available for developers to experiment with.
