A paper accepted at ICML 2026 has found that transformer models do not require all three of their standard attention projections to perform well. The research, published in Proceedings of Machine Learning Research, demonstrates that sharing key and value projections cuts the KV cache by 50% at a cost of 3.1% perplexity degradation. Combined with existing efficiency techniques such as multi-query attention, the KV cache reduction reaches 96.9%.
What Happened
Transformer models use three projection matrices in their attention mechanism: query (Q), key (K), and value (V). Understanding how each projection shapes what the model attends to clarifies why sharing them is surprisingly safe. Researchers Ali Kayyam, Anusha Madan Gopal, and M Anthony Lewis systematically tested three sharing constraints across synthetic tasks, vision benchmarks, and language modeling to determine whether merging projections hurts performance.
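To make the idea concrete, here is a minimal sketch of single-head attention in which one projection serves as both key and value. It is an illustration under stated assumptions, not the authors' implementation: the function and parameter names (`shared_kv_attention`, `w_q`, `w_kv`) are hypothetical, there is no masking or multi-head logic, and it covers only the key/value tie, not the other sharing constraints the paper tests.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def shared_kv_attention(x, w_q, w_kv):
    """Single-head attention with one projection used as both key and value.

    x:    (seq_len, d_model) input, as a list of rows
    w_q:  (d_model, d_head) query projection (hypothetical name)
    w_kv: (d_model, d_head) shared key/value projection (hypothetical name)
    """
    q = matmul(x, w_q)
    kv = matmul(x, w_kv)  # the only tensor that would need to be cached
    d_head = len(kv[0])
    kv_transposed = [list(col) for col in zip(*kv)]
    scores = matmul(q, kv_transposed)
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    # The same tensor is reused as the value, so no separate V projection exists.
    return matmul(weights, kv)

# Toy input: two tokens, identity projections.
x = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = shared_kv_attention(x, identity, identity)
```

Because keys and values come from the same tensor, an inference engine built this way would store one tensor per token instead of two.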
The answer, across 300M and 1.2B parameter language models trained on 10 billion tokens: largely no. Q-K=V sharing (using the same projection matrix for key and value) achieves a 50% KV cache reduction with 3.1% perplexity degradation. When stacked with grouped-query attention (GQA-4), the reduction rises to 87.5%. With multi-query attention (MQA), it reaches 96.9%.
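The three percentages compose multiplicatively, which a few lines of arithmetic can show. The sketch below assumes a baseline of 16 attention heads and reads GQA-4 as leaving 4 key/value heads; neither detail is stated in this article, but both are consistent with the reported figures. The function name is hypothetical.

```python
def kv_cache_reduction(num_heads, num_kv_heads, share_kv):
    """Fraction of the KV cache saved relative to standard multi-head attention."""
    # Baseline: one key tensor and one value tensor per attention head.
    baseline_tensors = 2 * num_heads
    # With K=V sharing, each remaining KV head stores a single tensor.
    tensors_per_kv_head = 1 if share_kv else 2
    reduced_tensors = tensors_per_kv_head * num_kv_heads
    return 1.0 - reduced_tensors / baseline_tensors

NUM_HEADS = 16  # assumed head count, not taken from the article

sharing_only = kv_cache_reduction(NUM_HEADS, 16, share_kv=True)  # 0.5
with_gqa4 = kv_cache_reduction(NUM_HEADS, 4, share_kv=True)      # 0.875
with_mqa = kv_cache_reduction(NUM_HEADS, 1, share_kv=True)       # 0.96875
```

Under these assumptions the results are 50%, 87.5%, and 96.875%, the last of which rounds to the 96.9% quoted above.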
Why It Matters for Creators
The KV cache is one of the most significant memory consumers in transformer inference. It commonly takes over 30% of GPU memory during deployment and limits how long an input can be, how many requests can run in parallel, and what hardware a model can run on.
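For a rough sense of scale, the cache size can be estimated from a model's shape. The configuration below (24 layers, 16 key/value heads, head dimension 64, 8,192-token context, 16-bit values) is purely hypothetical and is not taken from the paper; the function name is likewise an assumption for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2, tensors_per_head=2):
    """Estimate KV cache size in bytes.

    tensors_per_head is 2 for separate keys and values,
    or 1 when a single shared key/value tensor is cached.
    """
    return (tensors_per_head * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model shape, chosen only for illustration.
baseline = kv_cache_bytes(num_layers=24, num_kv_heads=16, head_dim=64, seq_len=8192)
shared = kv_cache_bytes(num_layers=24, num_kv_heads=16, head_dim=64, seq_len=8192,
                        tensors_per_head=1)

print(baseline / 2**30)  # 0.75 (GiB)
print(shared / 2**30)    # 0.375 (GiB)
```

The estimate grows linearly with both sequence length and batch size, which is why the cache constrains context length and the number of parallel requests.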
A smaller KV cache translates directly into more context per request, faster generation, lower cost per API call, and larger models running on consumer hardware. For creators using local AI tools for image generation, video transcription, or long-form writing assistance, this research points toward AI models that are both more capable and more accessible.
The paper confirms that projection sharing complements existing methods rather than replacing them. Model developers can stack these gains, meaning real-world tools built on future transformer architectures stand to benefit significantly.
Key Details
- Primary technique: Q-K=V sharing (same projection matrix for key and value)
- Accuracy cost: 3.1% perplexity degradation on language modeling
- Maximum memory gain: 96.9% KV cache reduction when combined with MQA
- Tested on: Vision (MNIST, CIFAR, TinyImageNet) and language modeling at 300M and 1.2B scale
- Code: Open-sourced through the Brainchip-Inc GitHub organization
- Venue: ICML 2026 (PMLR Vol. 306)
What to Do Next
No workflow change is needed today. This is architectural research that will influence future model releases and fine-tuning toolkits. Watch for this technique to appear in open-source frameworks like llama.cpp and the Hugging Face transformers library over the coming months. Models trained with projection sharing will be faster to run locally, use less VRAM, and handle longer contexts, all without requiring different hardware on your end. The code is already available for developers to experiment with.