PrismML emerged from stealth on April 1, 2026 and immediately open-sourced its Bonsai series of large language models. The flagship Bonsai 8B packs 8.2 billion parameters into just 1.15 GB of memory, roughly 14x smaller than a comparable 16-bit model. Co-founded by Caltech mathematician Babak Hassibi, the company is betting that true 1-bit quantization can bring capable LLMs to devices that could never run them before.
For the broader landscape, see our open-source AI models 2026 creator reference.
What Happened
PrismML released three models under the Apache 2.0 license: Bonsai 8B (1.15 GB), Bonsai 4B (0.5 GB), and Bonsai 1.7B (0.24 GB). All three are available on HuggingFace, with an MLX version also provided for Apple Silicon users.
What separates Bonsai from previous quantization efforts is that every single weight in the model is either +1 or -1. That includes embeddings, attention layers, MLP blocks, and the output head. There are no high-precision patches, no mixed-precision tricks, and no fallback to higher bit widths for sensitive layers. It is 1-bit all the way through.
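To make the idea concrete, here is a minimal sketch of sign-based 1-bit quantization with a per-tensor scale, in the style of BitNet-like binarization. PrismML has not published its exact scheme, so the `binarize`/`dequantize` helpers and the scale-by-mean-absolute-value choice are illustrative assumptions, not the company's method:

```python
# Illustrative sketch of 1-bit weight quantization (an assumption,
# not PrismML's published scheme): every weight becomes +1 or -1,
# plus one full-precision scale per tensor.

def binarize(weights):
    """Map each weight to +/-1 and return a per-tensor scale.

    Using the mean absolute value as the scale minimizes the L2
    reconstruction error for sign quantization.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return signs, scale

def dequantize(signs, scale):
    """Approximate the original tensor as scale * sign(w)."""
    return [scale * s for s in signs]

w = [0.42, -0.13, 0.07, -0.58]
signs, scale = binarize(w)
print(signs)            # [1.0, -1.0, 1.0, -1.0]
print(round(scale, 2))  # 0.3
```

The storage win comes from the `signs` list: each entry needs only one bit on disk, while the single `scale` float is negligible amortized over millions of weights per tensor.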
The speed benchmarks reflect what that compression enables. On an M4 Pro Mac, Bonsai 8B runs at 136 tokens per second. On an RTX 4090, it hits 440 tokens per second. On an iPhone 17 Pro Max, it delivers 44 tokens per second. For context, a standard 16-bit 8B model cannot fit on any iPhone at all. Energy consumption is also reduced by approximately 4-5x compared to the 16-bit equivalent.
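The ~14x figure follows from simple arithmetic on the numbers above. The attribution of the small gap between the pure 1-bit payload and the shipped 1.15 GB file to scales, tokenizer data, and container overhead is an inference, not something stated in the announcement:

```python
params = 8.2e9  # Bonsai 8B parameter count from the announcement

# 16-bit baseline: 2 bytes per parameter.
fp16_gb = params * 2 / 1e9      # 16.4 GB

# Pure 1-bit payload: 1 bit per parameter.
onebit_gb = params / 8 / 1e9    # ~1.03 GB

# Shipped size is 1.15 GB; the ~0.12 GB gap presumably covers
# scales, tokenizer, and file-format overhead (an inference).
shipped_gb = 1.15

print(round(fp16_gb, 1))               # 16.4
print(round(fp16_gb / shipped_gb, 1))  # 14.3
```

The same arithmetic explains the iPhone claim: a 16-bit 8B model needs more RAM than any current phone has, while 1.15 GB fits with room to spare.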
Why It Matters for Creators
The creative AI workflow is increasingly defined by where models can run. Cloud inference adds latency, costs money per token, and requires an internet connection. Local inference removes all three constraints, but only if the model actually fits on the hardware you own.
Bonsai 8B changes that equation significantly. At 1.15 GB, it fits comfortably on laptops, desktops, and even phones. Creators working with Apple Silicon machines or AMD Ryzen AI desktops can run an 8B-parameter model without dedicating most of their system memory to it. That leaves room to run image generation, video tools, and other resource-heavy applications alongside a local LLM.
The phone benchmark is particularly notable. Running a capable language model at 44 tokens per second on a phone opens the door to fully offline creative assistants, brainstorming tools, and writing aids that work without a network connection and without sending data to any server.
The Apache 2.0 license means there are no restrictions on commercial use, modification, or redistribution. Developers can integrate Bonsai into creative tools, plugins, and workflows without licensing concerns.
What to Do Next
Download Bonsai 8B from HuggingFace and test it with llama.cpp or any GGUF-compatible runner. Apple Silicon users should try the MLX version for optimized performance. Read the full technical details in the PrismML announcement.
If you are already running local models, compare Bonsai 8B against your current setup. The 1.15 GB footprint means you can run it alongside other tools without memory pressure. For mobile developers, the Bonsai 1.7B at just 0.24 GB is worth exploring for on-device applications where even half a gigabyte is a stretch.