On May 6, 2026, Redis creator Salvatore Sanfilippo (antirez) published ds4, a Metal-only inference engine that runs DeepSeek V4 Flash on a 128GB MacBook Pro. The repository climbed to the front page of Hacker News with 390 points the same day, partly because of who wrote it and partly because the engine generates roughly 26 tokens per second on consumer Apple hardware from a model whose published checkpoints are 671B parameters with 37B active. What makes that fit in 128GB is a custom 2-bit quantization scheme that touches only the routed MoE expert weights and leaves everything else alone.

What ds4 Is

ds4 is a single-purpose inference engine. It is not a fork of llama.cpp, not a generic GGUF runner, and not a research project. From the README: "If you are not happy with AI-developed code, this software is not for you." The codebase is intentionally narrow, hand-tuned for a single model on a single hardware target.

Property       ds4
License        MIT
Hardware       Apple Silicon, Metal only (no CPU path, no CUDA, no AMD)
Model          DeepSeek V4 Flash, exclusively
Quantization   q2 (for 128GB) or q4 (for 256GB+)
Server         OpenAI / Anthropic-compatible HTTP API: /v1/chat/completions, /v1/completions, /v1/messages
Context        1M tokens, with disk-persistent KV cache
Languages      antirez calls out English and Italian as "much better" than smaller open models

Antirez credits Georgi Gerganov and the llama.cpp project in the README for their work on the GGUF format and Metal kernels that ds4 builds on. The engine itself was developed with "strong assistance from GPT 5.5," and the README is upfront about that.

Performance Numbers

The repository publishes benchmark numbers for the q2 build on two reference machines. All measurements use a 32K context window with thinking disabled, greedy decoding, and 256 generated tokens:

Machine                      Prefill (short)   Prefill (long, ~12K)   Generation
MacBook Pro M3 Max, 128GB    58.52 t/s         250.11 t/s             26.68 t/s (short) / 21.47 t/s (long)
Mac Studio M3 Ultra, 512GB   84.43 t/s         468.03 t/s             36.86 t/s

For context, those generation numbers put a 671B-parameter model in the same per-token throughput bracket as a 70B dense model running through llama.cpp on the same hardware. The win comes from the asymmetric quantization scheme described below.

The Quantization Trick

The reason ds4 fits inside 128GB at all is a "very asymmetrical" quantization scheme. From the README:

  • Routed MoE experts are aggressively quantized: up and gate projections at IQ2_XXS, down projection at Q2_K.
  • Shared experts, attention projections, and the routing layer are left untouched.

The bet is that DeepSeek V4 Flash's quality is concentrated in the layers that are visited every token (attention, routing, shared experts), while individual routed experts are visited less often and tolerate aggressive compression. That mirrors patterns the open-source quantization community has been documenting since DeepSeek V3, and antirez's contribution is bundling it into a turn-key engine for one specific model.

The q4 build for 256GB+ machines keeps the same shape but raises the routed-expert precision, roughly doubling the disk and memory footprint in exchange for less quality degradation.
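
To see where that "roughly 2x" comes from, a back-of-envelope sketch helps. The 2-bit figures below are llama.cpp's published block sizes for the named formats; the q4 build's precision is an assumption, since the README does not name its formats:

    # Bits-per-weight arithmetic for the routed-expert blob.
    # IQ2_XXS and Q2_K are llama.cpp block sizes; Q4_K-class
    # precision for the q4 build is an assumption, not from the README.
    IQ2_XXS = 2.0625   # bits per weight
    Q2_K    = 2.5625
    Q4_K    = 4.5      # assumed format for the q4 build

    # up and gate projections at IQ2_XXS, down at Q2_K: a 2:1 mix
    q2_bpw = (2 * IQ2_XXS + Q2_K) / 3   # ~2.23 bpw
    q4_bpw = Q4_K

    print(f"q2 expert blob: {q2_bpw:.2f} bpw, "
          f"{16 / q2_bpw:.1f}x smaller than fp16")
    print(f"q4 vs q2 footprint: {q4_bpw / q2_bpw:.1f}x")

Because the routed experts dominate the parameter count, the expert-blob ratio is a reasonable proxy for the whole-checkpoint ratio.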

Quickstart: Run DeepSeek V4 Flash on Your Mac

The following is the workflow the ds4 README documents. Total time from clone to first token is about 30 minutes on a 1Gbps connection, most of it spent on the model download.

  1. Clone and build.
    git clone https://github.com/antirez/ds4.git
    cd ds4
    make
    The build is C with Metal shaders. There is no Python, no CMake, no Cargo. make is the only build step.
  2. Download the model. antirez hosts pre-quantized weights on huggingface.co/antirez/deepseek-v4-gguf.
    ./download_model.sh q2   # for 128GB machines
    ./download_model.sh q4   # for 256GB+ machines
    The script wraps curl with resume support, which matters because the q2 build is large enough that an interrupted download is expensive.
  3. Start the server.
    ./ds4-server --ctx 32768 --port 8080
    The HTTP API is OpenAI-compatible at /v1/chat/completions and Anthropic-compatible at /v1/messages, so any client that speaks either protocol works without code changes. Point your existing Cursor, Claude Desktop, or Open WebUI configuration at http://localhost:8080; a minimal Python client follows this list.
  4. Persist KV cache. ds4 treats the KV cache as a first-class disk artifact. A long conversation can be saved, killed, and reloaded later without re-running prefill on the original prompt. That is the single biggest behavioral difference from a stock llama.cpp setup, and it is what makes 1M-token contexts practical on a personal machine.
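
Once the server is up, any OpenAI-SDK client can drive it. A minimal sketch using the openai Python package; the model id is a placeholder, since the README does not say what name the server expects:

    # Minimal client for the local ds4 server via the OpenAI SDK.
    # base_url and endpoint come from the README; the model id is
    # a placeholder -- substitute whatever the server actually reports.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1",
                    api_key="local")   # any placeholder key works locally

    resp = client.chat.completions.create(
        model="deepseek-v4-flash",     # placeholder model id
        messages=[{"role": "user", "content": "Summarize the ds4 README."}],
        max_tokens=256,
    )
    print(resp.choices[0].message.content)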

Why This Matters For Creators

Three concrete things change for creator-tool builders running local inference:

  • Local frontier-class quality without the $80K rig. Until ds4, a usable local DeepSeek V4 Flash setup required a multi-GPU server. A 128GB MacBook Pro is now sufficient. That puts a frontier-class open model in reach of a freelance developer or a small studio's lead engineer.
  • 1M-token context that survives a reboot. Disk-persistent KV cache is the kind of plumbing that does not show up in benchmark charts but reshapes what you can do with a model. Long-running creative projects (a code-base review, a multi-day design conversation, a manuscript draft) can resume rather than reprefill.
  • OpenAI/Anthropic-compatible endpoints out of the box. Every tool already wired to those protocols, including the bulk of Cursor's programmatic agent surface and Claude Code's local model support, can swap in a localhost URL and keep working; a sketch of the Anthropic-protocol path follows this list.
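
On the Anthropic side, the same server answers on /v1/messages, so the official anthropic Python SDK works with a redirected base URL. As before, the model id is a placeholder assumption:

    # Same local server, driven through the Anthropic protocol.
    import anthropic

    client = anthropic.Anthropic(base_url="http://localhost:8080",
                                 api_key="local")   # placeholder key

    msg = client.messages.create(
        model="deepseek-v4-flash",     # placeholder model id
        max_tokens=256,
        messages=[{"role": "user", "content": "Hello over /v1/messages"}],
    )
    print(msg.content[0].text)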

Where ds4 Is Not the Right Tool

The README is explicit about the boundaries, and they matter:

  • Single model. If you want to run Qwen, Llama, Mistral, or any other open model, llama.cpp or Ollama remain the right answer. ds4 will not load anything other than DeepSeek V4 Flash.
  • Apple-only. No CUDA path, no AMD path, no x86 CPU fallback. A Linux server with NVIDIA hardware should look at vLLM or sglang for DeepSeek V4 instead.
  • No fine-tuning, no LoRA. ds4 is an inference engine. Training-side work belongs elsewhere.
  • No web UI. The server speaks HTTP. Bring your own client.

What To Do This Weekend

If you have a 128GB or 512GB Apple Silicon machine and a creative-AI workflow that currently rents Anthropic or OpenAI tokens, the experiment is small: clone ds4, run the q2 build, point your existing tool at http://localhost:8080, and measure two numbers. First, end-to-end latency on a real task. Second, the cost-per-month delta versus the API spend. If both numbers move the right way, you have just collapsed an external dependency. If not, the experiment took an afternoon and an SSD.
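
For the latency half of that measurement, a minimal timing sketch under the same assumptions as the client example above; real_task.txt stands in for a prompt from your actual workflow, and the usage field is assumed to be populated the way the OpenAI API populates it:

    # Time one real task end to end against the local server.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="deepseek-v4-flash",     # placeholder model id
        messages=[{"role": "user", "content": open("real_task.txt").read()}],
        max_tokens=512,
    )
    dt = time.perf_counter() - t0
    generated = resp.usage.completion_tokens   # assumes usage is reported
    print(f"{dt:.1f}s end to end, {generated / dt:.1f} tokens/s")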

Either way, the publish event is a marker. A frontier-class open model on a personal machine, with disk-persistent context and OpenAI-compatible endpoints, is the new floor for what local inference looks like in 2026.

Frequently asked questions

What hardware do I need to run ds4?

An Apple Silicon Mac with 128GB unified memory minimum for the q2 build. The reference machines in the README are a MacBook Pro M3 Max with 128GB and a Mac Studio M3 Ultra with 512GB. The 512GB machine can run the higher-quality q4 build. Intel Macs and any non-Apple hardware are not supported.

Is DeepSeek V4 Flash actually good enough to replace Claude or GPT for daily work?

For coding, math, and English/Italian writing, antirez calls it "a quasi-frontier model." Independent benchmarks suggest DeepSeek V4 Flash is competitive with mid-tier proprietary offerings on reasoning tasks. It is not a drop-in replacement for the absolute frontier models on tasks that depend on tool use ergonomics or multimodal input.

How does ds4 compare to llama.cpp running the same model?

ds4 is hand-tuned for one model on one hardware path; llama.cpp is general. The asymmetric quantization, custom Metal kernels, and disk-persistent KV cache let ds4 run DeepSeek V4 Flash in 128GB where a stock llama.cpp setup typically cannot. For any other model, llama.cpp remains the right answer.

What is the licensing situation for the model and the engine?

The ds4 engine is MIT licensed. The DeepSeek V4 Flash weights ship under DeepSeek's model license (open weights with usage terms). Read both before shipping a commercial product on top of either.

Does ds4 work with my existing Cursor or Claude Code setup?

Yes. The server exposes OpenAI-compatible /v1/chat/completions and Anthropic-compatible /v1/messages endpoints. Configure your tool with a custom base URL pointing to localhost and a placeholder API key, and existing client code keeps working.

Why only Apple Silicon?

The README says it directly: macOS virtual memory behavior on Apple Silicon, combined with Metal's compute model, lets ds4 hold a 671B-parameter model's weights efficiently in unified memory. The CPU path was disabled because of macOS-specific paging issues. A Linux+NVIDIA setup has different and arguably easier tradeoffs, which is why the project's scope stops at Metal.

What does the disk-persistent KV cache change in practice?

Long contexts survive a reboot. A research conversation, a multi-file code review, or a long-running agent session can be saved to disk and reloaded without rerunning prefill on the original prompt. On a 1M-token context, that difference is measured in minutes per resume.