Open-source Python framework Forge adds structured guardrails to locally hosted 8B language models, enabling agentic task reliability that previously required cloud-scale models. Released in February 2026 by developer Antoine Zambelli, Forge drew 234 upvotes on a May 19 Hacker News Show HN post and surfaced in communities building self-hosted AI agent workflows.
What Happened
Zambelli published Forge on GitHub as an MIT-licensed Python package targeting creators and developers running AI agents on local hardware. The framework acts as a middleware layer between application code and a locally running model, adding reliability scaffolding that compact models typically lack. The project is backed by peer-reviewed research and includes a 26-scenario evaluation suite to benchmark your own model configurations.
The top configuration tested -- Ministral-3 8B Instruct Q8 on llama-server -- scores 86.5% across the full evaluation suite, and 76% on the advanced reasoning tier. These results use the quantized 8B model running entirely offline on local hardware.
Why It Matters
Small local models frequently fail at agentic tasks: they produce malformed tool calls, skip required steps, or loop without progress. The standard fix is upgrading to a larger, more expensive cloud model. One widely cited experiment showed 100 cloud-based agents consuming $1.3 million in API tokens in a single month -- costs that make local alternatives worth the engineering effort even with reliability tradeoffs.
Forge targets that tradeoff directly. Its guardrails run in-process alongside models served by llama.cpp or Ollama, catching and correcting model errors before they cascade into broken agent runs. The approach works on hardware as modest as a 16GB VRAM gaming GPU.
Key Details
Forge ships three integration modes:
- WorkflowRunner -- A structured agent loop with lifecycle management. Provide a model endpoint and a tool list; Forge handles error correction and retry logic automatically.
- Guardrails middleware -- Composable components that attach to an existing orchestration setup (LangChain, CrewAI, custom pipelines) without rewriting agent logic.
- Proxy server -- An OpenAI-compatible API endpoint that applies guardrails transparently. Any tool that speaks the OpenAI format works unchanged, including n8n workflows.
Supported local runtimes include Ollama, llama-server (llama.cpp), and Llamafile. The Anthropic API is also supported for hybrid setups. Core guardrails cover rescue parsing (fixing broken JSON tool calls), retry nudges, step enforcement, and VRAM-aware context compaction.
What to Do Next
If you run AI workflows on local hardware, start with the WorkflowRunner mode to benchmark your current model against Forge's eval suite -- the project includes a step-by-step Eval Guide so you can directly compare. The middleware mode is worth exploring once you have an existing agent pipeline that needs reliability improvements without a full rewrite. MIT license, Python 3.10+, no mandatory cloud dependency. See also: Zerostack's approach to lightweight 8MB coding agents for a different take on compact self-hosted AI agent design.