
Engineering teams building agentic coding pipelines now have a concrete open-source alternative to managed models such as Claude Fable 5, one that runs on a single H100. The trade-off: Cohere's North Mini Code, launched Tuesday, generated three times as many output tokens as comparable models in independent testing, a cost factor that compounds in high-volume production workloads.
The new open-source model is a 30-billion-parameter mixture-of-experts (MoE) model with 3 billion parameters active per token. It is built for agentic software engineering, including sub-agent orchestration, architecture mapping, code review, and terminal work. It supports a 256,000-token context window with a maximum generation length of 64,000 tokens, and it is available on Hugging Face under the Apache 2.0 license.
What North Mini Code can do
North Mini Code targets the full agentic coding stack. Here's what the model does and how.
Agentic software engineering. Cohere built North Mini Code specifically for agentic software engineering rather than adapting it from a general-purpose base. It has native tool calling and supports interleaved thinking, which Cohere says improves performance on multistep agentic tasks.
Architecture mapping and code review. North Mini Code can analyze and map system architecture, surface dependencies, and review code across large codebases. With its 256,000-token context window, it can hold substantial multi-file projects in a single context.
Terminal-based agent workflows. The model is trained for terminal environments: shell interaction, package management, scripting, and command-line tools. Cohere benchmarked it on Terminal-Bench v2, which tests agents in real terminal environments rather than on synthetic code-generation tasks.
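To get a rough sense of what fits in that window, the sketch below estimates a project's token count before sending it to the model. It assumes about four characters per token (a common rule of thumb, not Cohere's tokenizer) and, as a further assumption, that the 64,000-token generation budget is carved out of the same 256,000-token window. The helper names `estimate_tokens`, `fits_in_context`, and `load_repo` are hypothetical.

```python
# Hypothetical context-budget check for a 256,000-token window.
# Assumptions: ~4 characters per token (heuristic, not the real tokenizer),
# and the 64,000-token generation budget shares the same window.
from pathlib import Path

CONTEXT_WINDOW = 256_000
MAX_GENERATION = 64_000
CHARS_PER_TOKEN = 4  # rule of thumb; measure with the real tokenizer before relying on it


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(texts: list[str], reserve_for_output: int = MAX_GENERATION) -> tuple[bool, int]:
    """Return (fits, estimated_prompt_tokens) for the given file contents."""
    prompt_tokens = sum(estimate_tokens(t) for t in texts)
    budget = CONTEXT_WINDOW - reserve_for_output
    return prompt_tokens <= budget, prompt_tokens


def load_repo(root: str, suffixes: tuple[str, ...] = (".py",)) -> list[str]:
    """Read every file under root whose suffix matches, ignoring decode errors."""
    return [
        p.read_text(encoding="utf-8", errors="ignore")
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in suffixes
    ]


if __name__ == "__main__":
    sample = ["x = 1\n" * 10_000]  # 60,000 characters
    print(fits_in_context(sample))
```

Under these assumptions the prompt budget is 192,000 tokens, or roughly 768,000 characters of source; an exact count would require the model's own tokenizer.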
How is it built?
North Mini Code is a sparse mixture-of-experts model with 128 experts, 8 of which are activated per token, so its per-token compute is close to that of a 3-billion-parameter model despite its 30 billion total parameters. Cohere co-founder Nick Frosst demonstrated it running on a Mac Studio via MLX using about 20 gigabytes of RAM, the same machine he uses for his own local coding work.
Cohere trained the model with two stages of supervised fine-tuning, followed by reinforcement learning with verifiable rewards (RLVR) on more than 70,000 verifiable tasks spanning nearly 5,000 repositories, optimized against SWE-Bench.
Instead of optimizing against a single agent scaffold, Cohere trained across three. SWE-Agent uses a rich CLI with custom commands. Mini-SWE-Agent uses a single bash tool that returns raw shell output. OpenCode uses custom-written tools that return structured JSON. Cohere reports that this multi-harness approach produced a 10-percentage-point gain on OpenCode evaluations while maintaining SWE-Agent performance.
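The routing idea behind "128 experts, 8 active" can be sketched in a few lines. The expert counts below match what Cohere reports; the gating scheme (softmax over the top-k gate logits) is a common MoE design used here as an assumption, not Cohere's published implementation, and `route_token` and `moe_forward` are hypothetical names operating on toy scalar "experts".

```python
# Illustrative top-k expert routing for a sparse mixture-of-experts layer.
# 128 experts / 8 active matches Cohere's reported configuration; the
# softmax-over-top-k gating is an assumed, commonly used design.
import math
import random

NUM_EXPERTS = 128
TOP_K = 8


def route_token(gate_logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    max_logit = max(gate_logits[i] for i in top)
    # Numerically stable softmax restricted to the selected experts.
    exp = {i: math.exp(gate_logits[i] - max_logit) for i in top}
    total = sum(exp.values())
    return {i: w / total for i, w in exp.items()}


def moe_forward(x: float, experts, gate_logits: list[float]) -> float:
    """Weighted sum over the selected experts only; the other 120 are never called."""
    weights = route_token(gate_logits)
    return sum(w * experts[i](x) for i, w in weights.items())


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy experts: each scales its input by a fixed random factor.
    experts = [lambda x, s=rng.uniform(0.5, 1.5): s * x for _ in range(NUM_EXPERTS)]
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    weights = route_token(logits)
    print(len(weights), round(sum(weights.values()), 6))
    print(moe_forward(2.0, experts, logits))
```

Because only the selected experts run for each token, compute scales with the active parameters rather than the total, which is why a 30-billion-parameter MoE can behave like a much smaller model at inference time.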
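In RLVR, the reward comes from an automatic check rather than a learned judge. A minimal sketch, assuming a binary reward based on whether a task's test command exits successfully, might look like this; `verifiable_reward` and the 0/1 scheme are illustrative assumptions, since Cohere has not published its reward code.

```python
# Hypothetical verifiable reward: 1.0 if the task's tests pass, else 0.0.
# The binary scheme and function name are assumptions for illustration.
import subprocess
import sys
from typing import List, Optional


def verifiable_reward(test_cmd: List[str], cwd: Optional[str] = None, timeout: float = 600.0) -> float:
    """Run the test command; reward success, treat failures and timeouts as 0."""
    try:
        result = subprocess.run(test_cmd, cwd=cwd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    passing = [sys.executable, "-c", "assert 1 + 1 == 2"]
    failing = [sys.executable, "-c", "assert 1 + 1 == 3"]
    print(verifiable_reward(passing), verifiable_reward(failing))  # 1.0 0.0
```

In a repository-level setup, `test_cmd` would typically run the project's test suite in the checkout after the model's patch is applied, so the reward is tied to working code rather than plausible-looking output.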
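The harness differences matter because the same shell action reaches the model in different shapes. The sketch below contrasts a raw bash-tool observation with a structured JSON one. The adapters `run_raw` and `run_structured` are hypothetical and only loosely mimic the styles described above; they are not the actual Mini-SWE-Agent or OpenCode tool implementations.

```python
# Hypothetical adapters showing one command returned in two harness styles.
# These loosely imitate "raw shell output" vs. "structured JSON" tools.
import json
import subprocess
import sys


def run_raw(argv: list[str]) -> str:
    """Bash-tool style: stdout and stderr concatenated as plain text."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.stdout + proc.stderr


def run_structured(argv: list[str]) -> str:
    """JSON-tool style: a structured observation the model must parse."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return json.dumps({"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr})


if __name__ == "__main__":
    cmd = [sys.executable, "-c", "print('tests passed')"]
    print(repr(run_raw(cmd)))
    print(run_structured(cmd))
```

A model trained only on one of these formats may misread the other; for example, the raw form loses the exit code entirely, which is one reason training across several scaffolds can improve robustness.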
Where it fits
North Mini Code joins a market that includes Mistral's Devstral Small 2, GitHub Copilot, Cursor, and Claude Fable 5, each with different pricing and deployment trade-offs.
Cohere's primary benchmark comparison is Mistral's Devstral Small 2, a dense 24-billion-parameter model. In vendor-reported internal tests on identical hardware configurations, Cohere claims 2.8x higher throughput and a 30% inter-token latency advantage over Devstral Small 2. In its Hugging Face technical write-up, Cohere also claims North Mini Code outperforms larger open-source models, including 120-billion-parameter models with four times its total parameter count.
Artificial Analysis independently ranks it eighth out of 127 comparable open-weight models for output speed, at 210 tokens per second, with a time to first token of 0.25 seconds versus a class average of 1.95 seconds. It ranks 18th out of 127 on the Artificial Analysis Intelligence Index. One flag from the same data: the model generated 75 million output tokens to complete the Intelligence Index evaluation, against a class median of 25 million. In high-volume agent pipelines, that verbosity translates directly into inference cost and latency.
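A back-of-the-envelope model shows how speed and verbosity interact. The sketch uses the Artificial Analysis figures quoted above and assumes a constant decode speed, ignoring prompt length, batching, and network overhead; `response_seconds` and `verbosity_ratio` are hypothetical helpers, and the 1,000-token task size is an illustrative placeholder.

```python
# Rough latency model from the quoted figures: 0.25 s to first token,
# 210 output tokens/s. Assumes constant decode speed; estimates only.
TTFT_S = 0.25
TOKENS_PER_S = 210.0


def response_seconds(output_tokens: int, ttft_s: float = TTFT_S, tps: float = TOKENS_PER_S) -> float:
    """Approximate wall-clock time for one response: first token plus decode time."""
    return ttft_s + output_tokens / tps


def verbosity_ratio(model_tokens: float, baseline_tokens: float) -> float:
    """How many times more output tokens a model emits than a baseline."""
    return model_tokens / baseline_tokens


if __name__ == "__main__":
    ratio = verbosity_ratio(75e6, 25e6)  # Intelligence Index run vs. class median
    print(f"verbosity ratio: {ratio:.1f}x")  # 3.0x
    # Placeholder task: a 1,000-token answer vs. the same answer inflated
    # by the measured verbosity ratio, at identical decode speed.
    for tokens in (1_000, int(1_000 * ratio)):
        print(tokens, round(response_seconds(tokens), 2))
```

The fast time to first token helps less than it appears once responses run three times longer: decode time dominates, so per-task latency scales roughly with the verbosity ratio.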
"Suddenly people are wondering, am I getting enough economic value out of the tokens from a model?" Frosst said in a launch video. "Running models locally is a way to empower people and make AI something that actually works for them."
GitHub Copilot, Cursor, and Claude Code are priced per use or by subscription. Anthropic's Claude Fable 5, a managed coding model that is now publicly available, costs $50 per million output tokens. The model Frosst describes is the polar opposite of Fable.
"Small, affordable, Apache 2.0 and can be used locally. LLMs are the way to go. Small, open source, transparent and sovereign; large, expensive, proprietary and hegemonic," Frosst wrote in a post on X.
What this means for businesses
For teams building production agentic coding pipelines, the release of North Mini Code brings several decisions that have been building for months into focus.
Purpose-built agent training is now central to evaluation. The distinction between models fine-tuned for code and models trained specifically for agentic workflows, with verified tool calls and multi-harness robustness, is now an important factor in pipeline decisions. Any vendor claiming agentic coding capability should be able to say whether its training uses verifiable agent tasks or adapts a general-purpose base.
Verbosity is a hidden pipeline cost that benchmarks do not expose. Artificial Analysis measured North Mini Code generating three times as many output tokens as comparable models. In high-volume pipelines, that multiplier compounds into inference cost and latency. Testing against actual workload volumes is an evaluation step that benchmark rankings skip.
Deployment location is now a real architectural decision. With Fable 5 at $50 per million output tokens and North Mini Code running on a single H100, teams face a genuine trade-off between cost control and data residency on one hand and paying for managed infrastructure on the other. Teams running high-volume agentic coding pipelines should model both cost paths against actual workloads before committing to either.
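Modeling both cost paths can start as simple arithmetic. In the sketch below, only the $50-per-million-output-token price comes from the article. The GPU hourly rate, aggregate throughput, and workload size are placeholders to replace with your own measurements, and applying the 3x verbosity ratio relative to a managed model is itself an assumption, since Artificial Analysis measured it against open-weight peers. Input-token charges are ignored, and `managed_cost` and `self_hosted_cost` are hypothetical helpers.

```python
# Hypothetical cost comparison: per-token managed model vs. a self-hosted GPU.
# Only the $50/M output price is from the article; everything else is a placeholder.
HOURS_PER_MONTH = 730  # 24 * 365 / 12
MANAGED_USD_PER_M_OUTPUT = 50.0  # Claude Fable 5 output price cited above


def managed_cost(output_tokens: float, usd_per_m: float = MANAGED_USD_PER_M_OUTPUT) -> float:
    """Pay-per-token cost: output tokens priced per million."""
    return output_tokens / 1e6 * usd_per_m


def self_hosted_cost(output_tokens: float, gpu_usd_per_hour: float,
                     aggregate_tps: float, min_hours: float = 0.0) -> float:
    """GPU-time cost to generate the tokens at a measured aggregate throughput.

    min_hours models an always-on reservation (e.g. HOURS_PER_MONTH)
    instead of paying only for busy hours.
    """
    busy_hours = output_tokens / aggregate_tps / 3600
    return max(busy_hours, min_hours) * gpu_usd_per_hour


if __name__ == "__main__":
    tasks_per_month = 15_000          # placeholder workload
    managed_tokens_per_task = 10_000  # placeholder; measure on real tasks
    verbosity = 3.0                   # assumed to carry over from the AA comparison
    managed_tokens = tasks_per_month * managed_tokens_per_task
    local_tokens = managed_tokens * verbosity

    gpu_rate = 3.0         # placeholder USD per H100-hour
    aggregate_tps = 1_000.0  # placeholder batched throughput on one GPU

    print("managed:", round(managed_cost(managed_tokens)))
    print("self-hosted, busy hours only:",
          round(self_hosted_cost(local_tokens, gpu_rate, aggregate_tps)))
    print("self-hosted, always on:",
          round(self_hosted_cost(local_tokens, gpu_rate, aggregate_tps, min_hours=HOURS_PER_MONTH)))
```

The `min_hours` switch captures the main design question: a dedicated GPU costs the same whether it is busy or idle, so the self-hosted path only wins once volume is high enough to keep it utilized.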





