
Enterprises juggling separate models for reasoning, multimodal tasks, and agent coding can simplify their stacks: Mistral’s new Small 4 brings all three into a single open-source model, with an adjustable level of reasoning.
Small 4 enters a crowded field of small models, including Qwen and Claude Haiku, where it is competitive on output cost and benchmark performance. Mistral’s pitch: lower latency and shorter outputs, which translate into cheaper tokens.
Mistral Small 4 updates Mistral Small 3.2, released in June 2025, and is available under the Apache 2.0 license. “With Small 4, users no longer have to choose between a faster instruction model, a powerful reasoning engine, or a multimodal assistant: one model now provides all three with configurable reasoning effort and best-in-class efficiency,” Mistral said in a blog post.
Despite its small footprint, with 119 billion total parameters and only 6 billion active parameters per token, Mistral Small 4 combines the capabilities of Mistral’s other models, the company said. It has the reasoning capabilities of Trunk, the multimodal understanding of Pixtral, and the agent coding performance of Devstral. It also has a 256K context window, which the company says works well for long conversations and document analysis.
Rob May, co-founder and CEO of small language model company Neurometric, told VentureBeat that Mistral Small 4 stands out for its architectural flexibility. However, he said it joins a growing number of small models that risk adding more fragmentation to the market.
"From a technical point of view, yes, it can be competitive with other models,” said May. “The biggest issue is that it has to clear the market confusion. Mistral must first win the deal to get a chance to be a part of this test suite. Only then can they show the technical capabilities of the model.”
Reasoning on demand
Smaller models remain a good option for enterprise leaders who want a comparable LLM experience at a lower cost.
Like other Mistral models, Small 4 is built on a mixture-of-experts architecture. According to Mistral, it consists of 128 experts, four of which are active per token, which allows for efficient scaling and specialization.
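For readers unfamiliar with the pattern, here is a minimal, framework-free sketch of top-k expert routing using the numbers Mistral cites (128 experts, four active per token). The dimensions and variable names are illustrative assumptions, not Mistral’s implementation.

```python
# Toy sketch of top-k mixture-of-experts routing (illustrative only).
import numpy as np

NUM_EXPERTS = 128   # total experts per MoE layer, per the article
TOP_K = 4           # experts activated for each token, per the article
HIDDEN = 64         # toy hidden size, not the real model dimension

rng = np.random.default_rng(0)
router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS))          # router projection
expert_w = rng.standard_normal((NUM_EXPERTS, HIDDEN, HIDDEN))  # one weight matrix per expert

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts and mix their outputs."""
    logits = token @ router_w                                   # score all 128 experts
    top = np.argsort(logits)[-TOP_K:]                           # keep the 4 best-scoring experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()     # softmax over the winners
    # Only the selected experts run, which is why active parameters stay
    # far below the total parameter count.
    return sum(g * (token @ expert_w[i]) for g, i in zip(gates, top))

print(moe_layer(rng.standard_normal(HIDDEN)).shape)  # (64,)
```

The key point is that each token touches only a small slice of the total weights, which is how a model can carry a large total parameter count while keeping per-token compute, and therefore latency and cost, low.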
This allows Mistral Small 4 to respond faster, even on requests that require more reasoning. It can also process and reason over both text and images, allowing users to analyze documents and graphics.
The model has a new parameter, reasoning_effort, that allows users to “dynamically adjust the behavior of the model,” Mistral said. Enterprises can configure Small 4 to provide quick, lightweight answers in the same style as Mistral Small 3.2, or to be more verbose in the vein of Trunk, providing step-by-step reasoning for complex tasks, according to Mistral.
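To make the idea concrete, the sketch below shows how a per-request effort knob might be passed to an OpenAI-compatible chat endpoint. The URL, model id, and accepted values are assumptions for illustration; the field name simply mirrors the parameter described above, so check Mistral’s documentation for the actual API shape.

```python
# Hedged sketch: toggling reasoning effort per request on an assumed
# OpenAI-compatible endpoint. Endpoint, model id, and values are placeholders.
import requests

def ask(prompt: str, effort: str) -> str:
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",    # assumed local deployment
        json={
            "model": "mistral-small-4",                 # assumed model id
            "messages": [{"role": "user", "content": prompt}],
            "reasoning_effort": effort,                 # e.g. "low" for quick answers,
        },                                              # "high" for step-by-step reasoning
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Quick, lightweight answer vs. verbose, step-by-step response:
print(ask("Summarize this contract clause in one sentence.", effort="low"))
print(ask("Prove that the sum of two odd integers is even.", effort="high"))
```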
Mistral said Small 4 runs on fewer chips than comparable models, fitting on four Nvidia HGX H100 or H200 systems, or two Nvidia DGX B200 systems.
“Delivering advanced open-source AI models requires extensive optimization. Through close collaboration with Nvidia, the model is optimized for both the open-source vLLM and SGLang inference engines, providing efficient, high-performance serving across deployment scenarios,” Mistral said.
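Because the model is Apache 2.0 and tuned for vLLM, a team could try it locally with vLLM’s offline Python API along the lines below. The Hugging Face repo id and tensor-parallel size are placeholders, not values confirmed by Mistral.

```python
# Hedged sketch of local inference with vLLM; repo id and GPU count are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Small-4",   # hypothetical Hugging Face repo id
    tensor_parallel_size=4,              # e.g. shard the experts across four GPUs
)
params = SamplingParams(temperature=0.2, max_tokens=512)

outputs = llm.generate(
    ["Extract the invoice total and due date from the following document: ..."],
    params,
)
print(outputs[0].outputs[0].text)
```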
Benchmark performance
According to Mistral’s benchmarks, Small 4 performs close to Mistral Medium 3.1 and Mistral Large 3, especially on MMLU Pro.
Its instruction-following performance makes Small 4 suitable for high-volume enterprise tasks such as document understanding, Mistral said.
Although it competes with small models from other companies, Small 4 still underperforms some popular open-source models, especially on reasoning-heavy tasks. Qwen 3.5 122B and Qwen 3-next 80B outperform Small 4 on LiveCodeBench, as does Claude Haiku in instruct mode.
Mistral Small 4 did, however, beat OpenAI’s GPT-OSS 120B on LCR.
Mistral claims Small 4 achieves these scores with “significantly shorter outputs” that translate into lower latency and cost than other models. In instruct mode especially, Small 4 delivers the shortest outputs of any model tested: 2.1K characters versus 14.2K for Claude Haiku and 23.6K for GPT-OSS 120B. Outputs in think mode are longer (18.7K), which is expected for that use case.
May said that while the choice of model depends on an organization’s goals, latency is one of the three pillars enterprises should prioritize. “It depends on your goals and what you’re optimizing your architecture to accomplish. Enterprises should prioritize these three pillars: reliability and structured output, intelligence-to-latency ratio, and fine-tunability and privacy,” May said.





