
Over the past two years, businesses evaluating open-weight models have faced an awkward trade-off. Google’s Gemma line has consistently performed strongly, but its proprietary license — with usage restrictions and terms that Google can update at will — has pushed many teams toward Mistral or Alibaba’s Qwen. Legal review added friction. Compliance teams flagged concerns. And as capable as Gemma 3 was, “open with asterisks” is not the same as open.
Gemma 4 eliminates that friction entirely. Google DeepMind’s newest family of open models ships under the Apache 2.0 license: the same permissive terms used by Qwen, Mistral, Arcee, and most of the open-weight ecosystem.
There are no special provisions, no “harmful use” carve-outs, no restrictions on redistribution or commercial application that require legal interpretation. For enterprise teams waiting for Google to play by the same licensing terms as the rest of the field, the wait is over.
The timing is notable. As some Chinese AI labs (notably Alibaba with its latest Qwen models, Qwen3.5 Omni and Qwen 3.6 Plus) step back from fully open releases, Google is moving in the opposite direction: opening its most capable Gemma release yet and publicly stating that the architecture draws on Gemini 3 research.
Four models, two tiers: from edge to workstation in a single family
Gemma 4 arrives as four models arranged in two deployment tiers. The “workstation” tier comprises a 31B-parameter dense model and a 26B A4B Mixture-of-Experts (MoE) model, both supporting text and image input with 256K-token context windows. The “edge” tier consists of the compact E2B and E4B models designed for phones, embedded devices, and laptops, supporting text, image, and audio with 128K-token context windows.
The naming convention requires some unpacking. The “E” prefix stands for “effective parameters”: E2B has 5.1 billion total parameters but only 2.3 billion effective ones, because each decoder layer carries its own small embedding table through a technique Google calls Per-Layer Embeddings (PLE). These tables are large on disk but cheap to compute, so the model runs like a 2B model even though it is technically heavier.
The “A” in 26B A4B stands for “active parameters”: of the MoE model’s 25.2 billion total parameters, only 3.8 billion are activated during inference, meaning it delivers roughly 26B-class intelligence at a computational cost comparable to a 4B model.
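The parameter arithmetic above can be made concrete. A quick back-of-envelope calculation (illustrative only, using the counts quoted in this article) shows why active parameters, not total parameters, drive per-token compute:

```python
# Illustrative arithmetic using the figures quoted above. In a Mixture-of-Experts
# model, all experts stay resident in memory, but each token only exercises a
# small subset, so memory tracks total parameters while compute tracks active ones.

total_params = 25.2e9   # full 26B A4B parameter count (memory footprint)
active_params = 3.8e9   # parameters actually used per token (compute cost)

active_fraction = active_params / total_params
print(f"Share of the network active per token: {active_fraction:.1%}")  # ~15.1%

# Rough per-token FLOPs savings versus a dense model of the same total size
print(f"Approximate compute saved: {1 - active_fraction:.1%}")
```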
For IT leaders sizing GPU requirements, this translates directly into deployment flexibility. The MoE model can run on consumer-grade GPUs and should appear quickly in tools like Ollama and LM Studio. The 31B dense model needs more headroom (think NVIDIA H100 or RTX 6000 Pro for unquantized inference), but Google also ships Quantization-Aware Training (QAT) checkpoints that preserve quality at lower precision. Both workstation models can run in a fully serverless configuration on Google Cloud Run with NVIDIA RTX Pro 6000 GPUs, scaling to zero at idle.
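To put the headroom question in numbers, here is a hedged back-of-envelope sketch of VRAM needs for the 31B dense model at different precisions. The parameter count comes from the article; the bytes-per-weight values and the ~20% overhead factor for KV cache and activations are rough assumptions for illustration only:

```python
# Rough VRAM estimate: weights * bytes-per-weight, plus ~20% overhead for
# KV cache and activations (the overhead factor is an assumption).

def vram_gb(num_params, bytes_per_weight, overhead=1.2):
    return num_params * bytes_per_weight * overhead / 1e9

PARAMS_31B = 31e9
for precision, bpw in [("bf16", 2.0), ("int8", 1.0), ("int4 (QAT)", 0.5)]:
    print(f"{precision:>10}: ~{vram_gb(PARAMS_31B, bpw):.0f} GB")
```

Under these assumptions, bf16 lands around 74 GB (hence H100-class hardware), while a QAT int4 checkpoint drops below 20 GB, within reach of high-end workstation cards.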
The MoE bet: 128 small experts cut inference costs
The architectural choices of the 26B A4B model deserve special attention from teams evaluating inference economics. Instead of following recent large MoE models that use a few big experts, Google went with 128 small experts, activating eight routed experts plus one always-on shared expert per token. The result is a model competitive with dense models in the 27B–31B range while running at roughly the speed of a 4B model at inference.
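The routing scheme described above can be sketched in a few lines. This is a minimal, hedged illustration of top-k expert routing, not Google’s implementation; only the expert counts (128 routed, 8 active per token, 1 shared) come from the article, and the toy scalar “experts” are stand-ins:

```python
import math
import random

NUM_EXPERTS = 128  # routed experts (from the article)
TOP_K = 8          # experts activated per token (from the article)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits):
    """Select the TOP_K highest-probability experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_layer(x, router_logits, experts, shared_expert):
    """Always-on shared expert plus the weighted mixture of the routed top-k."""
    out = shared_expert(x)
    for idx, weight in route(router_logits).items():
        out += weight * experts[idx](x)
    return out

# Toy demo with scalar "experts" that just scale their input.
random.seed(0)
experts = [lambda x, s=random.uniform(0.5, 1.5): s * x for _ in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.1 * x
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
print(f"MoE layer output for x=1.0: {moe_layer(1.0, logits, experts, shared_expert):.3f}")
```

Because only 9 of the 129 expert computations run per token, per-token FLOPs stay near the 4B-class cost the article describes, even though every expert occupies memory.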
This is not just a benchmark curiosity; it directly affects serving costs. A model that delivers 27B-class capability at 4B-class throughput means fewer GPUs in production, lower latency, and cheaper tokens. For organizations running coding assistants, document-processing pipelines, or multi-turn agent workflows, the MoE variant may be the most practical option in the family.
Both workstation models use a hybrid attention mechanism that interleaves local sliding-window attention with full global attention, with the final layer always global. This design enables 256K context windows while keeping memory consumption manageable, an important consideration for teams processing long documents, codebases, or multi-turn agent conversations.
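A short sketch shows why the hybrid helps. The interleaving ratio below (five local layers per global layer) and the 48-layer depth are assumptions for illustration; the article states only that local sliding-window and global layers are mixed and that the final layer is always global:

```python
def attention_schedule(num_layers, locals_per_global=5):
    """Interleave sliding-window ("local") layers with full-attention ("global") ones."""
    pattern = ["global" if (i + 1) % (locals_per_global + 1) == 0 else "local"
               for i in range(num_layers)]
    pattern[-1] = "global"  # per the article, the final layer is always global
    return pattern

def kv_cache_entries(schedule, context_len, window=1024):
    """Per-layer KV entries: local layers cap out at the sliding-window size."""
    return sum(context_len if kind == "global" else min(window, context_len)
               for kind in schedule)

schedule = attention_schedule(48)  # hypothetical 48-layer model
hybrid = kv_cache_entries(schedule, 256_000)
all_global = 48 * 256_000
print(f"KV-cache entries at 256K context: {hybrid:,} hybrid vs {all_global:,} all-global")
```

Under these assumptions the hybrid schedule holds roughly a sixth of the KV-cache state an all-global model would need at 256K context, which is how long contexts stay affordable.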
Native multimodality: vision, audio, and function calling baked in from the start
Previous generations of open models typically treated multimodality as an add-on. Vision encoders were bolted onto text backbones. Audio required an external ASR pipeline such as Whisper. Function calling relied on prompt engineering and hope that the model would cooperate. Gemma 4 integrates all of these capabilities at the architectural level.
All four models accept variable-aspect-ratio image input with configurable visual token budgets, a significant improvement over Gemma 3n’s older vision encoder, which struggled with OCR and document understanding. The new encoder supports budgets from 70 to 1,120 tokens per image, letting developers trade detail against compute depending on the task.
Low budgets suffice for classification and captioning; higher budgets handle OCR, document analysis, and fine-grained visual reasoning. Multi-image and video input (processed as frame sequences) is natively supported, enabling visual reasoning across multiple documents or screenshots.
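As a concrete illustration of the budget trade-off, here is a hypothetical helper. Only the 70–1,120 budget range comes from the article; the per-task budget mapping and the text-token reservation are assumptions made for this sketch:

```python
# Hypothetical per-task visual token budgets within the article's 70-1120 range.
TOKEN_BUDGETS = {"classification": 70, "captioning": 256, "ocr": 1120}

def images_that_fit(context_window, task, reserved_for_text=8_000):
    """How many images at a given budget fit alongside reserved text tokens."""
    return (context_window - reserved_for_text) // TOKEN_BUDGETS[task]

for task in TOKEN_BUDGETS:
    print(f"{task:>14}: {images_that_fit(256_000, task)} images in a 256K window")
```

The point is that a classification workload can pack an order of magnitude more images into the same context than a full-detail OCR pass, which is exactly the lever the configurable budget exposes.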
The two edge models add local audio processing: automatic speech recognition and speech-to-text translation, all on device. The audio encoder has been compressed from Gemma 3n’s 681 million parameters to 305 million, with frame time reduced from 160ms to 40ms for more responsive transcription. For teams building voice-first applications that must keep data local (healthcare, field service, multilingual customer interactions), running ASR, translation, reasoning, and function calling in a single model on a phone or edge device is a genuine architectural simplification.
Function calling is native on all four models, building on Google’s FunctionGemma research released late last year. Unlike earlier approaches that relied on structured tool-use instructions in the prompt, Gemma 4’s function calling was trained into the model from the start, optimized for multi-turn agent flows with multiple tools. This shows up in agentic benchmarks, but more importantly it reduces the prompt-engineering overhead enterprise teams typically invest when building tool-using agents.
Benchmarks in context: where Gemma 4 lands in a crowded field
The benchmark figures tell a clear story of generational progress. The 31B dense model scores 89.2% on AIME 2026 (a rigorous mathematical reasoning test) and 80.0% on LiveCodeBench v6, and reaches a 2,150 Codeforces ELO, numbers approaching frontier-class results from proprietary models. On vision, it achieves 76.9% on MMMU Pro and 85.6% on MATH-Vision.
By comparison, Gemma 3 27B, which had no thinking mode, scored 20.8% on AIME and 29.1% on LiveCodeBench.
The MoE model follows closely: 88.3% on AIME 2026, 77.1% on LiveCodeBench, and 82.3% on GPQA Diamond, a graduate-level science reasoning benchmark. Given the MoE architecture’s substantial inference-cost advantage, the performance gap between the MoE and dense variants is modest.
The edge models punch above their weight class. The E4B scores 42.5% on AIME 2026 and 52.0% on LiveCodeBench, strong for a model that runs on a T4 GPU. The smaller E2B manages 37.5% and 44.0% respectively. Thanks to built-in reasoning, both significantly outperform Gemma 3 27B (without reasoning) on most benchmarks despite being a fraction of its size.
These numbers should be read against an increasingly competitive small-model landscape. Qwen 3.5, GLM-5, and Kimi K2.5 all compete aggressively in this parameter range, and the field is moving fast. What sets Gemma 4 apart is less any single benchmark than the combination: strong reasoning, native multimodality across text, vision, and audio, native function calling, 256K context, and a genuinely permissive license, all in a single model family with deployment options spanning edge devices to cloud serverless.
What enterprise teams should watch next
Google is releasing both pre-trained base models and instruction-tuned variants, which matters for organizations planning domain-specific fine-tuning. Gemma base models have historically been strong foundations for custom training, and the Apache 2.0 license now removes any uncertainty about whether fine-tuned derivatives can be deployed commercially.
The serverless deployment option via Cloud Run with GPU support is worth a look for teams that want scale-to-zero inference capacity. Paying only for actual compute used, instead of maintaining always-on GPU instances, can meaningfully change the economics of deploying open models in production, especially for internal tools and lower-traffic applications.
Google has hinted that this may not be the full Gemma 4 family and that additional model sizes will follow. But the combination available today, workstation-class reasoning models and edge-ready multimodal models, all under Apache 2.0 and all drawing on Gemini 3 research, represents the most complete open-model release Google has ever shipped. For enterprise teams that have waited for Google’s open models to compete on licensing terms as well as performance, the evaluation can finally begin without a trip to legal first.
