A 4B model, 128K context window, offline, on a Pi 5. Multimodal: images, video, audio. Apache 2.0 license, no commercia...
A 4B model, 128K context window, offline, on a Pi 5. Multimodal: images, video, audio. Apache 2.0 license, no commercial usage clause. Ranked #3 open model on Arena AI's global leaderboard at 31B. That's the week Google DeepMind had on April 2nd.
The edge story is the thing I can't move past. Specifically the E4B: 4.5B effective parameters, 8B with embeddings, running on the same class of hardware that people duct-tape to the back of kiosks and industrial sensors. The context window on that device is 128K tokens. The larger 26B MoE and 31B dense models push to 256K. For comparison, most open models at this price point were barely touching 8K context a year ago, and "multimodal" meant you bolted a vision encoder on the side and prayed the outputs were coherent.
The 26B MoE is the model I'd actually run if inference cost is a real constraint. 26B total parameters, 3.8B activated at inference time. That's not a rounding trick. That's what Mixture-of-Experts does when the routing is well-designed. It currently sits at #6 on Arena AI's open model leaderboard, outperforming models up to 20x its parameter count. The 31B dense lands at #3 with 89.2% on AIME 2026 math (no tools), 84.3% on GPQA Diamond. The Arena leaderboard is human-preference rated, not a static eval, which makes those numbers harder to dismiss.
The license deserves more attention than it's getting. Apache 2.0. Most releases that call themselves "open" have a usage clause buried in the terms that quietly rules out commercial deployment above a certain scale or in certain sectors. This one doesn't. The Gemma family has 400 million downloads across versions, and a lot of those teams were watching carefully for exactly this.
The Part the Benchmark Sheet Won't Tell You
Real concern: MoE routing overhead doesn't show up cleanly in benchmark numbers. Token throughput on a single long prompt is not the same as latency on a production pipeline handling bursty, short-context requests. If you're deploying the 26B MoE behind an API with variable traffic patterns, you need to profile it against your actual workload before you trust the headline figure. The benchmark setup and your inference stack are almost certainly not the same thing.
The 31B is a fine model. But without an A100 or equivalent sitting around, fine-tuning it isn't happening at full precision. The E2B and E4B are where the practically interesting story is for most teams: not because the numbers are more impressive, but because the hardware requirements are actually reachable.
Where to Actually Start
Start with Google AI Studio if you want a fast read on the 31B — no setup, runs now. Hugging Face and Ollama have weights for local deployment. vLLM and llama.cpp if you're integrating into a pipeline.
The 26B MoE is the one worth your time. The 31B is a benchmark story. The edge models are a product story. Know which problem you actually have.
Frequently Asked Questions
Q: Can I use Gemma 4 commercially without restrictions?
Yes. Gemma 4 is released under the Apache 2.0 license, which is fully commercially permissive. There are no usage clauses, revenue caps, or sector restrictions buried in the terms. You can build and ship products with it today without legal ambiguity.
Q: Which Gemma 4 model should I start with if I'm new to the family?
For most developers, the 26B MoE is the most practical starting point for server-side work: strong benchmark performance, low active parameter count at inference (3.8B), and 256K context. If you're on mobile or edge hardware, the E4B is the one to reach for. And if you just want to test capabilities quickly without any setup, the 31B is live right now in Google AI Studio.
Working on an AI integration and not sure where to start? The team at atxsoft.com helps engineering teams evaluate, integrate, and deploy open models like Gemma 4 into real products. Get in touch if you want a second opinion on your stack.
References
- Google DeepMind. Gemma 4: Our most capable open models to date. Google Blog, April 2, 2026. https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
- Hugging Face. Welcome Gemma 4: Frontier multimodal intelligence on device. Hugging Face Blog, April 2, 2026. https://huggingface.co/blog/gemma4