
Google releases Gemma 4 QAT models for on‑device AI
Google unveiled Gemma 4 quantization‑aware training models that shrink size by up to 4× and keep accuracy within 1‑2 % of the full‑precision baseline, targeting smartphones and laptops.
Google announced Gemma 4 quantization‑aware training (QAT) models: 2 B‑parameter and 5 B‑parameter transformers that are 2‑4× smaller than their full‑precision versions and deliver 30‑45 % lower latency on ARM smartphones and Intel Core i7 laptops [Google Blog].
What shipped
The release includes two checkpoint families: Gemma‑4‑2B‑QAT (2 B parameters, 1.2 GB storage) and Gemma‑4‑5B‑QAT (5 B parameters, 2.8 GB storage). Both use per‑channel 8‑bit weight quantization and 8‑bit activation scaling, keeping MMLU accuracy within 1 % of the original Gemma 4 FP16 models [Google AI Blog]. Google provides a TensorFlow Lite converter, a PyTorch‑compatible loader, and a reference inference script that reaches 3.6 GFLOP/s on a Snapdragon 8 Gen 2 chipset. The models are licensed under Apache 2.0 and can be downloaded from the Google Cloud Storage bucket linked in the blog post.
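For readers unfamiliar with the term, per-channel 8-bit weight quantization can be sketched in a few lines. This is a minimal illustration in plain Python, not Google's implementation: the function names and the toy weight matrix are hypothetical, and it assumes symmetric quantization with one scale per output channel (row). Quantization-aware training typically simulates this round trip (quantize, then dequantize) during training so the model learns to tolerate the rounding error.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_channel(weights: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per row (output channel)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Map the largest magnitude in this channel to 127; guard all-zero rows.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(q_rows: List[List[int]], scales: List[float]) -> Matrix:
    """Recover approximate float weights from int8 values and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def max_abs_error(original: List[float], restored: List[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical toy weights: one small-magnitude and one large-magnitude channel.
weights = [
    [0.02, -0.01, 0.005],
    [1.50, -0.75, 0.30],
]

q, scales = quantize_per_channel(weights)
restored = dequantize_per_channel(q, scales)

# For comparison: a single scale shared by the whole tensor (per-tensor).
flat = [w for row in weights for w in row]
q_flat, scale_flat = quantize_per_channel([flat])
restored_flat = dequantize_per_channel(q_flat, scale_flat)[0]

# Error on the small-magnitude channel under each scheme.
err_per_channel = max_abs_error(weights[0], restored[0])
err_per_tensor = max_abs_error(weights[0], restored_flat[:3])
```

With a shared scale, the large-magnitude channel dictates the step size and the small channel's weights collapse onto a handful of integer levels. Giving each channel its own scale keeps the rounding error proportional to that channel's range, which is the usual motivation for the per-channel variant.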
Why it matters
- On‑device compute becomes viable for larger models. A 5 B‑parameter model under 3 GB fits comfortably on a laptop with 16 GB RAM, which removes the need for cloud inference in many enterprise chatbot workloads.
- Battery life improves. The QAT models consume roughly 40 % less energy per inference on a Pixel 8 Pro, extending continuous on‑device AI usage from 2 hours to over 3 hours [Google AI Blog].
- Developers get a ready‑made compression pipeline. By releasing the QAT checkpoints and conversion tools, Google removes the need for custom quantization code, speeding time‑to‑market for on‑device AI features.
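The size and battery figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses only numbers quoted in this piece; the helper names are hypothetical, 1 GB is taken as 10^9 bytes, and the runtime estimate assumes the battery budget is spent entirely on inference, so runtime scales inversely with energy per inference.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return params_billion * bits_per_weight / 8.0


def runtime_hours(baseline_hours: float, energy_reduction: float) -> float:
    """Runtime on a fixed energy budget when each inference uses less energy."""
    return baseline_hours / (1.0 - energy_reduction)


# FP16 baselines versus the QAT checkpoint sizes quoted above.
fp16_2b = weight_footprint_gb(2, 16)   # 4.0 GB
fp16_5b = weight_footprint_gb(5, 16)   # 10.0 GB
ratio_2b = fp16_2b / 1.2               # about 3.3x smaller
ratio_5b = fp16_5b / 2.8               # about 3.6x smaller

# Average bits per parameter implied by the quoted checkpoint sizes.
implied_bits_2b = 1.2 * 8 / 2          # about 4.8
implied_bits_5b = 2.8 * 8 / 5          # about 4.5

# 40 % less energy per inference, starting from 2 hours of continuous use.
hours = runtime_hours(2.0, 0.40)       # about 3.3 hours
```

Both compression ratios land inside the quoted 2‑4× range, and a 40 % energy saving turns 2 hours into roughly 3.3 hours, matching "over 3 hours". The quoted checkpoint sizes work out to fewer than 8 bits per parameter on average, so they are not explained by 8‑bit weights alone.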
Editor’s take
The modest 1‑2 % accuracy dip is outweighed by the shift in edge‑AI economics. Developers can now ship models that were previously cloud‑only without sacrificing user experience, which pressures cloud providers to rethink inference pricing and gives hardware vendors a concrete benchmark for next‑gen AI accelerators.
Poll
Which on‑device AI strategy will you double down on?
- Quantization‑aware models (Gemma 4)
- Full‑precision models with cloud fallback
- Pruning + distillation pipelines
- Custom hardware‑accelerated inference


