Gemma 3
Google · Gemma family · Released March 2025
Google DeepMind's March 2025 Gemma family — vision-capable (4B+), 128K context, with official quantization-aware 4-bit variants.
Model card
| Developer | Google DeepMind |
|---|---|
| Family | Gemma |
| License | Gemma Terms |
| Modality | text+vision |
| Parameters (B) | 1, 4, 12, 27 |
| Context window | 128K (4B/12B/27B); 32K (1B) |
| Release date | March 2025 |
| Primary languages | en,fr,de,es,it,pt,ja,ko,zh,ar,hi |
| Hugging Face | google/gemma-3-12b-it |
| Ollama | ollama pull gemma3 |
Google just released Gemma 3, and the third generation of the open-weight Gemma family finally feels like a serious pleb tool rather than a research curio. The headline numbers: four sizes (1B, 4B, 12B, 27B), a 128K context window on the larger variants, native multilingual support for 140+ languages, and—this is the one that matters—vision capability built in. Gemma 3 is the first mainline Gemma generation that can see images.
Released under the Gemma license (permissive for most commercial use), these weights are up on Hugging Face, in Google’s announcement, and landing in Ollama today. For plebs who want a genuinely capable vision-language model running locally—describing screenshots, reading charts, captioning photos, answering questions about diagrams—Gemma 3 is the most accessible option that’s ever existed.
What’s in the weights
Gemma 3 descends from a research lineage that’s been public but undersold. Google’s Gemma 1 launched in February 2024 as a sibling to Gemini, distilling research from the closed-weight flagship into an open 2B/7B pair. Gemma 2 in June 2024 upgraded to 2B/9B/27B and introduced the alternating local/global attention pattern that’s become a Gemma signature. Gemma 3 keeps that attention scheme, adds vision, and pushes context to 128K.
Four sizes today:
- Gemma 3 1B: Text-only, 32K context. The "runs on your phone" tier.
- Gemma 3 4B: Multimodal (text + vision), 128K context. Single-GPU daily driver or Mac laptop model.
- Gemma 3 12B: Multimodal, 128K context. Sweet spot for 24GB consumer cards.
- Gemma 3 27B: Multimodal, 128K context. The pleb flagship—fits at Q4 on a single 3090.
Architecturally, Gemma 3 is a decoder-only Transformer with Grouped Query Attention, RoPE positional embeddings, and the hallmark Gemma 2 pattern of interleaved local and global attention layers—five local-attention layers with a 1,024-token sliding window for every one global-attention layer that attends over the full 128K context. This is Google’s way of keeping KV cache memory sensible at long context: most of the attention is cheap and local, and the few global layers do the long-range work. The technical report describes this as the main reason Gemma 3 can actually use 128K context on consumer hardware without OOMing.
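The KV-cache saving from this scheme is easy to sanity-check with arithmetic. A minimal sketch, where the layer count, KV head count, and head dimension are illustrative assumptions rather than exact Gemma 3 27B hyperparameters:

```python
# Back-of-envelope KV-cache estimate for interleaved local/global attention.
# Layer/head numbers below are illustrative assumptions, not Gemma 3's
# published hyperparameters; the point is the local-vs-global ratio.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache for n_layers: two tensors (K and V) per layer, fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def interleaved_cache_gb(total_layers=62, local_ratio=5, local_window=1024,
                         context=128_000, n_kv_heads=16, head_dim=128):
    n_global = total_layers // (local_ratio + 1)   # 1 global per 5 local
    n_local = total_layers - n_global
    # Local layers only ever cache up to their sliding window.
    local = kv_cache_bytes(n_local, n_kv_heads, head_dim,
                           min(local_window, context))
    glob = kv_cache_bytes(n_global, n_kv_heads, head_dim, context)
    return (local + glob) / 1e9

# Compare against a naive all-global design at 128K context:
all_global = kv_cache_bytes(62, 16, 128, 128_000) / 1e9   # ~65 GB
interleaved = interleaved_cache_gb()                       # ~11 GB
```

Under these toy numbers, the interleaved scheme cuts the 128K-context KV cache by roughly 6×, which is the intuition behind the "won't OOM on consumer hardware" claim.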
Vision is implemented via a SigLIP image encoder (a Google research output) feeding image tokens into the shared Transformer. The 4B, 12B, and 27B variants accept interleaved text and images in a single prompt. The image encoder is frozen during LLM training, which keeps the vision-language integration light and reproducible.
Training data: Google reports 14 trillion tokens for the 27B, 12 trillion for the 12B, 4 trillion for the 4B, and 2 trillion for the 1B. The data mix includes a substantial code corpus, math-heavy synthetic data, and explicitly expanded multilingual coverage—Google claims 140+ languages with meaningful capability, up sharply from Gemma 2’s effectively-English-centric focus. The Gemma 3 technical report has the full details, including data-contamination audits and the usual disclosures about filtering.
Benchmarks at release
Per Google’s release blog and the Gemma 3 technical report, scores on public benchmarks at release:
- MMLU-Pro: 27B scores 67.5, 12B at 60.6, 4B at 43.6—competitive with much larger models
- LiveCodeBench: 27B at 29.7, showing solid but not frontier-tier code capability
- Global MMLU (multilingual): 27B at 75.1, reflecting the expanded language training
- MMMU (multimodal reasoning): 27B at 64.9, strong vision-language performance
- DocVQA: 27B at 85.6, making it usable for document understanding tasks out of the box
Google places Gemma 3 27B as competitive with models 2–3× its parameter count on the LMSys Chatbot Arena ranking, claiming an Elo around 1338 at release—above Llama 3 70B’s position and approaching GPT-4o territory for general chat. Arena rankings are noisy and shift as more votes come in, so treat the number as a rough signal rather than gospel. Community reproductions on the Open LLM Leaderboard will tell the real story over the next few weeks.
The most interesting benchmark claim, if it holds up: Gemma 3 4B multimodal is competitive with Gemma 2 27B on text tasks. A 7× parameter reduction at similar quality would be a notable efficiency win, and it’s the kind of claim that independent evaluators will scrutinize carefully.
What it means for the sovereign pleb
For the sovereign AI thesis, Gemma 3 is the vision-language piece that’s been missing. Until today, plebs who wanted local multimodal had to choose between Moondream (small and fast but limited), LLaVA (dated), InternVL (capable but awkward), or the various Qwen-VL releases (great but rarely the default). Gemma 3 27B is the first "it just works" open-weight vision-language model at the quality tier plebs actually want.
VRAM requirements at Q4_K_M:
- Gemma 3 1B: ~700MB — phone-class, runs anywhere
- Gemma 3 4B: ~3GB — RTX 3050 8GB, any M-series Mac, low-end gaming laptop
- Gemma 3 12B: ~8GB — RTX 3060 12GB sweet spot, Mac with 16GB+ unified memory
- Gemma 3 27B: ~17GB — single RTX 3090/4090, A5000, or Mac with 32GB+. Pleb flagship.
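The figures above follow from simple arithmetic: weight bytes at the quant's average bits-per-weight, plus runtime overhead. A rough sketch, where the ~4.5 bits/weight average for Q4_K_M and the 15% overhead factor are assumptions, not measured values:

```python
# Rough GGUF VRAM estimate: parameter bytes at the quant's average
# bits-per-weight, plus a fudge factor for runtime buffers and KV cache.
# The 4.5 bits/weight (Q4_K_M mixes quant types across tensors) and the
# 1.15 overhead multiplier are assumptions for illustration.

def vram_gb(params_b, bits_per_weight=4.5, overhead=1.15):
    return params_b * bits_per_weight / 8 * overhead

for size in (1, 4, 12, 27):
    print(f"Gemma 3 {size}B @ Q4_K_M ≈ {vram_gb(size):.1f} GB")
```

These estimates land close to the list above (~17 GB for the 27B, ~8 GB for the 12B), which is why the 27B at Q4 fits a 24 GB card with room to spare.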
Vision inference adds a modest VRAM overhead for the SigLIP encoder (~800MB extra) and per-image token budget. A 1024×1024 image consumes roughly 256 image tokens from Gemma 3’s perspective, so a 5-image prompt eats ~1.3K of your context window for the images alone. Plan accordingly when feeding long documents.
For the used RTX 3090 pleb rig, Gemma 3 27B Q4 is the new vision-language default. It replaces LLaVA-based stacks, it makes vision workflows in Open WebUI tenable without a dedicated second model, and it gives you a single 128K-context model that can handle both text and images in one context. For quant selection advice, our GGUF explainer covers the tradeoffs—Q4_K_M remains the pleb default for the 27B on a single 24GB card, and Q8 on dual-GPU setups is overkill for most workloads but excellent for document-heavy tasks where every bit of multimodal fidelity matters.
A Hashcenter workflow that makes sense today: Gemma 3 27B for multimodal general work (describing screenshots, reading diagrams, captioning product photos for an e-commerce pleb workflow), paired with a text-focused 70B like Llama 3.1 for deep reasoning tasks. The 27B leaves generous VRAM headroom on a 24GB card for large images or long context; the 70B on dual 3090s handles the heavy lifting. Route queries via a lightweight orchestrator or Open WebUI’s model-picker.
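A lightweight orchestrator for that two-model split can be as simple as a routing function. A minimal sketch, where the model tags match the Ollama names used in this post and the text-routing heuristic is an assumption you'd tune for your own workloads:

```python
# Minimal routing sketch for the two-model stack described above: anything
# with an image goes to the multimodal 27B, heavy text reasoning goes to
# the 70B. The word-count threshold is an arbitrary illustrative heuristic.

def pick_model(prompt: str, has_image: bool) -> str:
    VISION_MODEL = "gemma3:27b"    # multimodal generalist, single 3090
    TEXT_MODEL = "llama3.1:70b"    # deep-reasoning workhorse, dual 3090s
    if has_image:
        return VISION_MODEL        # only the Gemma stack can see
    # Cheap heuristic: long or explicitly analytical prompts get the 70B.
    if len(prompt.split()) > 200 or "step by step" in prompt.lower():
        return TEXT_MODEL
    return VISION_MODEL
```

Open WebUI's per-chat model picker covers the manual version of this; the function above is what the automated version boils down to.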
If you’re integrating with a home automation stack, our Home Assistant integration guide shows how to pipe local models into your smart home—Gemma 3 4B is a strong candidate for the always-on low-latency classifier role, while the 27B handles vision-heavy occasional queries like "what’s this package on the porch?"
How to run it today
Quickstart via Ollama:

```
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b
ollama run gemma3:27b
```
For vision, pass an image via the Ollama API or drop it into Open WebUI's attachment UI—the model automatically detects multimodal input.
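For the API route, Ollama's `/api/chat` endpoint accepts base64-encoded images alongside the message text. A minimal sketch of building that request, assuming the default local server at `localhost:11434`:

```python
# Sketch of a vision request payload for a local Ollama server (default
# endpoint http://localhost:11434/api/chat). The /api/chat schema takes
# base64 image strings in the message's "images" field.
import base64
import json

def build_vision_request(prompt: str, image_bytes: bytes,
                         model: str = "gemma3:27b") -> dict:
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            # Ollama expects raw base64 strings, no data: URI prefix.
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
    }

req = build_vision_request("What's in this screenshot?", b"<png bytes here>")
body = json.dumps(req)  # POST this with any HTTP client
```

POST the JSON body to `http://localhost:11434/api/chat` and the response's `message.content` holds the model's description of the image.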
Hugging Face source: google/gemma-3-27b-it for the instruction-tuned 27B. You’ll need to accept the Gemma license on HF to pull directly. GGUF quants from community maintainers (bartowski, unsloth) tend to appear within a day of release. LM Studio should have the model indexed today; see our LM Studio vs Ollama vs llama.cpp comparison for runner selection. For ComfyUI users who want vision-to-prompt workflows, our ComfyUI pleb primer will see a Gemma 3 workflow update shortly.
What comes next
Google’s cadence on Gemma has settled into roughly one major generation per year, with minor point releases in between. Gemma 2’s 2→9→27B tier jump was significant; Gemma 3’s addition of vision is a capability-class change. The obvious next frontier for the Gemma family is audio—Gemini supports it, and it’s reasonable to expect a future Gemma generation will too. Google’s open-source positioning has been inconsistent historically, but the Gemma 3 release is a clear signal that they’re treating open weights as a strategic product rather than a concession.
For plebs, the message is simple. Gemma 3 27B is now the default vision-language model on a single RTX 3090. Pull the weights, spin up Ollama, and build your own multimodal workflows. Sovereignty includes your eyes now, not just your words. Run your own inference on your own hardware in your own Hashcenter—and if you’re converting retired ASIC hardware to AI work, Gemma 3 27B is a strong workload candidate for the vision portion of a mixed stack. Nobody else needs to see your screenshots.
Benchmark history
Last benchmarked: March 12, 2025
| Benchmark | Score | Source | Measured |
|---|---|---|---|
| MATH | 50 | vendor_blog ✓ | March 12, 2025 |
| GPQA | 24.3 | vendor_blog ✓ | March 12, 2025 |
| HumanEval | 48.8 | vendor_blog ✓ | March 12, 2025 |
| MMLU | 78.6 | vendor_blog ✓ | March 12, 2025 |
Recommended hardware
Runs well on 24 GB VRAM (3090 / 4090) at Q4–Q5. A used 3090 is the pleb pick.
Get it running
1. Install Ollama → Ten-minute local LLM runtime. One binary, zero cloud.
2. Give it a web UI → Open-WebUI turns Ollama into a self-hosted ChatGPT.
3. Understand quantization → GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.
Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.
