
Superseded

Gemma 2

Google · Gemma family · Released June 2024

Google DeepMind's June 2024 lightweight open model family — 2B, 9B, and 27B with interleaved local/global attention.

Model card

Developer: Google
Family: Gemma
License: Gemma Terms
Modality: text
Parameters (B): 2, 9, 27
Context window: 8,192
Release date: June 2024
Primary languages: en
Hugging Face: google/gemma-2-9b-it
Ollama: ollama pull gemma2

Gemma 2 lands: Google’s open model grows up, with a 27B flagship that actually competes

Google just released Gemma 2 — the next generation of its open-weights model line — with both 9B and 27B parameter variants available today on Hugging Face, Kaggle, and Vertex AI. The first Gemma release in February 2024 was a respectable 2B/7B debut, but it never broke through the Mistral 7B or Llama 2 stranglehold on the small-open class. Gemma 2 is Google’s serious second attempt — and at 27B, it’s the largest open model Google has ever shipped.

Announced at Google I/O in May and dropped today, Gemma 2 is Google’s pitch that the Gemini research line can ship real open weights, not just gate-kept API models. The 27B sits in a sparsely populated tier: bigger than Llama 3 8B, smaller than Llama 3 70B, and it’s pitched directly at local-inference plebs who want more than 8B can deliver but don’t have dual-3090 VRAM to spare. Below: what’s under the hood, what the benchmarks look like at launch, and where Gemma 2 fits in a sovereign stack.

What’s in the weights

Gemma 2 shares its architecture DNA with Google’s closed Gemini line — the release blog is explicit that it “uses the same research and technology as Gemini.” What Google can say publicly, and did today: decoder-only transformer, interleaved local/global attention (local 4K windows alternating with global 8K), logit soft-capping for training stability, grouped-query attention, and RoPE positional encoding. The lineage: Transformer (2017) → the broader Gemini research line at DeepMind → Gemma 1 (February 2024) → Gemma 2 today.

Key specs:

  • Two sizes: 9B and 27B parameters (a 2B variant is promised for a later drop)
  • Context window: 8,192 tokens — modest by 2024 standards, though the same 8K that Llama 3 shipped with
  • Training data: 13T tokens for the 27B, 8T tokens for the 9B — heavy on English web, code, and math
  • Tokenizer: 256K vocabulary (large, optimized for multilingual though English is primary)
  • Distillation: the 9B was trained with knowledge distillation from a larger (unreleased) teacher model — same trick used elsewhere in the Gemini line
  • License: Gemma Terms of Use — permissive for commercial use, with standard safety and attribution clauses
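Of the specs above, logit soft-capping is the easiest to show concretely: instead of letting logits grow without bound, Gemma 2 squashes them through a scaled tanh. A minimal sketch, using the cap values stated for Gemma 2 (50.0 for attention logits, 30.0 for final logits):

```python
import math

def soft_cap(logit: float, cap: float) -> float:
    """Squash a logit into (-cap, cap) via a scaled tanh,
    as Gemma 2 does for training stability."""
    return cap * math.tanh(logit / cap)

# Small logits pass through almost unchanged; runaway logits saturate.
print(soft_cap(10.0, 50.0))   # ~9.87 (barely touched)
print(soft_cap(500.0, 50.0))  # ~50.0 (clamped at the cap)
```

At the published values this is nearly a no-op in the normal operating range and only bites when logits spike, which is exactly the training-stability point.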

The interleaved local/global attention is the architectural signature here. It’s a practical compromise: global 8K attention on every layer is expensive, but pure-local attention loses long-range context. Alternating the two keeps long-range mixing alive while halving the attention compute budget for a given context length. It’s a trick Google has refined in their internal research and is now landing in open weights.
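A toy way to see the compute tradeoff: on a local layer each token attends over at most a 4K sliding window, while on a global layer it attends over its full causal prefix. The sketch below is illustrative only; mapping local layers to even indices is an assumption for the demo, not something Google has specified.

```python
def attended_positions(layer: int, pos: int, local_window: int = 4096) -> range:
    """Which earlier positions token `pos` can attend to on a given layer,
    under an interleaved scheme: alternating layers use a sliding local
    window, the others attend globally (causal, toy version)."""
    if layer % 2 == 0:  # local layer: 4K sliding window
        start = max(0, pos - local_window + 1)
    else:               # global layer: full causal attention
        start = 0
    return range(start, pos + 1)

# Token at position 6000: a local layer sees only the last 4096 positions,
# while the adjacent global layer sees all 6001.
print(len(attended_positions(0, 6000)))  # 4096
print(len(attended_positions(1, 6000)))  # 6001
```

At a full 8K prompt, the local layers do roughly half the attention work of the global ones, which is where the halved-compute-budget intuition comes from.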

Benchmarks at release

From Google’s release blog and HF model card, published today:

  • Gemma 2 27B on the lmsys Chatbot Arena leaderboard (preliminary): matching or edging out Llama 3 70B in human preference voting — a notable result, since the 27B is less than half the size.
  • MMLU (5-shot): 27B at 75.2, 9B at 71.3 — 27B lands between Llama 3 8B (68.4) and Llama 3 70B (82.0).
  • MATH: 27B at 42.3, 9B at 36.6 — competitive with Llama 3 70B-class.
  • HumanEval (code): 27B at 51.8, 9B at 40.2 — solid but not class-leading; Llama 3 70B still ahead on code.
  • GSM8K (grade school math): 27B at 74.0, 9B at 68.6 — strong relative to size.
  • BBH (Big-Bench Hard): 27B at 74.9, 9B at 68.2.

The headline number for plebs is the Chatbot Arena result. If a 27B open model really does trade preference votes evenly with Llama 3 70B, that’s a capability-per-parameter win that matters at home-rig scale. Expect the Open LLM Leaderboard to confirm or correct in the next few days.

Sovereign pleb implications

Gemma 2 lands in a VRAM sweet spot that’s been underserved. Here’s the practical math:

  • Gemma 2 9B at Q4_K_M: about 5.5GB. Runs comfortably on any GPU with 8GB VRAM — a 3060 Ti, a 4060, even a used 2070. This is the fast daily-driver tier.
  • Gemma 2 27B at Q4_K_M: about 17GB. Fits on a single used RTX 3090 (24GB) or 4090 with room to spare for 8K context. This is the interesting tier.
  • Gemma 2 27B at Q5_K_M: about 19GB. Still single-3090 territory, noticeably sharper quality.
  • Gemma 2 27B at Q8: about 29GB. Needs offload on a 24GB card, or runs clean on a 48GB dual-card setup.
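The sizes above follow from simple arithmetic: parameter count times effective bits per weight. A rough sketch; the bits-per-weight figures are approximations for llama.cpp K-quants, and real files land a little higher once metadata is included:

```python
def gguf_size_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate GGUF weight size in GiB: parameters x effective bits per weight."""
    return params_b * 1e9 * bits_per_weight / 8 / 1024**3

# Approximate effective bits per weight for common llama.cpp quants.
for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"Gemma 2 27B @ {name}: ~{gguf_size_gib(27, bpw):.1f} GiB")
```

Add roughly 1–2 GB on top for the 8K KV cache and runtime overhead when deciding what actually fits on a card.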

See the GGUF quantization guide for the quality/size tradeoffs. On 27B-class models, Q5_K_M is usually the right pick if VRAM allows — the quality jump from Q4 to Q5 is noticeable on reasoning tasks, and the size delta is modest.

What this replaces in the daily stack: for plebs on a single-GPU rig who were running Llama 3 8B for speed and feeling the ceiling, Gemma 2 27B is the natural step up — it fits on the same card at Q4, and it’s meaningfully smarter. For plebs running dual 3090s specifically to host Llama 3 70B, Gemma 2 27B on a single card frees up the second card for embeddings, image generation, or a second model in parallel (Open WebUI supports this cleanly).

The Hashcenter-pivot crowd will care about inference density: at 27B, you can pack more concurrent sessions onto a single A100 80GB than you can with 70B-class models, which changes the economics of serving open models to a small user base. For inference heaters, a 27B under sustained load on a single 3090 gets you a 350W heat source that’s doing useful work — a reasonable office-heating profile.
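The serving-density point is mostly a KV-cache calculation: each concurrent session pins its own cache in VRAM. A back-of-envelope sketch; the shape figures used below (46 layers, 16 KV heads, head dim 128) are the commonly cited Gemma 2 27B config and should be treated as assumptions, and sliding-window layers would actually cache less, so this is an upper bound:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """Per-session KV cache in GiB at fp16 (the leading 2 is K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

# Assumed Gemma 2 27B shapes: 46 layers, 16 KV heads, head_dim 128, 8K context.
per_session = kv_cache_gib(46, 16, 128, 8192)
print(f"Full-context 27B session: ~{per_session:.1f} GiB of KV cache")
```

On an A100 80GB with quantized 27B weights loaded, that leaves headroom for on the order of twenty full-context sessions; a 70B-class model eats most of the card in weights alone.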

The 8K context window is the catch. Most plebs won’t notice it for chat or coding, but for long-document RAG workflows, longer-context open models keep the edge.

How to run it today

Gemma 2 is on the Ollama registry at release:

ollama pull gemma2:9b
ollama pull gemma2:27b

New to Ollama? Our 10-minute install guide covers setup. Pair with Open WebUI for a clean local chat interface.
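Beyond the CLI, the pulled model is reachable over Ollama’s local HTTP API (default port 11434), which is what Open WebUI and most scripts talk to. A minimal request body for the /api/generate endpoint; the model tag and prompt are just examples:

```python
import json

# Request body for POST http://localhost:11434/api/generate
payload = {
    "model": "gemma2:27b",
    "prompt": "Explain interleaved local/global attention in two sentences.",
    "stream": False,  # return one JSON object instead of a token stream
}
print(json.dumps(payload, indent=2))
```

Send it with curl or any HTTP client; the response JSON carries the completion in its `response` field.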

LM Studio has Gemma 2 GGUFs available through its built-in Hugging Face browser — look for the official Google quants or the lmstudio-community rebuild. The fp16 weights are on Google’s HF org for anyone building custom quantizations or fine-tunes. Hit issues? The self-hosted AI troubleshooting guide covers the common GPU and VRAM snags.

What comes next

Google pre-announced a 2B Gemma 2 variant for a later release — that one will be worth watching as an edge / Raspberry Pi-class model. No timeline was given for instruction-tuned variants beyond what shipped today, but the community will have fine-tunes up on Hugging Face within the week: coder tunes, roleplay tunes, regional-language tunes.

Bigger picture: with Gemma 2, Google is committing to the open-weights arena more seriously than the 7B-class Gemma 1 release suggested. An Apache-adjacent license, a competitive 27B flagship, and same-day availability on Ollama is a different posture than “we released a small model for research.” That’s good for sovereign plebs — more credible frontier options at home-rig scale means more competition to push quality up and quant footprints down. Pull it, run it, own your stack. See the Sovereign AI Manifesto for the case, and the pleb’s guide to self-hosted AI for the setup.

Further reading: The same pleb-grade infrastructure that runs local inference also runs a Bitcoin space heater. Many readers arrive from the mining side — see From S19 to Your First AI Hashcenter for the bridge.

Benchmark history

Last benchmarked: June 27, 2024

MATH: 42.3 — vendor blog, measured June 27, 2024
HumanEval: 51.8 — vendor blog, measured June 27, 2024
MMLU: 75.2 — vendor blog, measured June 27, 2024

Recommended hardware

Runs well on 24 GB VRAM (3090 / 4090) at Q4–Q5. A used 3090 is the pleb pick.

Buying guide: used RTX 3090 for LLMs (2026) →

Get it running

  1. Install Ollama →

    Ten-minute local LLM runtime. One binary, zero cloud.

  2. Give it a web UI →

    Open-WebUI turns Ollama into a self-hosted ChatGPT.

  3. Understand quantization →

    GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.

Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.