
Llama 3.2

Meta · Llama family · Released September 2024

Meta's September 2024 Llama release added edge sizes (1B/3B) and the first open-weight Llama vision models (11B/90B).

Model card

Developer: Meta
Family: Llama
License: Llama 3.2 Community
Modality: text + vision
Parameters (B): 1, 3, 11, 90
Context window: 128,000 tokens
Release date: September 2024
Primary languages: en, fr, de, es, it, pt, hi, th
Hugging Face: meta-llama/Llama-3.2-3B-Instruct
Ollama: ollama pull llama3.2

Llama 3.2 ships today: vision lands in Llama, and 1B/3B edge models go local

Meta just announced Llama 3.2 at Meta Connect 2024, and it’s two releases packed into one. First, Llama gets vision: two multimodal models (11B and 90B) that accept text and image inputs, built on top of the existing 3.1 text backbones with adapter weights. Second, and maybe more important for the pleb rig, two new tiny text-only models (1B and 3B) designed to run on phones, Raspberry Pis, and anything else with a handful of gigabytes of RAM. Weights are out today on Meta’s release blog and Hugging Face.

This is a smaller release than Llama 3.1 back in July — no new flagship, no 405B follow-up — but it’s a strategically important one. Meta is saying, with weights in hand today: vision is table-stakes for open models, and sub-3B models aren’t toys anymore. Both claims deserve to be tested on a home rig today. Below: what’s in the release, what the benchmarks look like at launch, and what Llama 3.2 changes in a sovereign pleb’s daily AI stack.

What’s in the weights

Llama 3.2 is built on the same transformer lineage as its predecessors — Transformer (2017) → LLaMA 1 (2023) → Llama 2 → Llama 3 → Llama 3.1 → Llama 3.2 today. The vision models inherit text weights from Llama 3.1 8B and 70B respectively, with a separately trained vision adapter layer that’s been RLHF-tuned for multimodal instruction-following. Meta’s calling this approach “adapter-based” vision integration — they kept the text weights frozen during vision training, so Llama 3.2 11B-Vision is functionally identical to 3.1 8B on pure text tasks.

The vision models (11B and 90B)

  • 11B-Vision: built on Llama 3.1 8B + vision adapter (~3B adapter params)
  • 90B-Vision: built on Llama 3.1 70B + vision adapter
  • Image encoder is a CLIP-style ViT trained on image-text pairs
  • 128K text context, single-image input at inference
  • Not available in the EU at launch, per Meta’s regulatory note
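
Since all four models ship on Ollama, the vision models can be driven over Ollama's local HTTP API. A minimal sketch of the request body for a single-image turn — the endpoint and field names follow Ollama's documented `/api/chat` schema; `build_vision_request` is our own helper name:

```python
import base64
import json

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_vision_request(prompt: str, image_bytes: bytes,
                         model: str = "llama3.2-vision:11b") -> dict:
    """Build the JSON body for a single-image chat turn.

    Ollama takes images as base64 strings in the message's "images" list;
    Llama 3.2 Vision accepts a single image per turn at inference.
    """
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
    }

# Placeholder bytes stand in for a real PNG/JPEG read from disk
payload = build_vision_request("What does this chart show?", b"\x89PNG...")
print(json.dumps(payload)[:48])
```

POST that body to `OLLAMA_URL` with curl or any HTTP client and read `message.content` from the response.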

The edge models (1B and 3B)

  • 1B: pruned and distilled from Llama 3.1 8B, designed for on-device inference
  • 3B: same recipe, larger target — fits comfortably in 2–3GB of RAM at Q4
  • Both support 128K context (same as the big models)
  • Trained with knowledge distillation from 3.1 8B and 70B as teachers
  • Targeted at phones, laptops, embedded devices, edge inference
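
The "2–3GB of RAM at Q4" figure is easy to sanity-check: a GGUF file weighs roughly parameters × bits-per-weight / 8 bytes, plus runtime and KV-cache overhead on top. A back-of-envelope sketch — the effective bits-per-weight values are typical for mixed quants, not exact:

```python
def gguf_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough on-disk size of a GGUF quant in GB: params (billions) * bpw / 8."""
    return params_b * bits_per_weight / 8

# Typical effective bits per weight (mixed quants keep some tensors at higher precision)
for name, bpw in [("Q4_K_M", 4.85), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"3B @ {name}: ~{gguf_size_gb(3.2, bpw):.1f} GB")
```

Q4 lands around 1.9GB on disk for the 3B; add the runtime and a few thousand tokens of KV cache and you're in the 2–3GB window the spec sheet claims.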

The 1B and 3B models are where Meta’s doing the most interesting work. They used structured pruning on the 8B backbone and then distilled using output logits from Llama 3.1 8B and 70B as teachers — so the small models get training signal from models up to 70x their size. This is the same “distill a giant into a tiny” trick that made Gemma 2 2B punch above its weight; now Meta is doing it in the Llama family.
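
The distillation recipe is conceptually simple: instead of training the small model only on one-hot next-token labels, you also pull its output distribution toward the teacher's softened distribution. Meta hasn't published its exact loss; this is a stdlib sketch of the standard Hinton-style version (temperature-scaled KL divergence) for one token position:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that already matches its teacher incurs zero loss
print(distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # → 0.0
```

In training this term is mixed with ordinary cross-entropy on the ground-truth tokens; the temperature softens the teacher's distribution so that low-probability tokens still carry signal.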

Benchmarks at release

Numbers from Meta’s release blog:

  • Llama 3.2 3B vs Gemma 2 2B vs Phi 3.5-mini: Meta claims 3B beats both on MMLU, ARC, GSM8K, and instruction-following — a clear lead in the sub-5B open class at release.
  • Llama 3.2 1B: positioned as “on-device flagship” — Meta’s numbers put it competitive with much larger models on tool use and summarization, which are the workloads that actually matter on a phone.
  • Vision 11B: Meta benchmarks against Claude 3 Haiku and GPT-4o-mini on image understanding tasks (MMMU, ChartQA, DocVQA). 11B is competitive with Haiku, slightly behind GPT-4o-mini at release.
  • Vision 90B: goes head-to-head with GPT-4o-mini and Claude 3 Haiku on multimodal reasoning, with Meta claiming a lead on chart and diagram understanding.

As always at release, these are the creator’s chosen benchmarks on the creator’s chosen suites. The Open LLM Leaderboard will rank 3B against Gemma and Qwen in the next few days, and the LMSYS Chatbot Arena will sort out the vision models’ real preference rankings against GPT-4o-mini over the next month.

Sovereign pleb implications

This release is a gift to two very different pleb rigs.

The edge tier (RPi / laptop / phone). Llama 3.2 1B in Q4 is about 800MB on disk. Q8 is ~1.3GB. Either runs comfortably on a Raspberry Pi 5 with 8GB RAM, a modest laptop CPU, or a mid-range phone. For plebs building sovereign always-on assistants — the kind you’d wire into Home Assistant or Obsidian for local note-taking and home automation — 1B and 3B are finally in the “good enough to be useful” tier. That’s a genuine change from six months ago, when the sub-3B open class was mostly a toy.

The GPU tier (3090 / 4090 / dual card). Llama 3.2 11B-Vision at Q4 is about 7GB — fits on a single 3090 with tons of headroom for context. This is the first Llama model where a pleb on a used 3090 rig can do full multimodal chat locally with competitive quality. The 90B-Vision model needs dual 3090s (48GB VRAM) at Q4, same as Llama 3.1 70B — so if you were already running 3.1 70B, you get vision at the same hardware cost. Check our GGUF quantization guide for the tradeoffs between Q4_K_M and Q5_K_M at each model size.
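
Picking a quant is mostly a budget calculation: take the card's VRAM, reserve headroom for KV cache and runtime, and choose the largest quant whose weights still fit. A rough helper under the usual size rule of thumb (params × bits-per-weight / 8; the 4GB headroom default is our assumption — tune it for your context length):

```python
# Common GGUF quants, largest first, with typical effective bits per weight
QUANTS = [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.85)]

def pick_quant(params_b: float, vram_gb: float, headroom_gb: float = 4.0):
    """Largest quant whose weights fit after reserving headroom for
    KV cache, activations, and the runtime itself."""
    budget_gb = vram_gb - headroom_gb
    for name, bits_per_weight in QUANTS:
        if params_b * bits_per_weight / 8 <= budget_gb:
            return name
    return None  # doesn't fit on this card; shard across GPUs or go smaller

print(pick_quant(11, 24))  # 11B-Vision on a 24GB 3090
print(pick_quant(90, 24))  # 90B-Vision won't fit a single card
```

On a 24GB card the 11B even fits at Q8; drop to Q4_K_M when you want maximum context headroom.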

What this replaces: for plebs who had been using LLaVA or CogVLM for local vision, Llama 3.2 11B-Vision is a straightforward upgrade — same rough VRAM footprint, better benchmark quality, same license family. For plebs running Llama 3.1 8B as their fast daily driver, stay put: 3.2 didn’t replace the 8B text-only flagship. For plebs who wanted local vision but couldn’t spare the VRAM, 11B-Vision at Q4 is the new entry ticket.

For the Hashcenter crowd, the 1B/3B models are interesting from an inference-density angle: a single 24GB GPU can serve dozens of concurrent 1B sessions, which makes edge-of-network deployment at scale much more viable than it was yesterday.
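
That density claim is mostly a KV-cache calculation, since the weights load once and are shared across sessions. A back-of-envelope sketch — the architecture numbers (16 layers, 8 KV heads of dim 64) match the published Llama 3.2 1B config; the weight and overhead figures are our assumptions:

```python
def kv_bytes_per_token(layers=16, kv_heads=8, head_dim=64, dtype_bytes=2):
    """Per-token KV cache: one K and one V tensor per layer, fp16 by default."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_sessions(vram_gb=24.0, weights_gb=2.5, overhead_gb=1.5, ctx_tokens=4096):
    """Concurrent full-context sessions that fit after loading weights once."""
    per_session_gb = kv_bytes_per_token() * ctx_tokens / 1e9
    return int((vram_gb - weights_gb - overhead_gb) / per_session_gb)

print(max_sessions())  # well over a hundred 4K-context 1B sessions on 24GB
```

At 32KB of KV cache per token, a 4K-context session costs roughly 134MB — which is why "dozens of concurrent sessions" is, if anything, conservative.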

How to run it today

All four Llama 3.2 models are live on the Ollama registry at release:

ollama pull llama3.2:1b
ollama pull llama3.2:3b
ollama pull llama3.2-vision:11b
ollama pull llama3.2-vision:90b

New to Ollama? Our 10-minute install guide walks through setup end-to-end. For the vision models, pair with Open WebUI to get a clean image-upload interface — Open WebUI added Llama 3.2 vision support in its latest release.

For edge deployment (RPi, laptops), llama.cpp is the cleanest path — GGUF quants of the 1B and 3B models are already showing up on Hugging Face from the community. The official fp16 weights are on the Meta Llama org if you want to build your own quants.

What comes next

Llama 3.2 landed without a new flagship text-only model — no 3.2 70B, no 3.2 405B. That suggests Meta is holding those tiers for a bigger release later. The next thing to watch: how the community re-quantizes and fine-tunes the 1B and 3B, and whether the vision models see instruction-tuned variants that close the gap on GPT-4o-mini. Vision adapter weights are small enough that creative fine-tunes (medical imaging, document-specific, niche domains) should appear on Hugging Face within days.

For sovereign plebs, the headline is simple: the “you need a Hashcenter for multimodal AI” argument just got weaker. An 11B vision model runs on a single used 3090 today. A capable 3B chat model runs on a Raspberry Pi today. Own the hardware, pull the weights, run the stack. See the Sovereign AI Manifesto for the full case, and the pleb’s guide to self-hosted AI for the next step.

Further reading: The same pleb-grade infrastructure that runs local inference also runs a Bitcoin space heater. Many readers arrive from the mining side — see From S19 to Your First AI Hashcenter for the bridge.

Recommended hardware

For Llama 3.2, a single used 3090 covers everything but the biggest model: the 1B/3B edge models run on almost anything, 11B-Vision fits one 24GB card, and 90B-Vision is dual-GPU or cloud territory.

Buying guide: used RTX 3090 for LLMs (2026) →

Get it running

  1. Install Ollama →

    Ten-minute local LLM runtime. One binary, zero cloud.

  2. Give it a web UI →

    Open-WebUI turns Ollama into a self-hosted ChatGPT.

  3. Understand quantization →

    GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.
