You already know how to run a model. The hard part in 2026 is choosing which one. The open-weight scene moved from a curiosity to a genuine arsenal: DeepSeek, Qwen, Mistral, Gemma and Llama all ship weights you can download, inspect, and run on your own silicon with no account, no telemetry, and no kill switch. For a Bitcoiner, that is the whole point. You don’t custody your own keys and then rent your reasoning from a corporation that logs every prompt. Owning your weights is one more layer decentralized — your data, your compute, your algorithms, end to end.
This is the decision layer above the runner. If you haven’t picked your tooling yet, start with install Ollama and run your first local LLM in 10 minutes or compare engines in LM Studio vs Ollama vs llama.cpp. This guide answers the next question: given your VRAM and what you actually want to do, which family of open weights deserves the download? We’ll sort by openness, size class, and task fit — not by hype-cycle benchmark numbers, which age badly and rarely survive contact with your own hardware.
What “open weight” actually means (and why it matters to a pleb)
“Open weight” is not the same as “open source.” Most of these models publish the trained parameters — the weights — under a license that lets you download and run them locally. Far fewer publish the full training data, training code, and pipeline that would let you reproduce them from scratch. For self-hosting, weights are what you need: once they’re on your disk, the model runs offline, forever, regardless of what the vendor does next.
But the license still matters. Some weights are released under standard permissive licenses (Apache 2.0, MIT) that impose almost no conditions. Others ship under custom community licenses with use restrictions, acceptable-use policies, or clauses that kick in above a certain user count. None of that affects a pleb running a model on a home box, but it shapes how “yours” the model truly is. If sovereignty is the goal, a model you can run, fork the surrounding tooling for, and never phone home with is the floor — and all five families below clear it.
One honest caveat up front: we are deliberately not quoting leaderboard scores. Benchmark rankings shift monthly, vendors optimize for the tests, and the only benchmark that matters is whether the model is good at your task on your machine. Sort by license, size class, and task fit first. Then download two finalists and try them on your real prompts.
The five families, in plain terms
Each of these is a family, not a single model — they ship in multiple parameter sizes so you can match the model to your hardware. Here’s how to think about each one.
- Llama (Meta) — The default starting point for most plebs. The most mature ecosystem: nearly every runner, fine-tune, and tutorial assumes Llama-family weights exist. Released under a community license rather than a fully permissive one, but trivially runnable at home. If you want the path of least resistance and the largest pile of community fine-tunes, start here.
- Qwen (Alibaba) — Strong all-rounder with an unusually wide range of sizes, from tiny models that fit on a phone to large ones that need serious VRAM. Notably good multilingual coverage and solid coding variants. Several sizes ship under Apache 2.0, which is about as permissive as it gets. A frequent favorite when people want one family that scales across all their devices.
- DeepSeek — Made its name with reasoning-focused and coding-focused releases. The largest DeepSeek models are far too big for a home GPU, but the family also ships smaller distilled variants that bring some of that reasoning flavor down to consumer hardware. Reach for DeepSeek when you specifically want step-by-step reasoning or code, and pick a size that actually fits your box.
- Mistral — The lean European option. Mistral built its reputation on small, fast, capable models that punch above their parameter count — including releases under Apache 2.0. Excellent when VRAM is tight and you want fast tokens without feeling like you crippled the model. A natural pick for an 8GB card or a node that does other jobs too.
- Gemma (Google) — Google’s open-weight line, designed to run well on modest hardware and offered in genuinely small sizes alongside larger ones. Good general-purpose quality with a focus on safety and efficiency. A sensible default for a low-power always-on box that needs to be polite and dependable rather than maximally capable.
Match the model to your VRAM, not your ego
The single biggest factor in what runs well is how much the model weighs once loaded — and that depends on its parameter count and its quantization. Quantization shrinks the weights so they fit in less memory; the trade is a small quality hit. If those terms are new, read our pleb’s guide to GGUF, Q4, Q8 and fp16 — it’s the companion to this post and explains exactly how a 4-bit quant lets a bigger model squeeze onto a smaller card.
The rough mental model: a quantized model needs roughly its parameter-count-in-billions worth of gigabytes of VRAM at 4-bit, plus headroom for context. So a 7-8B model at Q4 lives comfortably on 8GB; a 13-14B wants more breathing room; and 30B-plus is workstation or multi-GPU territory. Treat the table below as a starting map, then confirm on your own hardware — system RAM offload, context length, and your runner all move the goalposts.
| Your hardware | Practical size class | Sensible families to try first | Best for |
|---|---|---|---|
| 8GB VRAM (e.g. older mid-range card) | ~3B-8B at Q4 | Mistral, Gemma (small), Qwen (small), Llama (small) | Chat, summarizing, drafting, light coding |
| 12-16GB VRAM | ~8B-14B at Q4-Q5 | Qwen, Llama, DeepSeek distill, Mistral | General assistant, decent coding, reasoning |
| 24GB VRAM (e.g. used RTX 3090) | ~14B-32B at Q4 | Qwen, DeepSeek distill, Llama, Mistral | Stronger reasoning/code, long context, agents |
| Multi-GPU / workstation | 32B-70B+ | Llama (large), Qwen (large), DeepSeek (large) | Heavy reasoning, near-frontier local quality |
| CPU-only / SBC | ~1B-3B at Q4 | Gemma (tiny), Qwen (tiny), Llama (tiny) | Always-on helpers, log-reading, light tasks |
The used 24GB card remains the sweet spot for serious home inference — see why in the used RTX 3090 for LLMs in 2026. It opens up the 14B-32B band where local models start to feel genuinely useful rather than merely impressive.
Pick by what you actually do
Hardware sets the ceiling; your task picks the family within it. A few honest rules of thumb:
- You want a general daily assistant. Start with Llama or Qwen at the largest size your card handles. The mature ecosystem means more fine-tunes and fewer rough edges.
- You’re coding or want step-by-step reasoning. Try a DeepSeek reasoning/coding variant or a Qwen coder at your size class. Reasoning-tuned models think out loud, which costs tokens but pays off on logic-heavy work.
- VRAM is tight and tokens-per-second matters. Mistral and small Gemma are the lean champions — fast, capable, and they leave headroom for the model to share a box that’s also running your node or babysitting your rigs.
- You need an always-on, low-power helper. A tiny Gemma or Qwen on a small box can read miner logs, summarize alerts, or run a home-automation hook without a power-hungry GPU spinning all day.
- You work across languages. Qwen’s multilingual coverage is a frequent reason plebs outside the anglosphere reach for it first.
Whatever you pick, the running-it-locally part is its own discipline. Ollama is the easy part — owning the power and heat underneath is the sovereign part, and it’s where the Bitcoin-mining mindset transfers directly. You already know how to think about watts, cooling, and uptime.
Why this is a Bitcoin-mining company’s fight too
It’s a fair question why a Laval ASIC shop has opinions on language models. The answer is that the two worlds run on the same principle: don’t trust a custodian to do for you what you can do for yourself. We build firmware — DCENT_OS, an open-source firmware aimed at industrial Antminer hardware, built in Rust and standing on the shoulders of Braiins OS+, VNish and LuxOS — precisely so miners can own the code that runs their machines instead of renting it back with a mandatory dev fee. Self-hosting an open-weight model is the same move in a different domain: own the weights, own the inference, own the data trail. It’s one more layer decentralized.
If you’re building the full picture — sats, hashpower, and now compute under one roof — our sovereign AI hub and the broader sovereignty section map the rest of the stack. And if you want a concrete first project, wiring a local model into your node is a great one: run an agent on your Bitcoin node instead of renting a VPS.
A simple selection workflow
Don’t overthink it. Here’s the whole process:
- Know your VRAM. That sets your size class from the table above.
- Name your main task. Daily chat, coding, reasoning, multilingual, or always-on helper.
- Shortlist two families. One safe default (Llama or Qwen) and one specialist (DeepSeek for reasoning/code, Mistral for lean speed, Gemma for low power).
- Download both at the right quant. Q4 is the usual home default; go higher if you have headroom, lower only if you must.
- Test on your real prompts. Not benchmarks — your actual work. Keep the one that’s better for you and delete the other.
That’s it. The beauty of open weights is that switching costs nothing but disk space and a download. You’re never locked in.
Frequently asked questions
What’s the best local LLM for 8GB of VRAM?
Stay in the ~3B-8B class at 4-bit quantization. A small Mistral, a small Gemma, or a small Qwen or Llama all run comfortably and handle chat, summarizing, and light coding well. Download two and keep whichever is better at your actual tasks — none of them will fight your card.
Is open weight the same as open source?
No. Open weight means the trained parameters are published so you can download and run the model locally. Open source would additionally mean the training data, code, and pipeline are public enough to reproduce it from scratch — which is rarer. For self-hosting you mainly care about the weights and the license terms; all five families here are runnable at home offline.
DeepSeek vs Qwen vs Llama — which should I pick?
For a general daily assistant, Llama or Qwen are the safe defaults thanks to mature ecosystems. For reasoning or coding, a DeepSeek variant (in a size that fits your hardware) is worth trying. Qwen also stands out for multilingual work and its wide range of sizes. Pick based on task and VRAM, then test both on your own prompts.
Do I need a GPU at all?
Not necessarily. Tiny 1B-3B models in the Gemma, Qwen, or Llama families run on CPU or a small board well enough for always-on helpers — reading logs, summarizing, simple automation. You trade speed and capability for low power and no GPU, which is often exactly the right trade for a background assistant.
The bottom line
There is no single “best local LLM” in 2026 — there’s the best one for your VRAM and your task, and it changes as new weights drop. The durable skill is the selection process: match size to hardware, family to task, download two finalists, and judge them on your own work instead of someone’s leaderboard. Every model here is yours to keep the moment it’s on your disk.
That ethos — own the thing that runs your stack — is exactly what we’re building in firmware. If you run Antminer hardware and want open-source firmware with a 0% mandatory-dev-fee target, join the DCENT_OS closed beta waitlist. Same fight, different layer: your keys, your weights, your code.



