
Used RTX 3090 for LLMs in 2026: Still King?

D-Central Technologies · 13 min read


TL;DR — For the pleb running LLMs under 70B parameters at Q4–Q5 quants, a used RTX 3090 at ~$600–$800 remains the best $/VRAM play on the consumer market in 2026. The 4090 wins raw tok/s by ~30–50% on smaller models; the 5090 dominates flagship workloads but costs roughly 3× a used 3090. Below Llama 3.1 70B, the 3090 is still king. Above it, you want a pair of 3090s, a 4090, or — if the budget’s there — a 5090. Credit where it’s due: NVIDIA’s Ampere architecture is holding up six years later, and the open-source inference stack (llama.cpp, Ollama, vLLM) has squeezed more out of 24 GB than anyone expected.

The question lands in our inbox every week. A pleb has a shed, 240V service he ran for an S19, a head full of Bitcoin mining instincts, and now he wants to run his own LLMs. The self-hosted AI world is drowning in “buy an H100” advice that’s useless to someone who paid $2,400 for a miner and considered that a splurge. So here’s the honest answer, built for the pleb who already knows watts, amps, and thermals — just not tokens per second.

We’ll work through it the same way you’d approach a used hashboard: specs, comparable alternatives, a buyer’s checklist, and when to walk away.

The 3090’s unfair advantage — 24 GB VRAM

If you’re new to LLM hardware, forget everything you know about “faster = better.” For inference, VRAM is king. Think of it like the RAM on your Bitcoin node: if your chainstate doesn’t fit in memory, your node crawls. If your LLM weights don’t fit in VRAM, the model either won’t load at all, or it spills into system RAM and slows to a crawl that makes you question your life choices.

Modern open-source models are measured in billions of parameters. Each parameter, unquantized (FP16), takes 2 bytes. Llama 3.1 70B at FP16 wants ~140 GB of memory. No consumer GPU has that. So we quantize — compress the weights down to 4 or 5 bits per parameter — and suddenly that 70B model fits in ~40 GB. Still doesn’t fit on one 3090. But on two? Comfortable. More on that in a minute.

The 3090 specs that still matter:

| Spec | RTX 3090 |
| --- | --- |
| VRAM | 24 GB GDDR6X |
| Memory bandwidth | 936 GB/s |
| Memory bus | 384-bit |
| CUDA cores | 10,496 |
| Tensor cores | 328 (3rd gen) |
| TDP | 350W |
| Architecture | Ampere (GA102) |
| PCIe | 4.0 x16 |
| Release | September 2020 |

The headline is that 24 GB. The runner-up is 936 GB/s of bandwidth — inference speed on large models is memory-bandwidth-bound far more than it is compute-bound, and the 3090’s bandwidth beats every consumer card NVIDIA shipped before the 4090, with the sole exception of its own 3090 Ti sibling (1,008 GB/s). The third thing to notice: Ampere has full bf16 support and native INT8 tensor cores, which means llama.cpp and vLLM can actually use the silicon. Pascal-era cards (P40, P100) can’t say that.
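A back-of-envelope way to see why bandwidth is the wall: each generated token streams roughly the entire quantized weight file out of VRAM once, so bandwidth divided by model size puts a hard ceiling on tok/s. A rough sketch, where the ~4.9 GB weight-file size for an 8B at Q4_K_M is our approximation:

# Decode ceiling: tok/s ≈ memory bandwidth / bytes read per token.
# Assumes the full weight file is streamed from VRAM once per token.
awk 'BEGIN {
  bw_gbs  = 936    # 3090 memory bandwidth, GB/s
  size_gb = 4.9    # Llama 3.1 8B at Q4_K_M, approx weights in GB (assumed)
  printf "theoretical ceiling: ~%.0f tok/s\n", bw_gbs / size_gb
}'
# Prints ~191 tok/s. Measured 90–120 tok/s sits at roughly half the roof,
# which is typical once compute, KV-cache reads, and overhead join the party.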

On quantization tradeoffs — a quick pleb primer so the rest of this article makes sense:

  • FP16 / BF16: full precision. 2 bytes per parameter. Best quality, biggest VRAM hit.
  • Q8: 8-bit quantization. ~1 byte per parameter. Nearly indistinguishable from FP16 in most tests.
  • Q5_K_M: 5-bit, the community sweet spot. ~0.6 bytes/param. Quality loss is small.
  • Q4_K_M: 4-bit. ~0.5 bytes/param. The default for most local setups. Noticeable but tolerable quality drop.
  • Q3 and below: getting into “it’s running, but it’s dumber” territory.

A Llama 3.1 70B at Q4_K_M needs ~43 GB total (weights + KV cache + overhead). Two 3090s handle it. One 3090 barely squeezes it at Q2 with a tiny context window, and at that point you’ve degraded the model so much it’s not really Llama 70B anymore — it’s a lobotomized version of it.
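If you want to run that arithmetic for any model yourself, here’s a minimal sketch using the bytes-per-parameter figures from the list above. The 20% padding for KV cache and runtime buffers is our rough assumption, not a spec:

# Ballpark VRAM: params (billions) x bytes/param, plus ~20% for
# KV cache and runtime overhead. Estimate only.
estimate_vram() {
  awk -v p="$1" -v bpp="$2" 'BEGIN { printf "~%.0f GB\n", p * bpp * 1.2 }'
}
estimate_vram 70 0.5    # Llama 3.1 70B at Q4_K_M -> ~42 GB
estimate_vram 8  0.5    # Llama 3.1 8B  at Q4_K_M -> ~5 GB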

For the full rundown on how GGUF quantization works and why Q4_K_M became the default, see Quantization Explained: GGUF Q4, Q8, FP16.

What the 3090 actually runs

Numbers below are typical community-benchmark ranges on a single 3090 unless noted. They’ll vary with prompt length, batch size, runner (llama.cpp vs vLLM vs Ollama), KV cache settings, and whether you’ve tweaked --n-gpu-layers. Treat them as ballpark, not gospel.

| Model | Params | Quant | VRAM used | Tok/s (3090) | Notes |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | 8B | Q4_K_M | ~6 GB | 90–120 | Snappy. Leaves plenty of VRAM for big context. |
| Llama 3.1 8B | 8B | Q8 | ~9 GB | 70–90 | Quality pick for an 8B. |
| Llama 3.1 70B | 70B | Q4_K_M | ~43 GB | 15–22 (dual 3090) | The classic dual-3090 use case. |
| Llama 3.1 70B | 70B | Q3_K_S | ~31 GB | 10–14 (dual 3090) | Fits, but Q4 is the better floor. |
| Gemma 3 27B | 27B | Q5_K_M | ~20 GB | 28–38 | Google’s open-weights champ fits comfortably. |
| Qwen 2.5 32B | 32B | Q4_K_M | ~20 GB | 25–34 | Alibaba’s coder-friendly model. |
| DeepSeek R1 Distill 32B | 32B | Q4_K_M | ~21 GB | 22–30 | Reasoning model; tok/s drops with long thinking traces. |
| Phi-4 | 14B | Q5_K_M | ~11 GB | 55–75 | Microsoft’s small-but-mighty. |
| Mistral Small 3 | 24B | Q4_K_M | ~15 GB | 35–48 | Low-latency generalist. |
| SDXL | image | FP16 | ~10 GB | ~2 img/s (1024×1024, 30 steps) | Stability AI’s workhorse. |
| FLUX.1-dev | image | FP8 | ~16 GB | ~1 img per 8–10 s | Black Forest Labs’ flagship; fits at FP8. |
| Whisper Large v3 | ASR | FP16 | ~4 GB | ~25× realtime | OpenAI’s speech-to-text; transcribe hours in minutes. |

A few things jump out. First: an 8B model flies. If you’re building a coding assistant that just needs to answer fast, you’re already over-specced with a 3090 and wouldn’t notice an upgrade to a 4090. Second: the 24 GB ceiling is exactly high enough to fit Gemma 3 27B and Qwen 32B at useful quants, which is why those models have become the default “one card, serious work” picks. Third: FLUX.1-dev at FP8 fits. That used to be a dream.

Credit where it’s due on this whole table: llama.cpp (Georgi Gerganov and 900+ contributors) is what makes these numbers achievable on a consumer card. Ollama layered a clean UX on top. vLLM (Berkeley’s Sky Computing Lab) is what you want for multi-user serving. None of this was possible in 2022. The 3090 aged well because the software aged well around it.

New to any of these tools? Start with Install Ollama in 10 Minutes and Open WebUI: ChatGPT Experience, Self-Hosted.
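For a taste before the full guides, a first session is just two commands. A minimal sketch, assuming a stock Ollama install and its model library’s llama3.1:8b tag:

# Pull a small model and talk to it — fastest way to confirm the card works.
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain proof-of-work in two sentences."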

3090 vs 4090 vs 5090 — head-to-head

Honest comparison. No trash talk — all three are good cards for different budgets.

| Spec | RTX 3090 (used) | RTX 4090 (used/new) | RTX 5090 (new) |
| --- | --- | --- | --- |
| Typical price (2026) | $600–$850 | $1,400–$1,800 | $2,200–$2,800 |
| VRAM | 24 GB GDDR6X | 24 GB GDDR6X | 32 GB GDDR7 |
| Memory bandwidth | 936 GB/s | 1,008 GB/s | 1,792 GB/s |
| Tensor cores | 3rd gen (Ampere) | 4th gen (Ada, FP8) | 5th gen (Blackwell, FP4) |
| TDP | 350W | 450W | 575W |
| Tok/s: Llama 3.1 8B Q4 | 90–120 | 130–180 | 200–280 |
| Tok/s: Llama 3.1 70B Q4 | 15–22 (dual) | 22–32 (dual) | 28–40 (single; 32 GB just fits at Q3) |
| $/GB VRAM | ~$30 | ~$65 | ~$78 |
| $/tok/s (8B Q4) | ~$6 | ~$10 | ~$10 |

Honest read:

  • The 4090 is ~30–50% faster on smaller models. If you’re doing high-volume 8B serving, or you value raw latency for a single-user coding assistant, it’s the better card. You pay double for it.
  • The 5090 is the only one of the three that fits Llama 3.1 70B at any quant on a single card, and it has FP4 support, which matters for cutting-edge models like FLUX.1 and newer diffusion work. It’s also nearly triple the price of a used 3090.
  • The used 3090 is untouchable on $/VRAM. If your workload is “I want to run a 30B-class model at Q5 or a 70B at Q4 with two cards,” you’re paying less per usable gigabyte than with any other option NVIDIA sells.

There’s no loser here. There’s only “what’s your actual workload, and what are you paying per watt of electricity?”
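The $/GB row is just price over VRAM, so recompute it with whatever prices you actually see listed the week you read this. Using the midpoints of the bands above (which land within a couple dollars of the table’s figures):

# $/GB of VRAM at the midpoint of each price band above
awk 'BEGIN {
  printf "3090: ~$%.0f/GB\n", 725  / 24   # about $30
  printf "4090: ~$%.0f/GB\n", 1600 / 24   # about $67
  printf "5090: ~$%.0f/GB\n", 2500 / 32   # about $78
}'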

3090 vs data-center pulls — P40, P100, A4000, A5000

This is where Bitcoin pleb instincts pay off. You know how to buy used ASICs with confidence — the same mindset applies to ex-enterprise GPUs. Pulls from decommissioned racks are a legitimate budget path, with caveats.

| Card | VRAM | Arch | TDP | Typical used price | Form factor | Tok/s note |
| --- | --- | --- | --- | --- | --- | --- |
| Tesla P40 | 24 GB GDDR5 | Pascal | 250W | $150–$250 | Passive blower (needs fan mod) | ~30–40% of 3090 on the same model |
| Tesla P100 | 16 GB HBM2 | Pascal | 250W | $120–$180 | Passive blower | Decent fp16, limited by 16 GB ceiling |
| RTX A4000 | 16 GB GDDR6 | Ampere | 140W | $550–$700 | Single-slot active blower | ~60% of 3090 |
| RTX A5000 | 24 GB GDDR6 | Ampere | 230W | $800–$1,100 | Dual-slot active blower | ~85% of 3090 |

P40 — the poverty-tier 24 GB card. It runs llama.cpp and Ollama just fine. But: no bf16, no FP8, no tensor-core acceleration in most inference paths, passive cooling that needs a squirrel-cage fan strapped to the shroud. It’s a project. The r/LocalLLaMA sidebar has a whole section on the fan mod. If you’re already comfortable dealing with heatsinks and 240V shed wiring, a P40 for $200 is a legitimate way to dip your toe in. Don’t expect it to keep up with a 3090.

P100 — oddball. 16 GB HBM2 gives it unreasonable memory bandwidth for the era, so it’s not slow. But 16 GB caps you at 13B models comfortably. It’s a specialty pick for people who want cheap fp16 training experiments.

A4000 — workstation Ampere in a single-slot blower. If you’re rack-mounting in a quiet office, this card is the sensible choice. 16 GB is a real ceiling — no Gemma 27B, no 32B models at useful quants.

A5000 — direct 3090 competitor. Same 24 GB, same Ampere generation, blower cooler, 230W TDP, ECC memory. For a Hashcenter-style rack setup where noise and airflow matter, the A5000 is genuinely the better card. You pay $200–$300 more and give up ~15% tok/s for enterprise-grade cooling and reliability.

Recommendation matrix:

| You are… | Pick |
| --- | --- |
| Budget-constrained, handy with fans and risers | P40 (24 GB for $200) |
| Office/rack, noise-sensitive | A5000 (24 GB blower) |
| Shed with 240V, open-air frame, want max perf/$ | 3090 (still king) |
| Need 16 GB max and a dead-quiet single slot | A4000 |
| Want to tinker with fp16 training on a budget | P100 |

Buying a used 3090 — pleb checklist

Used GPU markets in 2026 are decent, but you’ve got to buy the way you’d buy a used S19. Here’s the full checklist.

Where to source:

  • eBay — widest selection; favor sellers with return policies and “local pickup available” (fewer shipping-damage cases).
  • Facebook Marketplace / Craigslist / Kijiji — best prices; test before paying, and bring a laptop with a reference image and nvidia-smi on a live USB (field check sketched below).
  • r/hardwareswap — community reputation system, generally honest. Read seller confirmation threads.
  • Local classifieds — gold for hands-on inspection.
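On the live-USB point: the whole field check is one nvidia-smi query. Run it before money changes hands and compare against a known-good 3090 (these query fields are standard nvidia-smi options):

# Identity and health in one shot; a real 3090 reports 24576 MiB total
nvidia-smi --query-gpu=name,memory.total,temperature.gpu,power.draw,clocks.sm \
           --format=csv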

Avoid unless priced accordingly:

  • Ex-mining cards with no maintenance proof. Not because mining destroys cards (the myth is overblown — 24/7 at steady load is often gentler than gaming cycles) but because cards that ran in dusty environments with no filter maintenance can have caked-on dust in the VRAM heatsinks and dried-out thermal pads. If the seller shows you a clean teardown photo and a repaste receipt, it’s fine. If they’re cagey, walk.

Red flags on inspection:

  • Rattling bearings — spin the fans by hand. Any grinding or wobble = fan replacement ($20 + an afternoon).
  • Corroded contacts — PCIe connector pins should be bright gold. Green or dark spots mean humidity damage.
  • Yellow/dark thermal paste — old paste that’s been heat-cycled to death. Budget a repaste.
  • Bent PCIe fingers — the card may still run, or it may never seat properly in your board.
  • Missing backplate screws — card’s been opened. Ask why. Repasted? Pad-reflowed? Could be fine, could be a salvage job.
  • Founders Edition with rattle near the VRM — known issue with some FE cards; thermal pad degradation on memory modules.

Burn-in before trusting:

Don’t skip this. Before you commit to building a rig around a used card, prove it’s stable for at least four hours under real load.

# 1. Quick sanity — nvidia-smi should show the card at expected clocks
nvidia-smi

# 2. Memory stress — checks VRAM for errors
# Use gpu_burn or memtest_vulkan
./gpu_burn 3600   # 1 hour
# or
memtest_vulkan   # full VRAM sweep

# 3. Real inference load — the workload that matters
ollama run llama3.1:70b-instruct-q4_K_M --verbose
# then in another terminal, loop prompts for an hour

# 4. llama-bench for repeatable numbers
llama-bench -m /path/to/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -p 512 -n 128 -r 10
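The “loop prompts” step above can be as crude as the sketch below. Model tag and prompt are placeholders — use whatever you actually plan to run — with a logger in a second terminal:

# Terminal A: sustained real inference (adjust count to taste)
for i in $(seq 1 120); do
  ollama run llama3.1:8b "Summarize the Bitcoin whitepaper in one paragraph." > /dev/null
done

# Terminal B: log temps, power, and utilization every 5 seconds
nvidia-smi --query-gpu=temperature.gpu,power.draw,utilization.gpu \
           --format=csv -l 5 | tee burnin.csv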

Watch for:

  • Sustained temperature above 85°C.
  • Power draw spiking above 370W (the card should cap at 350W).
  • Tok/s dropping over time (thermal throttling).
  • Memory errors in dmesg.
  • Visual artifacts in the output text (extremely rare, but possible on damaged VRAM).

Warranty status (2026):

  • EVGA: exited the GPU business in 2022; those warranties are dead.
  • Gigabyte: most 3-year warranties expired 2023–2024. Check card date code.
  • MSI: same story.
  • ASUS ROG: some extended warranties still valid if you’re the original owner with a receipt. Rare on the used market.
  • Founders Edition: NVIDIA’s direct warranty is long gone.

Bottom line: price it as zero-warranty hardware. That’s the correct mental model.

What to pay in 2026:

  • Rough / ex-mining, no accessories: $550–$650
  • Clean condition, original box, no paperwork: $700–$800
  • Mint, boxed, low-hour, original owner with receipt: $800–$900
  • 3090 Ti (same gen, faster, NVLink-capable): add $150–$250 to the above

Don’t pay retail 2020 prices. Don’t lowball a clean card into the ground either. The market found its level.

Building a multi-3090 Hashcenter

Here’s the pleb dream build — and if you’re already running mining gear, you have 80% of the infrastructure.

The 4× 3090 open-air rig:

  • 96 GB combined VRAM — comfortable Llama 3.1 70B at Q5, DeepSeek V3 distills, Qwen 72B at Q4, Command R+ at Q4, frontier-class fine-tuning with QLoRA.
  • 1,400W total GPU draw at full tilt — you’re in 240V territory, which you already have (breaker math sketched below).
  • Roughly the footprint of two S19s on an open-air aluminum frame.
  • Same cooling instincts — airflow, ambient, intake/exhaust balance. A shed that handled mining handles this.
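Breaker math, same instinct as sizing for a miner. A quick sketch — the 300W allowance for CPU, board, and PSU losses is our assumption, not a measured figure:

# Continuous loads should sit at or under 80% of breaker rating
awk 'BEGIN {
  watts = 1400 + 300            # GPUs at full tilt + assumed system overhead
  amps  = watts / 240
  printf "draw: %.1f A; minimum breaker (80%% rule): %.0f A\n", amps, amps / 0.8
}'
# ~7.1 A continuous: any 240V circuit that fed an S19 has headroom to spare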

Hardware shopping list:

  • Motherboard: a used server board with 4+ x16 PCIe slots. ASUS WS C621E, Supermicro X11-series, or an EPYC board like the ASRock Rack ROMED8-2T. $300–$600 used on eBay. Pleb tip: watch for “decommissioned lab” listings.
  • PCIe risers: rated 4.0 x16 risers, not the cheap mining risers (those are 1.0 x1 for a reason — they worked for ASICs because ASICs don’t need bandwidth). LinkUp or C-Payne branded cables. $60–$120 per card.
  • Power: a server PSU with a breakout board (HP/Dell 1200–1600W PSUs from eBay for $40–$80) per pair of cards, or two high-quality ATX PSUs (Seasonic Prime, Corsair AX) with a dual-PSU sync cable.
  • Frame: aluminum mining frame, 6-GPU size ($80). You’ll use 4 slots and have room for airflow.
  • CPU + RAM: modest. A used Xeon with 64 GB ECC is plenty. Inference barely touches the CPU.

Software stack:

  • Ollama: handles multi-GPU automatically in recent versions (0.4+). Set OLLAMA_NUM_GPU or let it auto-detect. Easiest option.
  • llama.cpp: use --tensor-split 1,1,1,1 to shard across 4 cards evenly, or weight the split toward your primary card if you want to reserve one for other workloads.
  • vLLM: for multi-user serving, vLLM with tensor parallelism (--tensor-parallel-size 4) is the right answer. Higher throughput than llama.cpp for concurrent requests.
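Concretely, the launch lines for the last two options look roughly like this. Model paths and names are placeholders, and flags should be checked against --help on whatever versions you’re running:

# llama.cpp HTTP server, weights split evenly across 4 cards
./llama-server -m /models/llama-3.1-70b-q4_k_m.gguf \
  --tensor-split 1,1,1,1 -ngl 99 --port 8080

# vLLM OpenAI-compatible server, tensor parallel across all 4 cards
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 4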

For the full shed-to-server conversion playbook — wiring, noise, heat recapture, the works — see From S19 to Your First AI Hashcenter. And if you’re curious about using the waste heat productively, Heating With Inference covers the math.

When NOT to pick a 3090

The honest section. The 3090 isn’t right for everyone.

You need >24 GB single-card VRAM. If your target workload is Llama 3.1 70B at Q6+ on a single GPU, or frontier-class full-parameter fine-tuning, or batch inference with huge KV caches, you want more memory per card. That means:

  • RTX 5090 (32 GB) — consumer tier, frontier-capable
  • RTX 6000 Ada (48 GB) — workstation, $5,000+ used
  • H100 80GB / MI300X 192GB — data-center only, Hashcenter-only budgets

You need NVLink. NVIDIA dropped consumer NVLink after Ampere; the 4090 and 5090 don’t carry it. The 3090 and 3090 Ti both support NVLink via a bridge, but the bridges themselves are rare and overpriced on the used market. For inference workloads, NVLink mostly doesn’t matter (tensor-parallel over PCIe 4.0 x16 is fine). For training with large activation memory, it can matter — but if you’re doing serious training, you’re past the consumer tier anyway.

You need rack form factor with a blower. The 3090 is a 3-slot open-air axial-fan card. In a closed rack, it will cook itself. Pick an A5000 or A4000 for rack work. The A5000 is genuinely the right card here; we said it earlier and we’ll say it again.

You need sub-10W idle power. A Mac Studio M3 Ultra or M4 Max with 96–192 GB of unified memory idles at a few watts and can run surprisingly large models — the unified-memory architecture is legitimately a different design philosophy, and Apple deserves credit for it. You give up tok/s (Apple’s memory bandwidth is lower than a 3090’s) and ecosystem (CUDA still dominates), but for a pleb who runs an LLM a few times a day and wants the electric bill to stay invisible, it’s a real answer. Not a 3090 replacement for heavy users, but worth knowing about.

You’re buying brand new in 2026. The 3090 is strictly a used-market play now. If you’re walking into a store with a fresh budget, the 4090 is the better new buy — more efficient, FP8 support, warranty intact. The 3090 earns its slot on price and availability, not on being the newest or fastest card.

Closing

For Llama 3.1 70B and everything below it — which is 95% of what a self-hosting pleb actually runs — the used RTX 3090 at $600–$800 is still the benchmark. Credit to NVIDIA for building Ampere to last, and credit to the open-source inference crew (llama.cpp, Ollama, vLLM, and the GGUF quantization community) for turning a 2020 gaming card into the 2026 self-hosted AI standard.

The dual-3090 rig is the sovereign AI sweet spot: 48 GB of combined VRAM, roughly 1,300W at the wall when both are working, and comfortable coverage of every meaningful open-source LLM workload from coding assistants to 70B-class reasoning models. If you’ve got the shed, the breaker, and a working knowledge of PCIe risers, you’ve got the bones of a self-hosted AI Hashcenter that will serve you for years.

For the why-this-matters, read the Sovereign AI for Bitcoiners Manifesto. For the practical getting-started path, the Pleb’s Guide to Self-Hosted AI walks through the full stack. And when you’re comparing runners for your new rig, LM Studio vs Ollama vs llama.cpp is the next stop.

Verdict stands: in 2026, the used RTX 3090 is still the pleb’s king of local LLMs. Long may it reign.

