Skip to content

Bitcoin accepted at checkout  |  Ships from Laval, QC, Canada  |  Expert support since 2016

The short answer: A Gemma 4 E4B QAT needs roughly 8 GB of VRAM; Qwen3‑27B at Q4 needs ~18 GB; Llama 4 Scout at INT4 needs ~55 GB; and DeepSeek V4 Pro at INT4 requires a 6–8×H100 GPU cluster — it is not a single-workstation model. Match your model tier to hardware before you buy; the table below maps every major open-weight model to a real, purchasable Canadian configuration.

Running an AI model locally is no longer exotic — it is the practical choice for any Canadian organization that cannot route sensitive data through US cloud infrastructure. Whether you are a solo developer experimenting with Gemma or an enterprise team that needs DeepSeek‑class reasoning on a private GPU cluster, the critical variable is always the same: does your hardware have enough memory to hold the model?

This guide maps today’s most-requested open-weight models to verified hardware tiers, from a compact desktop box to an enterprise GPU hashcenter. All VRAM figures are approximate and vary with context window length, batch size, and quantization tool; treat them as planning minimums, not hard ceilings. We flag every external figure with its source so you can verify independently.

D-Central Technologies builds and ships this hardware to Canadian customers. We stand on the shoulders of the open-source projects — llama.cpp, Ollama, vLLM, Unsloth, GGUF — that made local inference practical, and we credit them throughout.

Model-tier table: VRAM requirements and hardware match

The table below shows approximate VRAM requirements for each model at its most practical quantization level for local inference, plus the D-Central hardware tier that covers that footprint. “Approximate VRAM” is the weight footprint only; add 10–30% headroom for KV cache and runtime overhead at normal context lengths.

Model Quant Approx. VRAM
(weights only)
Inference framework D-Central hardware tier Use case
Gemma 4 E4B QAT
Google DeepMind, 2025
QAT/INT4 ~5–6 GB
min. 8 GB recommended
Ollama Pleb AI Box Personal assistant, edge deployment, laptop-class tasks
Qwen3‑27B
Alibaba Cloud, 2025
Q4 (GGUF) ~16.8 GB
~18 GB total w/ overhead
Ollama · llama.cpp Sovereign AI Workstation 24 Developer productivity, light RAG, small team (1–5 users)
Llama 4 Scout
Meta, 2025 — 109B total, 17B active (MoE)
INT4 ~55 GB
tight at 48 GB; comfortable at 64 GB+
Ollama · vLLM Workstation 48 / Apple‑128 (tight on 48 GB variant)
or Hashcenter AI Node 80+
Advanced reasoning, coding assistance, medium team (5–20 users)
NVIDIA DGX Spark
GB10 Grace Blackwell, 128 GB unified
Multiple 128 GB unified
runs Llama 4 Scout INT4 (~55 GB) and Qwen3‑35B Q8 (~37 GB)
llama.cpp · vLLM NVIDIA DGX Spark Power user / small team, high quality at desktop scale
DeepSeek V4 Pro
DeepSeek, 2025 — MoE, ~1 T+ total parameters
INT4 ~430–470 GB
⚠ NOT a single-workstation model. All expert weights must be loaded; active-parameter count does not reduce memory footprint.
vLLM (expert parallelism) Hashcenter AI consulting — 6–8×H100 cluster design Enterprise reasoning, code generation at scale, sovereign data centres

Sources: Gemma 4 QAT figures — Google DeepMind QAT post, Unsloth Gemma 4 QAT docs. Qwen3‑27B Q4 — Will It Run AI 2026. Llama 4 Scout INT4 — APXML system-requirements guide. DGX Spark — NVIDIA official DGX Spark page. DeepSeek V4 Pro INT4 — Codersera V4 VRAM guide 2026. All figures are approximate; actual VRAM usage increases with context window length and concurrent batch size.

How many concurrent users? Sizing the right tier

VRAM covers the model weight; concurrency determines whether you need Ollama’s simplicity or vLLM’s throughput engine. Use the decision tree below to route to the right tier.

Start here: how many people will send queries at the same time?

1 user (developer / personal)
→ Any tier. Ollama on a Pleb AI Box or Workstation 24 is sufficient. Focus on model quality, not throughput.

2–5 users (small team)
→ Workstation 24 (Qwen3‑27B or smaller) or Workstation 48/Apple‑128 (Llama 4 Scout). Ollama works; consider vLLM if latency matters.

6–20 users (department / SMB)
→ Workstation 48, Apple‑128, or Hashcenter AI Node 80+. Switch to vLLM. At 20 concurrent users, vLLM delivers approximately 793 tokens/s aggregate throughput versus Ollama’s ~41 tokens/s — roughly 19× more headroom before queues back up (quantizelab.dev 2026 benchmark series; figures vary by model and GPU).

20+ users or production API (enterprise)
→ Hashcenter AI Node 80+ (single node, smaller models) or a multi-node H100 cluster (large models like DeepSeek V4 Pro). vLLM is mandatory. Contact us for a sizing consultation.

You want DeepSeek V4 Pro specifically
Read the cluster-class warning below before purchasing any single-node hardware.

Why concurrent users matter more than raw VRAM

A single-user query on Qwen3‑27B is fast on almost any hardware. The constraint appears when five people hit the model simultaneously: Ollama queues requests serially and degrades predictably past 5–10 concurrent sessions. vLLM’s continuous batching and PagedAttention KV-cache scheduler were built specifically to maximize GPU utilization under concurrent load. If your team is larger than five people, plan the inference stack before you plan the hardware.

Inference framework: Ollama vs vLLM — which one for your tier?

Both Ollama and vLLM are open-source, run on Linux, and serve GGUF or HuggingFace-format models. The decision is almost entirely about concurrency.

Factor Ollama vLLM
Setup complexity Low — single binary, ollama run model Medium — Python env, CUDA, config
Single-user latency Excellent Excellent
Throughput at 20+ users ~41 TPS (degrades, queue backlog) ~793 TPS (continuous batching scales)
Multi-GPU / tensor parallel Limited Native (required for cluster models)
Model format support GGUF primary HuggingFace, AWQ, GPTQ, GGUF
OpenAI-compatible API Yes Yes
Best for Pleb AI Box, Workstation 24, development Workstation 48+, Hashcenter AI Node, cluster

Throughput benchmark source: quantizelab.dev 2026 vLLM vs Ollama benchmark guide. Figures measured under concurrent user load; single-user performance gap is much narrower. Benchmark hardware and model size affect absolute numbers — verify independently for your specific model.

Practical recommendation

NVIDIA DGX Spark — 128 GB unified memory, desktop form factor

The DGX Spark deserves its own section because its 128 GB of unified LPDDR5x memory (shared coherently between the GB10 Blackwell GPU and the Grace CPU) changes what “desktop inference” means. Unlike a discrete GPU workstation — where model weights compete with KV cache inside a fixed VRAM envelope — the DGX Spark’s unified pool allows the OS to manage overflow gracefully, at the cost of some bandwidth versus true HBM. NVIDIA cites up to 273 GB/s bandwidth and 1 petaFLOP at FP4 precision (source: NVIDIA official DGX Spark product page).

What the DGX Spark runs comfortably

What the DGX Spark does not run

DeepSeek V4 Pro at INT4 requires approximately 430–470 GB of memory (weights alone) — more than three times the DGX Spark’s 128 GB. Two DGX Sparks networked together via NVLink or Ethernet provide 256 GB total, which is still insufficient. DeepSeek V4 Pro is a cluster-class model; see the section below.

View the NVIDIA DGX Spark (D-Central) →

DeepSeek V4 Pro — cluster-class model, not a workstation

Important sizing warning: DeepSeek V4 Pro is a Mixture-of-Experts model with over one trillion total parameters. Unlike dense models where active parameters approximate total load, MoE architectures require all expert weights to be resident in memory because the router selects different experts per token unpredictably. At INT4 quantization, this translates to approximately 430–470 GB of GPU memory — before KV cache or runtime overhead. No single-node workstation available today accommodates this. Plan for a 6–8×H100 80 GB cluster (480–640 GB total VRAM) as the minimum viable deployment.

The data-jurisdiction question

DeepSeek’s model weights are released under the MIT licence and can be run entirely on-premises. However, using DeepSeek’s cloud API — even for testing — routes your data through servers in the People’s Republic of China. A MIT licence does not change the data-jurisdiction reality of the API endpoint. If your compliance requirements prohibit cross-border data transfer to non-allied jurisdictions (Quebec Law 25, federal privacy principles, financial services mandates), run the weights on your own hardware or not at all.

When DeepSeek V4 Pro makes sense

For organizations that need frontier-class reasoning sovereignty — legal, financial, defence-adjacent — DeepSeek V4 Pro on a sovereign multi-GPU cluster is a compelling option. The weights are open; the compute is yours. D-Central designs these hashcenter deployments as part of the AI Sovereignty Consulting service.

D-Central Sovereign AI hardware — the full line

Every tier below is a build-to-order Canadian configuration. Lead times reflect hand-built quality; we do not stock mass-market inventory. Prices are available on request — contact us for a quote.

Pleb AI Box

Consumer GPU, 8–16 GB VRAM

Best for: Gemma 4 E4B, Qwen3‑7B, Mistral 7B — personal sovereignty tier.

Sovereign AI Workstation 24

24 GB VRAM workstation

Best for: Qwen3‑27B Q4, Llama 3.1 8B, Gemma 4 27B — small team up to 5 users.

Sovereign AI Workstation 48 / Apple‑128

48 GB GPU or Apple 128 GB unified

Best for: Llama 4 Scout INT4 (tight on 48 GB), Qwen3‑32B, Mistral Large — team of 5–15.

Hashcenter AI Node 80+

80 GB+ GPU (H100 class)

Best for: Llama 4 Scout INT4 with headroom, Qwen3‑235B at Q4, multi-user production API.

NVIDIA DGX Spark

128 GB unified (GB10 Grace Blackwell)

Best for: Llama 4 Scout INT4, Qwen3‑35B Q8 — desktop AI supercomputer tier.

Hashcenter AI Cluster

6–8×H100 and above

Required for: DeepSeek V4 Pro, Llama 3.1 405B BF16, Qwen3‑235B FP16 — enterprise consulting engagement.

Browse all Sovereign AI hardware →

Why Canadian organizations should run models locally

The 2026 US export-control directives restricting access to frontier AI APIs for foreign nationals underlined what privacy lawyers have been saying for years: a commercial AI contract is only as durable as the counterparty government allows. Canadian organizations relying exclusively on US cloud AI face three structural risks:

Running your models locally eliminates all three risks. The hardware cost is finite; the sovereignty is permanent. See Local LLM Canada and AI consulting Quebec for the full Canadian context, and Distributed Compute for decentralized infrastructure options.

Frequently asked questions

What VRAM do I need to run Llama 4 Scout locally?

Llama 4 Scout at INT4 quantization requires approximately 55 GB of VRAM for the model weights alone. Add 10–20% headroom for KV cache and runtime overhead, meaning a practical minimum of 64 GB. The D-Central Sovereign AI Workstation 48’s 48 GB variant can run it with careful context-window management (tight); the Apple‑128 variant (128 GB unified) runs it comfortably. The Hashcenter AI Node 80+ (80 GB H100-class GPU) provides the most headroom for single-node deployment. All VRAM figures are approximate and vary with batch size and context length.

Can DeepSeek V4 Pro run on a single GPU workstation?

No. DeepSeek V4 Pro is a Mixture-of-Experts model with over one trillion total parameters. At INT4 quantization, the model weights require approximately 430–470 GB of GPU memory — before KV cache or overhead. The largest single-GPU workstations available (H100 SXM, 80 GB) provide 80 GB per card. Even an 8×H100 node (640 GB) requires careful memory management at INT4 to accommodate production context windows. DeepSeek V4 Pro is a cluster-class deployment; there is no single-workstation configuration that runs it reliably at production throughput.

What is the difference between Ollama and vLLM for local AI?

Both are open-source inference servers with an OpenAI-compatible API. Ollama prioritizes ease of use — a single binary, a simple model library, and excellent developer experience for 1–5 users. vLLM prioritizes throughput under concurrency: its continuous-batching scheduler and PagedAttention KV-cache manager allow it to serve dozens of simultaneous requests efficiently. At 20+ concurrent users, benchmarks show vLLM delivering roughly 19× more aggregate throughput than Ollama on comparable hardware (source: quantizelab.dev 2026 benchmark series). For personal and small-team use, start with Ollama. For team or production deployments, plan for vLLM from the start.

What can the NVIDIA DGX Spark run?

The DGX Spark features 128 GB of unified LPDDR5x memory shared between its GB10 Blackwell GPU and Grace CPU. It comfortably runs Llama 4 Scout at INT4 (approximately 55 GB weights) and Qwen3‑35B at Q8 (approximately 37 GB weights), with memory to spare for context and overhead. It cannot run DeepSeek V4 Pro (430–470 GB at INT4) — that requires a multi-node H100 cluster. Specs per NVIDIA’s official product page; throughput figures depend on the inference stack and model configuration.

Is DeepSeek safe to use if it is MIT licensed?

The DeepSeek model weights are MIT-licensed, meaning you can download and run them on your own hardware with no usage restrictions. The licensing question is separate from the data-jurisdiction question: if you use the DeepSeek cloud API rather than self-hosting the weights, your queries are routed to servers in the People’s Republic of China. For organizations subject to Quebec Law 25, federal privacy requirements, or financial-sector compliance mandates, the API should be off limits regardless of the licence. Self-hosting the weights on a local or Canadian-datacenter deployment eliminates this concern entirely.

How do I choose between the Workstation 48 and the Hashcenter AI Node 80+?

The primary difference is VRAM envelope and the models it unlocks. The Sovereign AI Workstation 48 has a 48 GB GPU (plus an Apple‑128 variant with 128 GB unified memory); the Hashcenter AI Node 80+ carries an 80 GB H100-class GPU. If your target model is Llama 4 Scout INT4 (~55 GB weights), the Workstation 48 GPU variant requires context-window discipline to avoid overflow, while the Node 80+ runs it comfortably with headroom for KV cache. If you anticipate upgrading to larger models (Qwen3‑72B, Llama 3.1 70B at higher precision) within 12–18 months, the Node 80+ is the more future-proof choice. Contact us for a sizing consultation if you are uncertain.

Not sure which tier you need?

D-Central’s AI Sovereignty Consulting team works with Canadian organizations to size on-premises AI infrastructure correctly — from the model choice through the inference stack to the network and power requirements. Engagements are quoted individually; no auto-generated estimates.

Request a hardware sizing consultation →

Related resources