Run DeepSeek Locally in Canada: Models, VRAM, and Sovereignty Guide
DeepSeek is a Chinese AI research lab — an unlikely contributor to Canadian digital sovereignty. And yet, because DeepSeek releases its model weights under permissive open licences, it has become one of the most practically useful families of locally runnable AI models available today. That is worth saying plainly: credit where credit is due. The open-source and open-weight AI community — including DeepSeek, the Ollama project, the llama.cpp team, unsloth, and dozens of community quantizers — built the infrastructure that makes private, local AI inference genuinely accessible. D-Central builds hardware and consults on deployments; the software that runs on that hardware was built by this community.
This page covers the practical facts: which DeepSeek models are worth running locally, exactly how much VRAM each variant needs, and the commands that get you from zero to inference with Ollama or llama.cpp. The sovereignty framing — why this matters for Canadian businesses and why running locally is categorically different from calling the DeepSeek API — is woven throughout.
Which DeepSeek models can you actually run locally?
DeepSeek has released several generations of open-weight models. Not all of them are equally practical to run on hardware you own in Canada — the full flagship models require server-grade infrastructure. Here is an honest mapping of what is available as of June 2026:
The R1 distilled series — your practical starting point
DeepSeek-R1 is a reasoning model: it thinks through problems step by step before answering, making it notably stronger on logic, maths, and coding than a standard instruction model of similar size. The full R1 model is 671 billion parameters (Mixture of Experts architecture), but DeepSeek also released a set of distilled versions — dense models at 7B, 8B, 14B, 32B, and 70B — trained to replicate R1’s reasoning style on much more accessible hardware.
These distilled models are built on well-tested base architectures (Llama and Qwen), which means Ollama, llama.cpp, and LM Studio all support them natively with no extra configuration. In May 2026, DeepSeek released DeepSeek-R1-0528, a significant update to the full R1 that raised AIME 2025 maths benchmark accuracy from 70% to 87.5% and reduced hallucination rates. Distilled versions of the 0528 update are available on Hugging Face (see unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF), though community quantisations are still catching up with the main model — check Hugging Face for the latest GGUF builds from bartowski and unsloth.
DeepSeek-V3 — the full general model
DeepSeek-V3 (671B parameters, MoE architecture, ~37B active per token) is a general instruction model — not a reasoning specialist. It is on par with frontier general assistants. Running the full model locally requires roughly 376 GB of memory across all quantisation levels — that is multi-server territory. In practice, most local deployments use the distilled R1 variants instead.
DeepSeek-V4 Pro and Flash — the 2026 frontier (April 2026 release)
On April 24, 2026, DeepSeek released two new open-weight models: V4-Pro (1.6 trillion total parameters, MoE) and V4-Flash (284 billion parameters). Both are MIT-licensed with weights on Hugging Face. V4-Flash is the one worth considering for self-hosted deployments: at aggressive quantisation it can run in approximately 33 GB of VRAM (two RTX 4090s or one RTX 6000 Ada), rising to roughly 80 GB at FP8 on an H100. V4-Pro at 1.6T parameters requires infrastructure that is not practical for most organisations. Both models support a 1-million-token context via API; local context size depends on your hardware. Community GGUF builds are emerging — verify on Hugging Face before deploying, as this is a recent release and quantisation quality is still being validated.
Licence summary
The DeepSeek-R1 series, DeepSeek-V3, and DeepSeek-V4 are all released under the MIT Licence, which permits commercial use, modification, and derivative works including fine-tuning and distillation. There are no royalty requirements. The licence applies to the model weights themselves; if you use DeepSeek’s cloud API rather than downloading the weights, their API terms of service apply separately — that is one more reason to run locally.
Source: DeepSeek-R1 on Hugging Face (MIT Licence); DeepSeek-V4 on Hugging Face (MIT Licence).
How much VRAM does each DeepSeek model need?
VRAM is the binding constraint for local inference. All figures below are for 4-bit quantisation (Q4_K_M) via GGUF unless stated — this is the practical default for most local deployments, balancing quality and memory efficiency. Add roughly 1–2 GB overhead for the KV cache at typical context lengths. Figures are approximate; your actual usage may vary slightly based on context length, batch size, and the specific quantisation build.
| Model | Parameters | Approx. VRAM (Q4_K_M) | Runs on | Best for |
|---|---|---|---|---|
| R1-Distill-Qwen-7B or R1-Distill-Llama-8B |
7–8B | ~5–8 GB | 8 GB GPU (RTX 3070, 4060 Ti 16GB); Apple Silicon 16 GB | Entry-level reasoning; fast responses; everyday assistant tasks |
| R1-Distill-Qwen-14B | 14B | ~8–10 GB | 12 GB GPU (RTX 3080, 4070); Apple Silicon 24 GB | Stronger reasoning; good coding assistant; document Q&A |
| R1-Distill-Qwen-32B | 32B | ~18–20 GB | Used RTX 3090 (24 GB) — the value pick; RTX 4090; Mac Studio M4 Max 36 GB | Near-frontier reasoning; strong maths and code; SMB workloads |
| R1-Distill-Llama-70B | 70B | ~40–48 GB | 2× RTX 3090; Apple Silicon 64–96 GB unified memory; single H100 at lower quant | Highest-quality distilled reasoning; team inference server |
| V3 full / R1 full (671B MoE) | 671B | ~376 GB (across all quant levels) | Multi-GPU server cluster (8× H100 or similar) | Hashcenter / hyperscale deployment; not for single-machine local use |
| V4-Flash (284B MoE) | 284B | ~33 GB (aggressive quant); ~80 GB FP8 | 2× RTX 4090 (min. quantised); single H100 at FP8 | General frontier-quality assistant; community GGUF builds emerging — verify |
| V4-Pro (1.6T MoE) | 1.6T | ~80 GB+ (smallest viable quant); ~170 GB full weights | 2× H200 minimum; dedicated Hashcenter infrastructure | Frontier-quality open-weight general model; infrastructure investment required |
Sources: bartowski/DeepSeek-R1-GGUF, unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF, willitrunai.com GPU guide. VRAM figures are approximate; ~1–2 GB overhead for KV cache at default context lengths. Verify against current model card before purchasing hardware.
For a broader model-by-GPU matchup, the Local LLM VRAM Calculator lets you estimate requirements across model families and quantisation levels. The GPU Comparison for Local LLMs covers which cards deliver best tokens-per-second for your budget.
Running DeepSeek locally with Ollama
Ollama is an open-source tool developed by the Ollama team that wraps llama.cpp in a clean API and CLI, making local LLM setup as simple as pulling a container. It is the fastest path from zero to a running DeepSeek model on Linux, macOS, or Windows. Full credit to the Ollama team for making this accessible.
Step 1: Install Ollama
On Linux:
curl -fsSL https://ollama.com/install.sh | sh
On macOS and Windows, download the installer from ollama.com/download. Verify you have version 0.5.7 or newer before pulling DeepSeek models.
Step 2: Pull and run a DeepSeek model
Pick the largest model your VRAM can fit. Ollama downloads the GGUF weights automatically:
# 7B — fits on 8 GB VRAM, ~4.5 GB download
ollama run deepseek-r1:7b
# 8B (Llama-based distill) — similar VRAM, strong coding
ollama run deepseek-r1:8b
# 14B — needs ~10 GB VRAM, ~9 GB download
ollama run deepseek-r1:14b
# 32B — needs ~20 GB VRAM, ~20 GB download (RTX 3090 target)
ollama run deepseek-r1:32b
# 70B — needs ~48 GB VRAM or two GPUs
ollama run deepseek-r1:70b
The R1-0528 updated distills (if you want the May 2026 improvements) are available via Hugging Face GGUF repos. Ollama can pull from Hugging Face directly:
# R1-0528 8B update (Qwen3-based) — via unsloth's GGUF build
ollama pull hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL
Check unsloth on Hugging Face for the latest 0528 GGUF builds. Model availability changes as the community releases new quantisations.
Step 3: Interact or integrate
Once running, Ollama exposes a local API at http://localhost:11434. You can chat in the terminal, connect Open WebUI for a browser interface, or call the API from your application using the OpenAI-compatible endpoint. No internet connection required after the initial model download.
Running DeepSeek locally with llama.cpp
llama.cpp — built by Georgi Gerganov and the open-source community — is the underlying inference engine that Ollama wraps. Using it directly gives you more control over quantisation selection, GPU layer offloading, context length, and batch size. This path is better for production deployments, low-level benchmarking, or situations where you want to mix GPU and CPU RAM (useful if you are slightly under the VRAM threshold for a model).
Step 1: Build or install llama.cpp
# Clone and build with CUDA support (NVIDIA GPU)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
For macOS with Apple Silicon, replace -DGGML_CUDA=ON with -DGGML_METAL=ON. Pre-built binaries are also available in the GitHub releases.
Step 2: Download a DeepSeek GGUF file
The community quantises DeepSeek models to GGUF format. Recommended sources:
- bartowski/DeepSeek-R1-GGUF — full R1 671B at multiple quantisation levels (for multi-GPU setups)
- Mungert/DeepSeek-R1-Distill-Qwen-32B-GGUF — 32B distill
- unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF — 8B distill, multiple quant levels
Quantisation tier guidance:
- Q4_K_M — recommended default; best balance of quality and VRAM
- Q8_0 — near-lossless quality; needs roughly double the VRAM of Q4; best for the 7B or 8B on a card with headroom
- Q2_K / IQ2_XXS — aggressive compression; fits larger models in less VRAM but quality degrades noticeably on reasoning tasks; use only if VRAM is the hard limit
Download using huggingface-cli or wget directly from Hugging Face model card links.
Step 3: Run inference
# Replace paths and adjust -ngl (GPU layers) to match your VRAM
./build/bin/llama-cli
--model /path/to/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf
--ctx-size 4096
--n-gpu-layers 64
--threads 8
--prompt "Explain the difference between PIPEDA and Quebec Law 25."
The --n-gpu-layers flag controls how many model layers go to the GPU. Set it to a high number (e.g., 99) to push everything to VRAM; reduce it if you overflow to allow CPU RAM offloading at the cost of slower inference. For a server endpoint, use llama-server instead of llama-cli to expose the OpenAI-compatible API.
For a comparison of Ollama vs llama.cpp vs vLLM for different use cases, see /ollama-vs-vllm-vs-llama-cpp/.
Why running locally matters for Canadian data sovereignty
Downloading and running DeepSeek weights locally is not the same as calling DeepSeek’s cloud API. When you call the API, your prompts travel to DeepSeek’s servers in China — raising questions that are distinct from using a US-hosted provider but no less real. Running the open weights locally means your data never crosses any border. That distinction matters in three specific Canadian contexts:
Quebec Law 25 (Act respecting the protection of personal information in the private sector)
Law 25 (enacted 2021–2023, fully in force as of September 2023) requires that organisations conducting cross-border transfers of personal information carry out a privacy impact assessment (PIA) and ensure adequate protection in the receiving jurisdiction before transferring. A local deployment — where personal information is processed on premises and never transmitted to a third party — avoids the cross-border transfer entirely. There is no transfer to assess. This is the cleanest posture available under Law 25. It is not an automatic compliance guarantee (other obligations still apply), but it removes the cross-border exposure by design. This is orientation, not legal advice — consult a qualified privacy lawyer or your organisation’s Privacy Officer for advice specific to your situation. Source: Commission d’accès à l’information du Québec (CAI), cai.gouv.qc.ca. See also our Law 25 Privacy Impact Assessment guide and RAG for Canadian Businesses (Law 25).
PIPEDA and its provincial equivalents
Canada’s federal privacy law (the Personal Information Protection and Electronic Documents Act, PIPEDA, and successor framework being developed under Bill C-27 — note: Bill C-27 was not passed into law as of the date of this writing; confirm current legislative status with legal counsel) requires that personal information remain protected when disclosed to third parties, including across borders. Using a foreign AI API to process personal information is a disclosure to a third party. Processing it locally on equipment you own and control is not. Orientation only — not legal advice.
The US CLOUD Act and foreign-hosted providers
The US Clarifying Lawful Overseas Use of Data (CLOUD) Act allows US law enforcement to compel US-based providers to produce data even when it is stored in Canada. This matters to Canadian businesses that use US-headquartered AI providers — even data nominally held in a Canadian AWS or Azure region can be reachable. Running DeepSeek weights on hardware you own in Canada involves no US provider at any point in the inference path. See our CLOUD Act and Canadian AI guide for the full breakdown.
The API trap: DeepSeek cloud is not local inference
DeepSeek’s cloud API is fast and inexpensive. It is also routed to servers outside Canada. Calling the DeepSeek API is sovereign AI in the name only — the model is open, but your data is not local. The whole point of open weights is that you can download and run the model yourself, with no dependency on the vendor’s infrastructure. That is the path this guide covers.
For the broader framing of why this matters — owning your AI capability as infrastructure, not renting it — see Sovereign AI in Canada and Local LLMs in Canada.
Hardware starting points for Canadian deployments
The right hardware depends on which model tier you are targeting. These are practical starting points — not an exhaustive list, and prices change. Verify current availability and pricing before purchasing; D-Central can source and configure these for Canadian customers.
| Target model tier | Hardware option | VRAM | Notes |
|---|---|---|---|
| R1 7B / 8B | RTX 3070 8 GB, RTX 4060 Ti 16 GB, RTX 3080 10 GB | 8–16 GB | Entry-level local AI; capable everyday assistant |
| R1 14B | RTX 3080 12 GB, RTX 4070 | 12–16 GB | Solid step up; fits on mid-range gaming GPU |
| R1 32B | RTX 3090 (used), RTX 4090, Mac Studio M4 Max 36 GB | 24–36 GB | A used RTX 3090 is the value pick for VRAM-per-dollar in Canada; ~20 GB needed leaves headroom at 24 GB |
| R1 70B | 2× RTX 3090, Mac Studio M4 Max 96–128 GB | 48–128 GB | Apple Silicon unified memory is uniquely suited here; ~12 tok/s on 70B Q4 on M4 Max 128 GB |
| V4-Flash 284B | 2× RTX 4090 (minimum at aggressive quant) | ~33 GB minimum | Community GGUF builds still maturing — verify before purchasing for this use case |
D-Central builds hand-configured local AI workstations for Canadian businesses. Browse Sovereign AI for current options, or book a Sovereignty Briefing to get a hardware recommendation specific to your model target, budget, and workload — quote-only, build-to-order, shipped Canada-wide.
For a full comparison of GPU options with real tokens-per-second benchmarks, see GPU for Local LLM Comparison. For a broader open-weight model family comparison that puts DeepSeek in context against Llama, Qwen, and others, see Open-Weight AI Canada Comparison.
Quantisation: what the numbers mean
If you have read the setup guides above and encountered Q4_K_M, Q8_0, IQ2_XXS, and want to know what they mean practically: see the dedicated AI Quantisation Guide (INT4, INT8, FP16). The short version: quantisation compresses model weights to use less memory, trading a small amount of quality for a large reduction in VRAM requirement. Q4_K_M is the practical default for most deployments and preserves most of the model’s capability while halving the memory requirement versus FP16. For reasoning-heavy tasks like DeepSeek-R1, Q4_K_M is generally preferred over more aggressive quantisations that can hurt multi-step reasoning quality.
Air-gapped deployments for regulated environments
If your regulatory environment or threat model requires that the inference machine never touch the internet — even for the initial model download — you can pre-download the GGUF weights on a connected machine and transfer via encrypted drive. Ollama supports a fully offline model directory with no outbound calls during inference. For regulated industries (legal, healthcare, government, financial services), this is the strongest privacy posture available. See Air-Gapped AI Coding in Canada for a complete setup guide. The Distributed Compute page covers team-scale inference architectures.
Frequently asked questions
Is DeepSeek open source?
DeepSeek releases its model weights under the MIT Licence, which permits commercial use, modification, and derivative works. The weights are publicly available on Hugging Face. The training code and infrastructure are not fully open — this is “open weights,” not open source in the fullest sense. The distinction matters: you can download, run, and fine-tune DeepSeek; you cannot independently verify the training data or reproduce the training run. Credit to DeepSeek for the permissive weight release; the weights themselves are the useful artefact for local deployment.
Is it safe to use DeepSeek in Canada given that it is a Chinese company?
This is a legitimate question with two distinct answers depending on how you use it. Calling DeepSeek’s cloud API routes your data to servers outside Canada — that is a data transfer to consider under Law 25 and PIPEDA, and to evaluate based on your own threat model and regulatory obligations. Running DeepSeek’s open weights locally means your data never touches DeepSeek’s infrastructure at any point — the model weights sit on hardware you own and control in Canada. The open-weight local deployment model severs the vendor dependency entirely. For regulatory questions specific to your organisation, consult qualified legal counsel — this is orientation, not legal advice.
What is the difference between DeepSeek-R1 and DeepSeek-V3?
DeepSeek-R1 is a reasoning-specialist model: it works through problems step by step, producing a visible chain of thought before giving a final answer. It is notably stronger at maths, logic, and coding. DeepSeek-V3 is a general instruction model — faster and better for conversation, writing, and open-ended tasks that do not require multi-step reasoning. For most local deployments, R1 distilled variants are the better starting point because the reasoning capability transfers well to small model sizes.
How fast will DeepSeek run on my machine?
Speed depends on your GPU’s memory bandwidth (not just VRAM capacity) and the model size. As a practical reference: a used RTX 3090 running DeepSeek-R1-Distill-Qwen-32B at Q4_K_M should produce roughly 15–25 tokens per second — a comfortable interactive pace. The 7B model on a 24 GB card runs at 60–80+ tokens per second. These figures are community-reported estimates; your specific setup will vary. Use the VRAM Calculator to estimate requirements, then benchmark after deployment.
Can I use DeepSeek locally for commercial work in Canada?
The MIT Licence on DeepSeek’s weights permits commercial use with no royalty requirements. Running the model locally means you are not subject to DeepSeek’s API terms of service. Standard intellectual property and privacy obligations still apply to the content you generate — consult legal counsel if you have specific compliance questions. This is orientation, not legal or financial advice.
What is DeepSeek-R1-0528?
DeepSeek-R1-0528 is an updated version of the full R1 model released on 28 May 2026. Key improvements: AIME 2025 maths accuracy increased from 70% to 87.5%, average reasoning depth roughly doubled (from ~12,000 to ~23,000 tokens per problem), reduced hallucination rate, and added JSON output and function-calling support. Distilled versions of 0528 are available on Hugging Face under the same MIT licence. Sources: DeepSeek changelog, VentureBeat.
Can D-Central set this up for me?
Yes. We source the hardware, configure Ollama or llama.cpp, select and quantise the right model for your workload, test the deployment, and ship it ready to use. Everything is quote-only and build-to-order — book a Sovereignty Briefing to start with a written recommendation, or browse Sovereign AI for pre-configured builds available to ship Canada-wide.
- Local LLMs in Canada — the complete overview
- Local LLM VRAM Calculator — estimate your hardware requirements
- Open-Weight AI Canada Comparison — DeepSeek vs Llama vs Qwen and others
- AI Quantisation Guide (INT4, INT8, FP16)
- Ollama vs vLLM vs llama.cpp
- Cloud vs Local AI — total cost of ownership comparison
- Sovereign AI in Canada
- Law 25 Privacy Impact Assessment guide
- Air-Gapped AI Coding in Canada
- RAG for Canadian Businesses (Law 25 compliant)
Related products, repair, and setup paths
- self-hosted AI for Bitcoiners hub
- plebs guide to self-hosted AI
- install Ollama in 10 minutes
- LM Studio vs Ollama vs llama.cpp
- connect local AI to Home Assistant and Obsidian
- self-hosted AI troubleshooting
- repurpose mining hardware into an AI hashcenter
- local AI model leaderboards
Last reviewed June 18, 2026.
