Skip to content

Bitcoin accepted at checkout  |  Ships from Laval, QC, Canada  |  Expert support since 2016

Local LLM Setup Checklist — 8-Phase Guide for Canadian Organizations

Setting up a local LLM has eight distinct phases — hardware sizing, OS prep, runtime install, model selection, RAG pipeline, Canadian privacy posture (Law 25), network hardening, and ongoing maintenance. Skip any phase and you either end up with a system that underperforms its hardware, exposes internal data to the network, or sits outside your organization’s privacy obligations. Work through this checklist once; your progress is saved automatically in your browser.

This checklist is the companion to D-Central’s local LLM Canada guide and the local AI hardware guide. For VRAM sizing before you purchase hardware, use the interactive VRAM calculator. For a managed deployment or consulting engagement, see AI sovereignty consulting.

We stand on the shoulders of the open-source inference community — llama.cpp (Georgi Gerganov et al.), Ollama, vLLM, HuggingFace Transformers, Chroma, and LlamaIndex. None of what follows requires D-Central software; this checklist works with any compatible stack.

Overall progress

0 %
0 / 0 items




Phase 1
Hardware selection & sizing

The most common mistake is buying hardware before sizing the model. Work through items 1–3 before spending anything. See the full local AI hardware guide for GPU comparison tables.









VRAM quick-reference (Q4_K_M quantization, 4 K context)

Approximate figures from Ollama model library and published HuggingFace model cards, June 2026. Actual VRAM usage varies with context length and batch size — use the VRAM calculator for your specific configuration. Verify before purchasing.

Model Params Q4_K_M VRAM (approx.) Minimum GPU License
Qwen2.5 3B / Llama 3.2 3B 3 B ~2.1–2.3 GB 6 GB (RTX 3060) Apache 2.0 / Meta License
Gemma 3 4B 4 B ~3.0 GB 6 GB (RTX 3060) Google Gemma ToS
Qwen2.5 7B / DeepSeek-R1 7B 7 B ~4.7 GB 8 GB (RTX 3070) Apache 2.0 / MIT
Llama 3.1 8B / Mistral 7B v0.3 7–8 B ~5.1–5.4 GB 8 GB (RTX 3070) Meta License / Apache 2.0
Mistral Nemo / Gemma 3 12B 12 B ~7.1–8.1 GB 10 GB (RTX 3080) Apache 2.0 / Google Gemma ToS
Qwen2.5 14B / Phi-4 14B 14 B ~9.0–9.3 GB 12 GB (RTX 3080 12G) Apache 2.0 / MIT
Gemma 3 27B 27 B ~17 GB 24 GB (RTX 3090) Google Gemma ToS
Qwen2.5 32B / DeepSeek-R1 32B 32 B ~20–21 GB 24 GB (RTX 3090/4090) Apache 2.0 / MIT
Llama 3.1 70B / Qwen2.5 72B 70–72 B ~43–47 GB 2× 24 GB or 1× A100 80 GB Meta License / Apache 2.0
Llama 3.1 405B 405 B ~240 GB+ Multi-GPU / multi-node Meta License

Colour guide: ■ ≤8 GB GPU   ■ ≤16 GB GPU   ■ ≤24 GB GPU   ■ ≤48 GB GPU   ■ 80+ GB / multi-GPU. Figures verified at Ollama model library and HuggingFace model cards, June 2026 — check current cards before buying.

Phase 2
OS, drivers & environment







Phase 3
Inference runtime installation







Phase 4
Model selection & download









Phase 5
RAG pipeline setup

Retrieval-Augmented Generation (RAG) lets your LLM answer questions about your internal documents without baking that data into model weights. All retrieval happens locally — no data leaves your hardware. Skip this phase if you only need a general-purpose assistant with no internal knowledge base.








Phase 6
Quebec Law 25 / Canadian privacy posture

This section provides orientation only — it is NOT legal advice. Quebec’s Loi modernisant des dispositions législatives en matière de protection des renseignements personnels (Law 25) is complex and organization-specific. For binding compliance guidance, consult a qualified privacy lawyer or contact the Commission d’accès à l’information (CAI) du Québec at cai.gouv.qc.ca. All legal claim dates: June 2026.








Phase 7
Network security hardening








Phase 8
Monitoring & ongoing maintenance








Need help with any phase? D-Central offers AI sovereignty consulting covering hardware selection, inference stack deployment, RAG pipeline architecture, and Law 25 posture — for Canadian organizations that want sovereign, fully on-premises AI. All systems run on hardware physically located in Canada; no US cloud dependency.

Frequently asked questions

How much VRAM do I need to run a local LLM?

The minimum practical entry point for a useful general-purpose model is 8 GB VRAM, which runs 7B–8B parameter models at Q4_K_M quantization (e.g., Qwen2.5 7B, Llama 3.1 8B) with a 4 K token context window. For a 14B model — noticeably more capable for complex reasoning — you need at least 10–12 GB VRAM. For 32B models, 24 GB (e.g., RTX 3090 or RTX 4090) is the minimum. The VRAM quick-reference table above gives approximate figures for the most popular open-weight models as of June 2026; use the VRAM calculator to model your specific context length and batch size.

Does running a local LLM help with Quebec Law 25 compliance?

Self-hosting your inference on hardware physically located in Quebec significantly simplifies your Law 25 compliance posture because personal data in user prompts never crosses provincial or international borders. Quebec Law 25 requires a Privacy Impact Assessment (PIA) before transferring personal data outside Quebec — a fully local deployment eliminates this cross-border transfer, removing a major compliance obligation. However, Law 25 still applies to how you store, process, and log data locally. Phase 6 of this checklist covers the key obligations. This is orientation, not legal advice — consult the Commission d’accès à l’information (CAI) du Québec or a qualified privacy lawyer for binding guidance.

What is RAG and do I need it for a local LLM?

Retrieval-Augmented Generation (RAG) lets your LLM answer questions about your internal documents without retraining or fine-tuning the model. A retrieval layer fetches relevant chunks from a local vector database at query time and passes them to the LLM as context. You need RAG if you want the model to access organization-specific knowledge — internal policies, product manuals, support history, or any documents that post-date the model’s training cutoff. If you only need a general-purpose AI assistant with no internal knowledge requirements, you can skip Phase 5 entirely. For Canadian organizations, a major sovereignty benefit of RAG is that your documents stay in your local vector store — they are never uploaded to a third-party API.

Which inference runtime should I start with — Ollama, llama.cpp, or vLLM?

For most teams starting their first local LLM deployment, Ollama is the recommended starting point: single binary install, automatic GPU detection, built-in model management, and an OpenAI-compatible REST API that your existing tools can connect to immediately. Move to llama.cpp if you need fine-grained control over quantization, context window, or CPU-GPU split for unusual hardware. Use vLLM if you are serving multiple concurrent users and need maximum throughput — vLLM’s PagedAttention architecture significantly outperforms Ollama at high concurrency but requires more setup complexity and a CUDA-capable GPU. All three are open-source and free; the choice is about operational complexity vs performance for your specific workload.

How do I prevent my local LLM server from being exposed to the internet?

The key actions are covered in Phase 7: (1) configure the inference runtime to bind to 127.0.0.1 (localhost) rather than 0.0.0.0 (all interfaces), (2) add a firewall deny rule for the inference port from external sources, and (3) if you need LAN access, put a TLS-terminating reverse proxy with API key authentication in front. By default, both Ollama and llama.cpp server bind to all interfaces — you must actively restrict this. Verify with an external port scan after hardening.

Can I run a local LLM entirely offline, with no internet connection after setup?

Yes. Once your inference runtime, model weights, and RAG index are downloaded, the system requires no internet connection for inference. The only components that need internet are: (a) model downloads during initial setup, (b) embedding model downloads, and (c) any external document sources you configure for RAG ingestion. For high-security environments, you can download everything on a connected machine and transfer via an internal network or physical media — this is called an “air-gap” deployment. See Phase 7, item 5 for the Ollama air-gap configuration. This is a strong fit for organizations where internet access from the AI server represents a security or sovereignty risk.