Local LLM Setup Checklist — 8-Phase Guide for Canadian Organizations

Setting up a local LLM has eight distinct phases — hardware sizing, OS prep, runtime install, model selection, RAG pipeline, Canadian privacy posture (Law 25), network hardening, and ongoing maintenance. Skip any phase and you either end up with a system that underperforms its hardware, exposes internal data to the network, or sits outside your organization’s privacy obligations. Work through this checklist once; your progress is saved automatically in your browser.

This checklist is the companion to D-Central’s local LLM Canada guide and the local AI hardware guide. For VRAM sizing before you purchase hardware, use the interactive VRAM calculator. For a managed deployment or consulting engagement, see AI sovereignty consulting.

We stand on the shoulders of the open-source inference community — llama.cpp (Georgi Gerganov et al.), Ollama, vLLM, HuggingFace Transformers, Chroma, and LlamaIndex. None of what follows requires D-Central software; this checklist works with any compatible stack.

Overall progress

0 %
0 / 0 items

Phase 1
Hardware selection & sizing

▼

The most common mistake is buying hardware before sizing the model. Work through items 1–3 before spending anything. See the full local AI hardware guide for GPU comparison tables.

Define your primary use case
Choose one: general-purpose chat · code assistant · document/RAG Q&A · long-context summarization · multilingual (FR/EN) · image understanding. This determines minimum context window, model family, and quantization requirements.
Estimate VRAM requirement from the quick-reference table below (or use the VRAM calculator)
Rule of thumb for Q4_K_M quants: multiply parameter count (billions) × ~0.6 GB and add ~1–2 GB for KV cache at 4K context. Example: 7B × 0.6 = 4.2 GB + 1.5 GB cache ≈ 5.7 GB — fits a 6 GB GPU only at short contexts; 8 GB recommended. Use the full calculator to model your actual context length.
⚠ Purchase hardware AFTER completing this item, not before.
Select GPU tier appropriate for your target model
Consumer (≤24 GB VRAM): RTX 3090 24 GB, RTX 4090 24 GB, RTX 4060 Ti 16 GB — suitable for 7B–27B Q4 models. Prosumer (48 GB): 2× RTX 4090 (multi-GPU offload), RTX 6000 Ada 48 GB. Enterprise (80 GB): A100 80 GB, H100 80 GB — required for 70B+ models in a single card.
Confirm CPU: ≥8 cores with AVX2 support
CPU matters for tokenization, pre/post-processing, and layers offloaded to RAM when GPU VRAM is insufficient. AVX2 is required by most inference runtimes for efficient CPU operations. Verify: lscpu | grep avx2 (Linux) or CPU-Z (Windows).
Plan system RAM: at minimum 2× your GPU VRAM
If your GPU has 24 GB VRAM, plan for ≥48 GB system RAM to allow layer offloading for models that slightly exceed VRAM. For enterprise GPU nodes (80 GB), 256 GB system RAM is typical. Insufficient RAM causes the system to swap to disk — inference becomes unusably slow.
Storage: NVMe SSD with ≥2 TB free capacity
Models accumulate quickly: a 7B Q4 model is ~4.5 GB, a 70B Q4 model is ~43 GB. A small library of 5–10 models can consume 100–200 GB. RAG vector indexes, embeddings, and logs add more. NVMe (not SATA SSD) speeds model loads significantly at startup.
Confirm PSU wattage: GPU TDP + 20% safety headroom
RTX 4090 TDP = 450 W. A system with one RTX 4090, high-core-count CPU, and NVMe drives can draw 650–750 W at peak. A 850 W 80+ Gold PSU is the minimum; 1000–1200 W recommended for headroom and efficiency. For multi-GPU: add 450 W per additional card.
Verify physical fit: GPU length, PCIe slot spacing, case airflow clearance
High-end GPUs (RTX 4090, A100) are physically large — verify GPU length fits your case, PCIe x16 slot spacing allows the card, and there is adequate airflow (minimum 50 mm clearance above GPU cooler). Rack-mount AI servers solve this but cost more.

VRAM quick-reference (Q4_K_M quantization, 4 K context)

Approximate figures from Ollama model library and published HuggingFace model cards, June 2026. Actual VRAM usage varies with context length and batch size — use the VRAM calculator for your specific configuration. Verify before purchasing.

Model	Params	Q4_K_M VRAM (approx.)	Minimum GPU	License
Qwen2.5 3B / Llama 3.2 3B	3 B	~2.1–2.3 GB	6 GB (RTX 3060)	Apache 2.0 / Meta License
Gemma 3 4B	4 B	~3.0 GB	6 GB (RTX 3060)	Google Gemma ToS
Qwen2.5 7B / DeepSeek-R1 7B	7 B	~4.7 GB	8 GB (RTX 3070)	Apache 2.0 / MIT
Llama 3.1 8B / Mistral 7B v0.3	7–8 B	~5.1–5.4 GB	8 GB (RTX 3070)	Meta License / Apache 2.0
Mistral Nemo / Gemma 3 12B	12 B	~7.1–8.1 GB	10 GB (RTX 3080)	Apache 2.0 / Google Gemma ToS
Qwen2.5 14B / Phi-4 14B	14 B	~9.0–9.3 GB	12 GB (RTX 3080 12G)	Apache 2.0 / MIT
Gemma 3 27B	27 B	~17 GB	24 GB (RTX 3090)	Google Gemma ToS
Qwen2.5 32B / DeepSeek-R1 32B	32 B	~20–21 GB	24 GB (RTX 3090/4090)	Apache 2.0 / MIT
Llama 3.1 70B / Qwen2.5 72B	70–72 B	~43–47 GB	2× 24 GB or 1× A100 80 GB	Meta License / Apache 2.0
Llama 3.1 405B	405 B	~240 GB+	Multi-GPU / multi-node	Meta License

Colour guide: ■ ≤8 GB GPU ■ ≤16 GB GPU ■ ≤24 GB GPU ■ ≤48 GB GPU ■ 80+ GB / multi-GPU. Figures verified at Ollama model library and HuggingFace model cards, June 2026 — check current cards before buying.

Phase 2
OS, drivers & environment

▼

Install or confirm OS: Ubuntu 22.04/24.04 LTS recommended
Ubuntu 22.04 LTS (Jammy) or 24.04 LTS (Noble) give the widest driver and runtime support. Windows 11 with WSL2 is an acceptable alternative for existing Windows infrastructure. macOS (Apple Silicon, M3/M4) works well with Metal-backed runtimes like Ollama. Avoid non-LTS Ubuntu releases for production inference servers.
Install NVIDIA drivers (stable branch) and verify with nvidia-smi
Use the production/stable driver, not the latest beta. On Ubuntu: sudo apt install nvidia-driver-550 (verify latest stable version at nvidia.com at time of install). After reboot, nvidia-smi must show correct VRAM and driver version — if it shows “No devices were found”, the driver is not loaded.
Install CUDA Toolkit (version matched to your driver)
Check the CUDA–driver version compatibility table at docs.nvidia.com. Most inference runtimes require CUDA 12.x as of mid-2026. Verify: nvcc --version. On Ubuntu: install via the CUDA runfile from nvidia.com, not via apt, for best version control.
(AMD GPUs only) Install ROCm and verify your GPU is on the supported list
ROCm support is narrower than CUDA. Verify your GPU model at rocm.docs.amd.com before purchasing AMD hardware for inference. Consumer cards (RX 7900 XTX) are supported from ROCm 6.x but with some limitations vs CUDA.
Install Python 3.10+ and virtualenv tooling
Many RAG frameworks (LlamaIndex, LangChain) require Python 3.10 or 3.11. Use pyenv to manage multiple Python versions without conflicting with the system Python. Always run inference workloads in a virtualenv or conda environment — never install to system Python.
Configure firewall to default-deny inbound before any service is installed
On Ubuntu: sudo ufw default deny incoming && sudo ufw allow ssh && sudo ufw enable. This ensures no inference API port is accidentally exposed to the network the moment a runtime starts. Add explicit allow rules only as needed in Phase 7.
⚠ Do not skip — Ollama and llama.cpp default to binding 0.0.0.0 (all interfaces) if not configured otherwise.

Phase 3
Inference runtime installation

▼

Choose your inference runtime
Ollama: recommended starting point — single binary, built-in model manager, OpenAI-compatible REST API, automatic GPU/CPU splitting. llama.cpp: lowest overhead, maximum configurability, ideal when you need to tune every parameter. vLLM: production throughput with PagedAttention — use when serving multiple concurrent users (requires CUDA). LM Studio: GUI option for Windows/macOS single-user workstations.
Install chosen runtime following official upstream documentation
Ollama: curl -fsSL https://ollama.com/install.sh | sh (Linux). llama.cpp: clone + make LLAMA_CUDA=1. vLLM: pip install vllm in a fresh virtualenv. Always install from the project’s official source — not third-party mirrors.
Run smoke test with a small model and verify GPU acceleration is active
With Ollama: ollama run qwen2.5:3b, then in another terminal nvidia-smi and confirm GPU memory is consumed. If GPU VRAM usage stays at zero, the runtime is running on CPU only — check CUDA installation and driver version compatibility.
Verify the API endpoint responds correctly on localhost
Ollama: curl http://localhost:11434/api/tags — should return a JSON list of models. llama.cpp server: default port 8080. vLLM: default port 8000 with OpenAI-compatible /v1/chat/completions. Confirm it only listens on 127.0.0.1 (not 0.0.0.0) before proceeding.
Configure context window size appropriate for your use case
Default context (num_ctx) is often 2048–4096 tokens. Each additional 1,024 context tokens consumes roughly 200–500 MB of VRAM (varies by model architecture and KV cache dtype). For document Q&A, 8K–32K context is common; for simple chat, 4K is sufficient. Larger context = more VRAM consumed regardless of actual prompt length.
Configure runtime as a system service (for persistent deployment)
Ollama installs a systemd service automatically (systemctl status ollama). For llama.cpp or vLLM, write a systemd unit file so the service restarts after reboots. Without this, the inference server stops when you close the terminal.

Phase 4
Model selection & download

▼

Match model family to your use case
General chat / reasoning: Llama 3.1/3.3, Qwen2.5, Gemma 3. Code generation: Qwen2.5-Coder, DeepSeek-Coder-V2. Reasoning / step-by-step: DeepSeek-R1 distilled variants (7B, 14B, 32B). Bilingual FR/EN: Qwen2.5 (strong multilingual), Mistral Nemo 12B, or multilingual-e5 for embeddings only. Vision/multimodal: LLaVA, Qwen2-VL, Gemma 3 12B-IT (multimodal).
Confirm the model fits your GPU VRAM at your chosen context window
Cross-reference the VRAM table in Phase 1 with your GPU’s installed VRAM. Remember: actual VRAM usage = model weights + KV cache. KV cache grows linearly with context length. When in doubt, use the VRAM calculator with your exact parameters.
Select quantization tier: Q4_K_M is the recommended starting point
Q4_K_M: best quality-per-VRAM-gigabyte ratio for most use cases — start here. Q8_0: near-lossless vs FP16, but uses ~2× VRAM of Q4. Q5_K_M: good midpoint if VRAM allows. IQ4_XS: higher compression, slightly more quality loss, useful for squeezing larger models into tighter VRAM. Avoid Q2 quantizations for anything requiring factual accuracy.
Download model via Ollama or direct GGUF from HuggingFace
Ollama: ollama pull qwen2.5:14b (handles quantization selection automatically). Direct GGUF: use huggingface.co — search for the model name + GGUF. Trusted GGUF publishers include bartowski, unsloth, and the model’s original org. Prefer models from the original model organization when available.
Verify model checksum / SHA256 hash for files downloaded outside Ollama
Ollama verifies integrity automatically. For manual GGUF downloads: compare the SHA256 listed on the HuggingFace model card with the downloaded file (sha256sum model.gguf). A tampered model could produce subtly incorrect outputs or, in a worst case, contain malicious payloads.
⚠ Do not skip for any file not downloaded through Ollama’s verified pipeline.
Review and document the model’s license for your intended use
Apache 2.0 (Qwen2.5, Mistral 7B, DeepSeek-R1 distills): permissive commercial use. Meta Llama 3 Community License: commercial use permitted but with restrictions above 700M MAU — read the full license at llama.meta.com. Google Gemma ToS: non-commercial or limited commercial — verify your specific use case complies. MIT (DeepSeek-R1, Phi): most permissive. This is orientation only; consult legal counsel for commercial licensing decisions.
Do NOT load confidential organizational data into a model you did not train
Fine-tuning with proprietary data is a separate, advanced step outside this checklist. The models you download were trained on public internet data. Confidential data belongs in the RAG layer (Phase 5), not baked into model weights, because RAG keeps data under your control and separately auditable.
Record model name, version tag, and quantization in your deployment log
Example: qwen2.5:14b-instruct-q4_K_M — pulled 2026-06-15 via Ollama 0.3.x. This is your reproducibility trail and audit record — required for Law 25 documentation (Phase 6).

Phase 5
RAG pipeline setup

▼

Retrieval-Augmented Generation (RAG) lets your LLM answer questions about your internal documents without baking that data into model weights. All retrieval happens locally — no data leaves your hardware. Skip this phase if you only need a general-purpose assistant with no internal knowledge base.

Define your document corpus: what internal knowledge must the model access?
Common corpora: internal policy/procedure documents, product manuals, Confluence/Notion exports, customer support tickets, code repositories. Be precise — RAG quality degrades when the corpus contains irrelevant noise. Start with a focused, high-quality subset; expand later.
Choose and install an embedding model
English-only: nomic-embed-text (via Ollama — 768 dim, strong performance) or all-MiniLM-L6-v2 (fast, lower quality). FR/EN bilingual: multilingual-e5-large or paraphrase-multilingual-mpnet-base-v2. Embedding models are small (200 MB–1.5 GB) and typically run on CPU without impacting GPU inference.
Choose and install a local vector store
Chroma: simplest to start, embedded mode (no separate server), ideal for single-user or prototyping. Qdrant (local mode): production-ready, Docker-deployable, strong filtering capabilities. LanceDB: embedded, no server, columnar storage — good for large corpora. All run fully locally with no cloud dependency.
Build ingestion pipeline: parse → chunk → embed → store
Recommended chunk settings: 512–1,024 tokens per chunk, 10–20% overlap between chunks. Overlap preserves context across chunk boundaries. For PDFs: use pdfminer or pypdf. LlamaIndex and LangChain provide ready-made document loaders for most formats (PDF, Markdown, Word, CSV, HTML). Run the full ingestion and verify no documents were silently skipped.
Evaluate retrieval quality: run 5–10 known queries and verify returned chunks
For each test query, confirm the top-3 retrieved chunks actually contain the expected information. If retrieval is poor: (a) reduce chunk size, (b) switch embedding model, (c) add metadata filters. Poor retrieval will make the full RAG pipeline fail regardless of model quality — don’t skip this validation step.
Connect retrieval layer to the LLM inference endpoint
LlamaIndex: VectorStoreIndex + RetrieverQueryEngine with an Ollama LLM. LangChain: RetrievalQA chain. Both support the Ollama OpenAI-compatible endpoint. Test end-to-end: ask a question only answerable from your corpus and verify the answer cites the retrieved document.
Set data retention policy for indexed documents and conversation logs
If your corpus contains personal data (employee records, customer data), define how long the vector embeddings and raw documents are retained, who can update or delete them, and what happens at end-of-contract. This is directly relevant to your Law 25 obligations (Phase 6). Document the policy in writing.

Phase 6
Quebec Law 25 / Canadian privacy posture

▼

This section provides orientation only — it is NOT legal advice. Quebec’s Loi modernisant des dispositions législatives en matière de protection des renseignements personnels (Law 25) is complex and organization-specific. For binding compliance guidance, consult a qualified privacy lawyer or contact the Commission d’accès à l’information (CAI) du Québec at cai.gouv.qc.ca. All legal claim dates: June 2026.

Confirm all inference compute runs on hardware physically located in Canada
When inference runs on your own Quebec hardware, personal data in prompts never crosses international borders. This substantially simplifies your Law 25 cross-border transfer assessment obligations (per CAI guidance, September 2023 phase). For cloud-GPU scenarios, data does cross borders — a full Privacy Impact Assessment (PIA) is then required per Law 25.
Verify no personal data flows to external US cloud APIs during normal operation
Audit your application code for any fallback calls to OpenAI/Anthropic/Gemini APIs. The US CLOUD Act (18 U.S.C. § 2713) allows US government to compel disclosure of data held by US-based technology companies anywhere in the world — data processed by local models on Canadian hardware is not reachable under this authority. This is the core sovereignty argument for local LLMs.
Inventory what personal data your LLM system may process
Examples: employee names in HR documents indexed in RAG, customer names/emails in support ticket RAG, health information in a medical context. Under Law 25 (enforced September 2023), organizations must maintain a documented inventory of personal information held. Your LLM system is in scope if it processes or can access personal data.
Conduct a Privacy Impact Assessment (PIA) if the system accesses personal data of Quebec residents
Under Law 25, a PIA is mandatory before deploying a technology project involving personal information (per Section 63.1 of the Act). The PIA must assess risks, propose mitigation measures, and be documented. Your local deployment is an advantage here — data residency in Quebec is itself a significant risk-reduction measure. Template PIA frameworks are available from the CAI.
⚠ This is a legal requirement, not a best practice — consult your privacy lawyer before going live with personal data.
Confirm your organization has a designated Privacy Officer
Law 25 requires all organizations subject to Quebec privacy law to designate a Privacy Officer (the person who, by default, is the CEO — it can be delegated). The Privacy Officer’s name must be published on your website. This requirement has been in force since September 2022 (Phase 1 of Law 25).
Add the AI system to your privacy inventory with purpose, retention, and third-party access documented
Your privacy inventory must include: (1) categories of personal data processed, (2) purpose of processing, (3) retention schedule, (4) whether data is shared with any third parties (and if so, under what agreement), (5) safeguards in place. For a fully local LLM, third-party sharing is typically none — document this explicitly as it is a key compliance strength.
Set inference log retention to the minimum period necessary
Inference logs may contain personal data (users’ questions often include names, project details, health information). Law 25 requires personal data to be destroyed once the purpose is fulfilled. Define a maximum retention period (e.g., 90 days), automate deletion, and document it in your privacy inventory. Do NOT keep logs indefinitely as a default.
Plan your privacy incident response procedure for the LLM system
Under Law 25, a privacy incident involving personal data with a risk of serious injury must be reported to the CAI and affected individuals. Define what constitutes a “privacy incident” for your LLM system (e.g., unauthorized access to the inference server, RAG index exfiltration), who is responsible for assessment, and the 72-hour reporting timeline to CAI.

Phase 7
Network security hardening

▼

Restrict the inference API to localhost or a named internal VLAN — deny all external access
Ollama: set OLLAMA_HOST=127.0.0.1:11434 in the systemd environment file, or bind to a specific LAN IP rather than 0.0.0.0. Firewall rule: ufw deny from any to any port 11434 (then allow from trusted subnet only). Verify the port is not reachable from outside: nmap -p 11434 <server-external-ip>.
⚠ An unprotected Ollama endpoint exposed to the internet allows anyone to run inference on your GPU and access your RAG index.
If serving the LAN: put a TLS-terminating reverse proxy (Nginx or Caddy) in front
Never expose the raw Ollama/llama.cpp HTTP endpoint on a LAN without TLS — inference traffic including prompts travels in plaintext. Caddy: simplest TLS setup with automatic Let’s Encrypt (or self-signed CA for internal LAN). Nginx: more control. Both can proxy to 127.0.0.1:11434 with rate limiting and access logging.
Add API key authentication if multiple users or services call the endpoint
Ollama (as of 0.2+): set OLLAMA_AUTH_TOKEN environment variable. Alternatively, implement API key validation at the Nginx reverse proxy layer (simpler to rotate keys without restarting the inference service). Use a different key per team or integration — makes it easy to revoke individual access.
Restrict SSH access to the inference server to named admin keys only
In /etc/ssh/sshd_config: set PasswordAuthentication no, PubkeyAuthentication yes, and limit AllowUsers to a minimal set. The inference server holds your RAG index (which may contain confidential documents) — treat it like a data server, not a dev machine.
Optionally air-gap model downloads: disable outbound access to model registries after initial setup
If your threat model includes supply-chain compromise (malicious model updates), block outbound to ollama.com and huggingface.co after models are pulled. Pull new models manually via an offline transfer (copy GGUF via internal network). Prevents an attacker who compromises the inference server from pulling a backdoored model.
Enable access logging: record inference request metadata (timestamp, model, source IP, prompt length — not necessarily prompt content)
Logging prompt content raises its own Law 25 issues — log metadata only unless you have specific reason and a documented retention policy for content logs. Metadata logs help detect abuse (unusually large or frequent requests) and support incident response if the server is compromised.
Define and document your incident response plan for a compromised inference server
Key questions: (1) Who is alerted first? (2) How quickly can the service be isolated (kill switch)? (3) Where are model weights backed up so they can be restored to a clean machine? (4) What personal data may have been exposed (per your Phase 5 corpus)? (5) Does this trigger a Law 25 incident report to the CAI?

Phase 8
Monitoring & ongoing maintenance

▼

Install GPU monitoring: nvitop or nvidia-smi dmon for live temperature, VRAM, and utilization tracking
Install: pip install nvitop. Run: nvitop for an interactive htop-like GPU monitor. For headless servers: nvidia-smi dmon -s pucvmet -d 5 logs metrics every 5 seconds. For production: integrate GPU metrics into Prometheus + Grafana via nvidia-dcgm-exporter.
Set a thermal alert: sustained GPU temperature above 85 °C requires investigation
NVIDIA consumer GPUs throttle at ~83–88 °C and risk longevity damage above 90 °C sustained. Causes: dust accumulation on heatsink, inadequate case airflow, or undervolted fans. Check immediately if temperature alerts trigger. For Hashcenter AI Node deployments, D-Central’s thermal management guidelines apply — see AI sovereignty consulting.
Configure disk space monitoring with alert at ≥85% disk usage
A disk-full event will crash Ollama/llama.cpp model loads and corrupt in-progress downloads. Set up df -h cron check or add to your Prometheus node exporter scrape. Clean up old model versions regularly — each major model update can leave a full copy of the prior version on disk.
Schedule monthly model update review
Check for security-relevant updates to your inference runtime (Ollama, llama.cpp patch releases) and for improved versions of your active model. Note: updating models is not mandatory on the same cadence as OS patches — evaluate new versions for your use case before rolling out. Maintain a changelog of model versions in production.
Back up model weights to external or secondary storage
Models are large but re-downloadable from HuggingFace/Ollama — so backup is lower priority than for unique data. However, if you are in an air-gapped or bandwidth-constrained environment, losing model files means a long re-download. Store a copy on an internal NAS or object storage. For fine-tuned models (if you trained your own), backup is critical — those weights are irreplaceable.
Test full disaster recovery procedure at least once annually
From a clean machine: (1) install OS, (2) install drivers and runtime, (3) restore model weights, (4) restore RAG index, (5) verify end-to-end inference. Time it. Document the runbook. A system you’ve never restored from scratch has an unknown recovery time — which matters when the system is handling organization-critical work.
Review your privacy posture and Law 25 compliance annually — regulations evolve
Quebec Law 25 is still being interpreted by regulators. The CAI publishes guidance periodically. Annual review of your PIA, privacy inventory, and incident response plan is recommended practice. If your use case or the regulatory environment changes significantly, commission a fresh legal review.

Need help with any phase? D-Central offers AI sovereignty consulting covering hardware selection, inference stack deployment, RAG pipeline architecture, and Law 25 posture — for Canadian organizations that want sovereign, fully on-premises AI. All systems run on hardware physically located in Canada; no US cloud dependency.

Frequently asked questions

How much VRAM do I need to run a local LLM?

The minimum practical entry point for a useful general-purpose model is 8 GB VRAM, which runs 7B–8B parameter models at Q4_K_M quantization (e.g., Qwen2.5 7B, Llama 3.1 8B) with a 4 K token context window. For a 14B model — noticeably more capable for complex reasoning — you need at least 10–12 GB VRAM. For 32B models, 24 GB (e.g., RTX 3090 or RTX 4090) is the minimum. The VRAM quick-reference table above gives approximate figures for the most popular open-weight models as of June 2026; use the VRAM calculator to model your specific context length and batch size.

Does running a local LLM help with Quebec Law 25 compliance?

Self-hosting your inference on hardware physically located in Quebec significantly simplifies your Law 25 compliance posture because personal data in user prompts never crosses provincial or international borders. Quebec Law 25 requires a Privacy Impact Assessment (PIA) before transferring personal data outside Quebec — a fully local deployment eliminates this cross-border transfer, removing a major compliance obligation. However, Law 25 still applies to how you store, process, and log data locally. Phase 6 of this checklist covers the key obligations. This is orientation, not legal advice — consult the Commission d’accès à l’information (CAI) du Québec or a qualified privacy lawyer for binding guidance.

What is RAG and do I need it for a local LLM?

Retrieval-Augmented Generation (RAG) lets your LLM answer questions about your internal documents without retraining or fine-tuning the model. A retrieval layer fetches relevant chunks from a local vector database at query time and passes them to the LLM as context. You need RAG if you want the model to access organization-specific knowledge — internal policies, product manuals, support history, or any documents that post-date the model’s training cutoff. If you only need a general-purpose AI assistant with no internal knowledge requirements, you can skip Phase 5 entirely. For Canadian organizations, a major sovereignty benefit of RAG is that your documents stay in your local vector store — they are never uploaded to a third-party API.

Which inference runtime should I start with — Ollama, llama.cpp, or vLLM?

For most teams starting their first local LLM deployment, Ollama is the recommended starting point: single binary install, automatic GPU detection, built-in model management, and an OpenAI-compatible REST API that your existing tools can connect to immediately. Move to llama.cpp if you need fine-grained control over quantization, context window, or CPU-GPU split for unusual hardware. Use vLLM if you are serving multiple concurrent users and need maximum throughput — vLLM’s PagedAttention architecture significantly outperforms Ollama at high concurrency but requires more setup complexity and a CUDA-capable GPU. All three are open-source and free; the choice is about operational complexity vs performance for your specific workload.

How do I prevent my local LLM server from being exposed to the internet?

The key actions are covered in Phase 7: (1) configure the inference runtime to bind to 127.0.0.1 (localhost) rather than 0.0.0.0 (all interfaces), (2) add a firewall deny rule for the inference port from external sources, and (3) if you need LAN access, put a TLS-terminating reverse proxy with API key authentication in front. By default, both Ollama and llama.cpp server bind to all interfaces — you must actively restrict this. Verify with an external port scan after hardening.

Can I run a local LLM entirely offline, with no internet connection after setup?

Yes. Once your inference runtime, model weights, and RAG index are downloaded, the system requires no internet connection for inference. The only components that need internet are: (a) model downloads during initial setup, (b) embedding model downloads, and (c) any external document sources you configure for RAG ingestion. For high-security environments, you can download everything on a connected machine and transfer via an internal network or physical media — this is called an “air-gap” deployment. See Phase 7, item 5 for the Ollama air-gap configuration. This is a strong fit for organizations where internet access from the AI server represents a security or sovereignty risk.

Related products, repair, and setup paths

Last reviewed July 24, 2026.