Private RAG for Canadian Businesses: Law 25 Section 17 Risk & Local Stack Guide

The short answer: Cloud RAG feeds your business documents to a remote model — making every retrieved chunk a potential cross-border transfer of personal information subject to Quebec Law 25 Section 17. A local RAG stack (Open WebUI + a vector database such as ChromaDB or Qdrant + a locally-served open-weight model) keeps documents, embeddings, and inference on hardware you own. No transfer occurs, so the cross-border compliance question becomes much simpler. This page explains the risk, walks through the architecture, and maps models to hardware tiers. It is orientation, not legal advice — confirm anything specific with your counsel or the Commission d’accès à l’information (CAI).

Retrieval-Augmented Generation has moved from research experiment to production tool faster than most compliance frameworks expected. Firms that would never paste a client file directly into ChatGPT are now feeding those same files into cloud RAG pipelines without realizing the risk profile is identical — and in some ways sharper, because RAG systems are designed to surface the most relevant content automatically.

For Canadian businesses, particularly those operating under Quebec Law 25, the question is not whether to use RAG. It is where the RAG stack runs.

What RAG is and why it changes the compliance picture

Retrieval-Augmented Generation is a two-stage AI architecture. In the first stage, your documents are split into chunks, converted into numerical vectors by an embedding model, and stored in a vector database. When a user asks a question, the same embedding model converts the query into a vector, the database finds the most semantically similar chunks, and those chunks are injected into the language model’s prompt as context. The model then generates an answer grounded in your actual documents rather than relying on what it learned during training.
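The two stages can be illustrated with a deliberately tiny sketch. It is not how a production stack embeds text: the `embed` function below is a toy bag-of-words stand-in for a real embedding model, and the sample chunks, function names, and prompt wording are all hypothetical. What it does show is the shape of the flow: index chunks, embed the query the same way, rank by similarity, and place the winners in the prompt.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, index, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _vector in ranked[:k]]

def build_prompt(question, chunks):
    """Inject the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Stage 1: index the (hypothetical) document chunks.
chunks = [
    "Employees accrue fifteen vacation days per year.",
    "Invoices are payable within thirty days of receipt.",
    "The office closes at five on Fridays.",
]
index = [(c, embed(c)) for c in chunks]

# Stage 2: embed the question, retrieve, and assemble the prompt.
question = "How many vacation days do employees get?"
top = retrieve(question, index, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Note that the retrieved chunk appears verbatim in the prompt. Whatever system receives that prompt receives the document excerpt.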

RAG is powerful precisely because the model sees your documents at inference time. That is also what makes the compliance picture different from using a general AI assistant with no document context.

| RAG component | Cloud RAG: what leaves your walls | Local RAG: what stays in-house |
| --- | --- | --- |
| Document ingestion | Documents uploaded to a cloud service for chunking and embedding | Chunked and embedded on your own machine; originals never leave |
| Embeddings | Sent to a cloud embedding API; chunks may be logged | Generated locally (e.g., nomic-embed-text via Ollama); no API call |
| Vector store | Managed cloud database; jurisdiction depends on vendor | ChromaDB, Qdrant, pgvector, or Weaviate on your server |
| Retrieved chunks in prompt | Real document excerpts sent to remote model at inference time | Passed only to the local model; never transmitted |
| LLM inference | Remote API; subject to provider ToS and applicable law | Runs on local hardware; no third-party API call |

Why cloud RAG triggers Quebec Law 25 Section 17

Orientation, not legal advice. The Commission d’accès à l’information (CAI) is the authoritative source. Confirm your specific situation with counsel.

Section 17 of Quebec’s Act respecting the protection of personal information in the private sector (Loi sur la protection des renseignements personnels dans le secteur privé, RLRQ c P-39.1), as modernized by Act 25 (2021, c. 25), requires organizations to conduct a Privacy Impact Assessment (PIA) before communicating personal information outside Quebec. The PIA must assess whether the information would receive adequate protection in the receiving jurisdiction, in particular in light of generally recognized data-protection principles. The communication must also be covered by a written agreement that reflects the assessment’s findings and any measures agreed on to mitigate the risks it identified.

Three elements of cloud RAG activate this provision:

  1. Documents containing personal information become chunks in your RAG corpus. Client contracts, HR files, patient notes, financial statements: any document with a person’s name, identity, or financial data contains personal information under the Act. When those documents are ingested by a cloud RAG service hosted outside Quebec, they cross the provincial border.
  2. Embedding API calls transmit document excerpts. Cloud embedding APIs (including those offered by major US providers) receive document chunks to produce vector representations. Those chunks may contain personal information. The API call is a transfer.
  3. Retrieved chunks appear in the inference prompt. At query time, the RAG system injects the most relevant document excerpts into the prompt sent to the remote model. Those excerpts are personal information in motion, often to a US jurisdiction governed by the CLOUD Act.

The US CLOUD Act of 2018 allows US authorities, with appropriate legal process, to compel a US-based provider to produce data in its custody or control, even when that data physically sits in a Canadian facility. A Quebec-resident server operated by a US company does not resolve the Section 17 question — legal control and physical location are separate concepts. See The US CLOUD Act and Canadian AI Data for a full breakdown.

The practical upshot: if your RAG pipeline touches a US cloud at any stage — ingestion, embedding, or inference — you are communicating personal information outside Quebec and Section 17 applies. A local RAG stack eliminates all three transfer points.
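One way to make the three transfer points concrete is to audit the endpoints your pipeline is configured to call. The sketch below is illustrative only: the `pipeline` dictionary, its URLs, and the function names are hypothetical, and it treats any DNS hostname as external until someone verifies it. A loopback or private address shows only that traffic stays on a private network; it says nothing about where that network is located or who controls it.

```python
import ipaddress
from urllib.parse import urlparse

def is_local_endpoint(url):
    """True when the host is 'localhost' or a loopback/private IP literal."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: treat as external until its destination is verified.
        return False
    return addr.is_loopback or addr.is_private

def external_stages(pipeline):
    """Names of pipeline stages whose endpoint is not recognizably local."""
    return sorted(stage for stage, url in pipeline.items() if not is_local_endpoint(url))

# Hypothetical configuration mixing local services with one remote API.
pipeline = {
    "embedding": "http://localhost:11434/api/embed",
    "vector_store": "http://10.0.0.12:6333",
    "inference": "https://api.example-llm.com/v1/chat",
}
flagged = external_stages(pipeline)
print("Stages to review:", flagged)
```

In this example only the inference stage is flagged, which is enough to put retrieved document excerpts on the wire.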

Local RAG architecture: the sovereign stack

A fully local RAG system has five components, all of which can run on a single server or workstation inside your network. The open-source projects that make this possible — Open WebUI, Ollama, ChromaDB, Qdrant, pgvector, llama.cpp — deserve full credit; sovereign AI for ordinary businesses exists because of their work.

Local RAG pipeline (data flow)

  1. Document ingestion: Upload PDFs, Word files, or plain text to Open WebUI. The system chunks documents automatically (configurable chunk size and overlap).
  2. Local embedding: Chunks are converted to vectors by a local embedding model served via Ollama (e.g., nomic-embed-text, mxbai-embed-large). No API call; all computation on your hardware.
  3. Vector store: Embeddings are written to your chosen local vector database (ChromaDB by default in Open WebUI; Qdrant, Milvus, pgvector, or Weaviate via configuration).
  4. Query and retrieval: User question → local embedding → similarity search in vector DB → top-k most relevant chunks returned.
  5. Generation: Retrieved chunks + user question assembled into a prompt; local LLM (via Ollama) generates the answer. Nothing sent externally.

The entire pipeline — ingestion, embedding, storage, retrieval, generation — runs on hardware in your facility. Prompts, document content, and answers never cross your network perimeter.
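For readers who want to see the moving parts outside the UI, here is a minimal sketch of steps 1, 2, and 5 against a local Ollama server, using only the Python standard library. Several things are assumptions rather than facts from this guide: the word-window chunker is a simplification of what Open WebUI does internally, the `mistral-nemo` model tag and the prompt wording are placeholders, and the server is assumed to listen on Ollama's default port 11434. Nothing is sent until `post_json` is called.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address; adjust if yours differs

def chunk_text(text, size=200, overlap=40):
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed_payload(text, model="nomic-embed-text"):
    """Request body for Ollama's /api/embed endpoint."""
    return {"model": model, "input": text}

def generate_payload(question, chunks, model="mistral-nemo"):
    """Request body for Ollama's /api/generate endpoint, with retrieved chunks as context."""
    context = "\n\n".join(chunks)
    prompt = f"Use only the context below to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
    return {"model": model, "prompt": prompt, "stream": False}

def post_json(path, payload):
    """POST a JSON body to the local Ollama server and decode the JSON reply."""
    req = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With Ollama running and both models pulled, the calls would look like:
# vector = post_json("/api/embed", embed_payload("some chunk"))["embeddings"][0]
# answer = post_json("/api/generate", generate_payload("question", ["chunk"]))["response"]
```

Because every request targets `localhost`, the chunk text and the assembled prompt stay on the machine running Ollama.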

Open WebUI: the recommended local RAG interface

Open WebUI (formerly Ollama WebUI, open-webui/open-webui) is an open-source, self-hosted web interface for running local LLMs. It ships with built-in RAG support: document upload, chunking, embedding, and retrieval are all handled in the UI without additional configuration tools. Early releases were MIT-licensed, but the project’s licence terms have changed across releases, so review the current LICENSE file before deploying. We credit it here as the project that made local RAG accessible to non-engineers.

Key Open WebUI RAG capabilities (as of June 2026 — verify against the project’s official documentation for current status):

Vector database comparison for local RAG

| Vector DB | Licence | Setup effort | Open WebUI native | Best for | Scalability ceiling |
| --- | --- | --- | --- | --- | --- |
| ChromaDB (chroma-core/chroma) | Apache 2.0 | Minimal: built into Open WebUI; no separate service needed for single-node | Default | Getting started; single-user or small team; <1M document chunks | Moderate: suitable for most SMB deployments |
| Qdrant (qdrant/qdrant) | Apache 2.0 | Low: Docker container; configure URL in Open WebUI settings | Yes (v1.3+) | Production deployments; metadata filtering; multi-collection; 1M+ chunks | High: designed for production-grade vector search |
| pgvector (pgvector/pgvector) | PostgreSQL (permissive) | Low if PostgreSQL already deployed; otherwise medium | Yes (via config) | Organizations already running PostgreSQL; SQL joins on metadata; relational data alongside vectors | Good: scales with PostgreSQL; large corpora benefit from HNSW indexing |
| Weaviate (weaviate/weaviate) | BSD 3-Clause | Medium: Kubernetes or Docker Compose; schema definition required | Yes (via config) | Semantic search with rich object schemas; multi-modal; larger teams | High: enterprise use; more complex to operate |
| Milvus (milvus-io/milvus) | Apache 2.0 | High: distributed cluster architecture; etcd + MinIO dependencies | Yes (via config) | Billion-scale vector collections; enterprise data centre deployments | Very high: designed for massive scale; overkill for most SMBs |

Source: Open WebUI documentation (docs.openwebui.com, June 2026); individual project GitHub repositories. Features and integration status change with each release — verify against current docs before deploying.

The practical recommendation for most Canadian SMBs: start with ChromaDB (zero additional setup) and migrate to Qdrant when your document corpus exceeds roughly 500,000 chunks or when you need advanced metadata filtering. Both are Apache 2.0-licensed and fully self-hostable.

Best local embedding models for sovereign RAG

The embedding model converts your document chunks and user queries into vectors. For a fully sovereign RAG stack, the embedding model must also run locally, because a cloud embedding API reintroduces the same cross-border transfer risk you are trying to avoid. The following models are served via Ollama and are therefore zero-config in Open WebUI. Memory figures are approximate and as of June 2026; verify against the current Ollama model library.

| Embedding model | Approx. memory | Vector dimensions | Best for |
| --- | --- | --- | --- |
| nomic-embed-text (nomic-ai, Apache 2.0) | ~275 MB RAM (no GPU required) | 768 | Default choice; CPU-only hardware; fast on any machine |
| mxbai-embed-large (mixedbread-ai, Apache 2.0) | ~670 MB RAM | 1,024 | Higher-quality retrieval; outperforms nomic on MTEB benchmarks; slight CPU overhead |
| all-minilm (sentence-transformers, Apache 2.0) | ~45 MB RAM | 384 | Extremely lightweight; limited hardware; lower retrieval quality than the higher-dimension models above |

Memory figures approximate; sourced from Ollama model library (June 2026) and respective HuggingFace model cards. Embedding models run on CPU by default and do not require GPU VRAM in most Ollama configurations. Verify current figures before planning.

Best LLMs for local RAG by hardware tier

RAG changes the model-selection calculus relative to general assistant use. Because the model receives document context at inference time, raw world-knowledge capacity matters less than instruction-following quality, context window size, and the ability to faithfully synthesize retrieved content without hallucinating. Smaller models tuned for instruction-following can outperform larger general models in RAG tasks. All VRAM figures are for model weights only (approximate); plan 10–20% additional headroom for KV cache and runtime overhead. For complete hardware specifications, see the Local AI Hardware Guide and the VRAM Calculator.

| Model | Quant | Approx. VRAM (weights) | Context window | RAG verdict | Hardware tier |
| --- | --- | --- | --- | --- | --- |
| Gemma 4 E4B QAT (Google DeepMind, 2025) | QAT/INT4 | ~5–6 GB | 128K tokens | Strong for size. Excellent instruction-following; good synthesis of retrieved chunks. Recommended entry-level RAG model. | 8 GB VRAM workstation |
| Mistral Nemo (12B) (Mistral AI, 2024; Apache 2.0) | Q4 (GGUF) | ~7 GB | 128K tokens | Consistently strong RAG across document types. Trained with retrieval tasks in mind; good at citing sources. | 8–12 GB VRAM |
| Qwen3-27B (Alibaba Cloud, 2025; Apache 2.0) | Q4 (GGUF) | ~17 GB | 128K tokens | Excellent. Strong at long-document RAG, legal and financial text, and multi-document synthesis. Recommended for professional services. | 24 GB VRAM workstation |
| Llama 4 Scout (Meta, 2025; custom Meta Llama 4 licence; 109B total, MoE) | INT4 | ~55 GB | 10M tokens (practical RAG context varies with hardware) | Outstanding context window makes it suited for massive document corpora. Enterprise multi-user RAG deployments. | 48–80 GB VRAM (at 48 GB the ~55 GB of weights need partial CPU offload) |
| Qwen3-72B (Alibaba Cloud, 2025; Apache 2.0) | Q4 (GGUF) | ~43 GB | 128K tokens | Top-tier retrieval synthesis quality. Legal, regulatory, and financial RAG at production scale. Multi-user with vLLM. | 80 GB VRAM (H100-class) |

VRAM figures approximate; sourced from HuggingFace model cards and Ollama library (June 2026). Context window figures per model cards; practical usable context depends on hardware and KV cache configuration. Llama 4 licence terms are distinct from Apache 2.0 — review Meta’s Llama 4 Community Licence before enterprise deployment. All figures are estimates; verify before purchasing.
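The headroom rule above (weights plus 10–20% for KV cache and runtime overhead) is simple enough to check in a few lines. This sketch uses the approximate weight sizes from the table and a flat 20% margin; the function names are hypothetical, and a flat percentage is a rough planning heuristic, since real KV-cache use grows with context length and concurrent users.

```python
def required_vram_gb(weights_gb, headroom=0.20):
    """Weights plus a flat headroom fraction for KV cache and runtime overhead."""
    return weights_gb * (1 + headroom)

def fits(weights_gb, gpu_vram_gb, headroom=0.20):
    """True when weights plus headroom fit within the GPU's VRAM."""
    return required_vram_gb(weights_gb, headroom) <= gpu_vram_gb

# Approximate weight sizes (GB) taken from the table above.
models = {
    "Mistral Nemo 12B Q4": 7,
    "Qwen3-27B Q4": 17,
    "Llama 4 Scout INT4": 55,
    "Qwen3-72B Q4": 43,
}

for gpu_gb in (12, 24, 48, 80):
    fitting = [name for name, gb in models.items() if fits(gb, gpu_gb)]
    print(f"{gpu_gb} GB GPU: {', '.join(fitting) or 'none'}")

fitting_24 = [name for name, gb in models.items() if fits(gb, 24)]
```

Treat the result as a first filter only; the VRAM Calculator and the model cards remain the place to confirm a purchase.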

RAG-specific model-selection guidance

Minimum viable local RAG stack for a Canadian SMB

The following stack runs on a single server inside your office network. No external services. No cloud subscriptions. All components are open-source.

Stack overview — verified open-source components

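  • Hardware: GPU workstation with ≥24 GB VRAM for the Qwen3-27B configuration listed below (about 17 GB of Q4 weights plus KV-cache headroom); a 16 GB card is limited to smaller models such as Mistral Nemo 12B. See Local AI Hardware Guide for tier-by-tier specs.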
  • OS: Ubuntu 22.04 LTS or 24.04 LTS (recommended for Ollama and Docker support).
  • Inference server: Ollama (MIT licence) — serves both the LLM and the embedding model.
  • RAG interface: Open WebUI (originally MIT-licensed; check the project’s current licence terms): document upload, chunking, retrieval, and multi-user web UI in a single Docker container.
  • Vector database: ChromaDB (built-in, zero-config) for ≤500K chunks; upgrade to Qdrant (Apache 2.0) when your corpus grows or you need metadata filtering.
  • Embedding model: nomic-embed-text (Apache 2.0) via Ollama — CPU-only, no VRAM consumed.
  • LLM: Qwen3-27B at Q4 (Apache 2.0) via Ollama — 24 GB VRAM; strong RAG synthesis quality for professional services.
  • Access control: Open WebUI built-in user/role system; place behind your existing VPN or firewall rather than exposing to the public internet.
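Once the stack is installed, a quick sanity check is to confirm that Ollama actually has the embedding model and LLM you planned for. The sketch below parses the JSON shape returned by Ollama's `/api/tags` endpoint (a `models` list whose entries carry a `name` such as `nomic-embed-text:latest`). The sample response and the `example-llm:27b` tag are hypothetical placeholders, and the comparison deliberately ignores the part after the colon, so it will not distinguish between sizes or quantizations of the same model family.

```python
import json

def missing_models(tags_response, required):
    """Return the required model names that do not appear in an /api/tags response.

    Only the base name (the part before ':') is compared.
    """
    installed = {m["name"].split(":")[0] for m in tags_response.get("models", [])}
    return [name for name in required if name.split(":")[0] not in installed]

# Hypothetical response body, trimmed to the one field this check needs.
sample_response = json.loads(
    '{"models": [{"name": "nomic-embed-text:latest"}, {"name": "mistral-nemo:latest"}]}'
)

missing = missing_models(sample_response, ["nomic-embed-text", "example-llm:27b"])
print("Missing models:", missing or "none")
```

In a live check, the response would come from a GET request to `http://localhost:11434/api/tags` on the inference server.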

For the full eight-phase deployment walkthrough — from hardware selection through model download, Open WebUI configuration, network hardening, and governance documentation — see the Local LLM Setup Checklist. For the Canadian-specific legal and compliance orientation, see Quebec Law 25 and AI: Why On-Premise LLMs Win. For hardware specifications and tier matching, see Local AI Hardware Guide.

What a Law 25 Section 17 compliance posture looks like

Orientation only, not legal advice. The CAI is the authoritative source on Law 25 compliance. These are structural observations, not a legal opinion.

A Privacy Impact Assessment under Section 17 is not required if there is no communication of personal information outside Quebec. A fully local RAG stack — where ingestion, embedding, vector storage, retrieval, and inference all run on hardware inside Quebec — removes the trigger condition entirely. You may still need:

When cloud RAG is used for non-personal-information tasks, document the decision: what data is involved, why this provider, which jurisdiction applies, and what your assessment concluded. The CAI guidance on PIAs is available at cai.gouv.qc.ca. For the broader intersection of cloud AI and Law 25, see Quebec Law 25 and AI: Why On-Premise LLMs Win and The US CLOUD Act and Canadian AI Data.

Total cost of ownership: local RAG vs cloud RAG

Cloud RAG pricing typically combines per-token costs for the embedding API, per-token costs for LLM inference, and storage costs for the managed vector database. At scale, these three metering lines compound. A team of 20 running a cloud RAG tool for 8 hours a day, five days a week, may find that a one-time hardware purchase amortizes in under 24 months — with zero incremental cost per query thereafter and no per-user seat fees. For a structured side-by-side breakdown, see Cloud vs Local AI: Total Cost of Ownership.
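The amortization claim reduces to one division, sketched below. Every figure here is a hypothetical placeholder chosen to make the arithmetic easy to follow, not a quote or a measured cost; substitute your own hardware price, cloud bill, and local running costs (power, maintenance, support).

```python
import math

def break_even_months(hardware_cost, monthly_cloud_cost, monthly_local_cost):
    """Months until cumulative cloud spend exceeds hardware plus local running costs.

    Returns None when local running costs are not lower than the cloud bill,
    because the hardware then never pays for itself.
    """
    monthly_saving = monthly_cloud_cost - monthly_local_cost
    if monthly_saving <= 0:
        return None
    return math.ceil(hardware_cost / monthly_saving)

# Hypothetical inputs: placeholder dollar amounts, not real pricing.
months = break_even_months(
    hardware_cost=18_000,
    monthly_cloud_cost=1_200,
    monthly_local_cost=200,
)
print(f"Break-even after {months} months")
```

With these placeholder inputs the hardware pays for itself in 18 months; a smaller cloud bill or higher local running costs push the break-even point out accordingly.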

The compliance cost of cloud RAG is harder to meter but real: PIA preparation, legal review of vendor contracts, ongoing monitoring of jurisdictional changes, and the operational risk of a service change or suspension. These costs have no local RAG equivalent.

Frequently asked questions

Does RAG automatically make cloud AI Law 25-compliant?

No — RAG can make cloud AI more accurate, but it does not address the data-transfer risk. With cloud RAG, your documents are sent to a remote embedding API, stored in a cloud vector database, and injected into prompts processed by a remote model. Each step is a potential communication of personal information outside Quebec under Section 17. The compliance question turns on where each component runs and who legally controls that infrastructure, not on whether a retrieval layer is present. This is orientation; confirm your specific situation with counsel or the CAI.

Can I use a Canadian cloud provider for RAG and avoid Law 25 Section 17?

Physical location and legal control are separate questions. A Canadian data centre operated by a US company may still expose data to US CLOUD Act jurisdiction — legal compulsion can reach data held by US entities regardless of where the servers sit. A Privacy Impact Assessment under Section 17 assesses the “legal framework” of the receiving jurisdiction, not just the physical location. A local RAG stack sidesteps this analysis entirely: there is no cross-border communication to assess because there is no cross-border communication. See The US CLOUD Act and Canadian AI Data for the full jurisdictional analysis.

What is the minimum hardware for a useful local RAG system?

For a single user, a machine with 16 GB of system RAM and a CPU-only setup can run nomic-embed-text (embedding, ~275 MB RAM) and a quantized 7B model (approximately 5–6 GB RAM in GGUF format) via Ollama, with ChromaDB for the vector store. Performance will be slow for inference (CPU only), but the stack is functional. For a team of 2–10 users with acceptable response times, a workstation with a 24 GB GPU running Qwen3-27B at Q4 via Ollama + Open WebUI is the recommended entry point. VRAM figures are approximate; verify against the VRAM Calculator for your specific model and context window target.

Does the embedding model also need to run locally?

Yes, for a fully sovereign stack. If you use a cloud embedding API (e.g., OpenAI’s text-embedding-3-small), your document chunks are transmitted to a remote server for conversion — the same cross-border transfer risk as cloud LLM inference. Local embedding models like nomic-embed-text and mxbai-embed-large, served via Ollama, are CPU-only and add negligible load. There is no practical reason to use a cloud embedding API when building a locally-sovereign RAG stack.

Can Open WebUI connect to an existing document management system?

Open WebUI supports document upload through its UI and REST API, so documents can be pushed programmatically from other systems. Native connectors to SharePoint, Notion, or similar DMS platforms are not built in as of June 2026; check the Open WebUI documentation for current integration status. Structured ingest pipelines (e.g., a script that exports files from your DMS nightly and pushes them to Open WebUI’s API) are a common pattern.
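A nightly ingest job of that kind might be structured as follows. This is a sketch under stated assumptions: the export directory layout, the list of file extensions, and the function names are hypothetical, and the actual upload is left as a callable you supply, since the exact Open WebUI endpoint and authentication details should come from the project's current API documentation. The demo at the bottom uses a temporary directory and a list in place of a real upload.

```python
import os
import tempfile

SUPPORTED = (".pdf", ".docx", ".txt")  # hypothetical allow-list of file types

def files_changed_since(export_dir, since_epoch):
    """List supported files under export_dir modified after since_epoch (Unix seconds)."""
    changed = []
    for root, _dirs, names in os.walk(export_dir):
        for name in sorted(names):
            path = os.path.join(root, name)
            if name.lower().endswith(SUPPORTED) and os.path.getmtime(path) > since_epoch:
                changed.append(path)
    return changed

def push_changed(export_dir, since_epoch, upload):
    """Call upload(path) for each changed file; return the paths that were pushed."""
    pushed = []
    for path in files_changed_since(export_dir, since_epoch):
        upload(path)  # in production: send the file to Open WebUI's API on your own server
        pushed.append(path)
    return pushed

# Demo with a throwaway directory; `sent.append` stands in for the real upload call.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("contract.pdf", "notes.txt", "photo.png"):
        with open(os.path.join(demo_dir, name), "w") as fh:
            fh.write("demo")
    sent = []
    push_changed(demo_dir, since_epoch=0, upload=sent.append)
    demo_names = sorted(os.path.basename(p) for p in sent)

print(demo_names)
```

Keeping the upload step injectable also makes it easy to log each pushed file, which helps when documenting what entered the RAG corpus and when.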

Is Open WebUI suitable for a law firm or medical clinic?

Open WebUI provides user accounts, role-based access control, and audit logging — the governance primitives a regulated professional environment needs. The software itself is open-source and self-hostable; your data never leaves your premises. However, Open WebUI is infrastructure, not a compliance certification. You are responsible for configuring access controls correctly, securing the host machine, managing backups, and documenting your AI governance policy. Whether this meets your specific regulatory requirements is a question for your compliance officer and counsel, not a software vendor. The combination of local RAG + Open WebUI + a proper access control policy puts you in a structurally stronger position than cloud RAG — but “in a better position” is not the same as “compliant.”

How does local RAG compare to a cloud provider that offers Quebec data residency?

A cloud provider offering Quebec-resident infrastructure reduces (but may not eliminate) the cross-border transfer risk for stored data. However, the Privacy Impact Assessment requirement in Law 25 Section 17 concerns the legal framework of the jurisdiction the data is communicated to, including who holds legal control over it and not only where the servers sit. Additionally, cloud AI service terms can change, services can be suspended, and pricing can increase. Local RAG eliminates all of these variables. For a structured comparison, see Cloud AI Provider Comparison.

What is the difference between RAG and fine-tuning for private document use?

Fine-tuning bakes knowledge from your documents into the model’s weights through additional training. It is compute-intensive, requires technical expertise, and does not handle real-time document updates well — a fine-tuned model is static until you retrain it. RAG retrieves relevant excerpts at inference time without modifying the model; adding a new document to the corpus is as simple as uploading it. For most business RAG use cases — contract review, HR policy Q&A, client file research — RAG is the appropriate architecture. Fine-tuning is better suited to instilling a consistent writing style, domain-specific terminology, or a specialized task format that doesn’t change frequently.

Build your private RAG stack — properly

D-Central designs and ships local AI infrastructure for Canadian organizations that need RAG without the cross-border exposure. From hardware sizing through Open WebUI deployment to access control and governance documentation — we handle the architecture so your data stays where it belongs. Engagements are quoted individually.
