Private RAG for Canadian Businesses: Law 25 Section 17 Risk & Local Stack Guide

The short answer: Cloud RAG feeds your business documents to a remote model — making every retrieved chunk a potential cross-border transfer of personal information subject to Quebec Law 25 Section 17. A local RAG stack (Open WebUI + a vector database such as ChromaDB or Qdrant + a locally-served open-weight model) keeps documents, embeddings, and inference on hardware you own. No transfer occurs, so the cross-border compliance question becomes much simpler. This page explains the risk, walks through the architecture, and maps models to hardware tiers. It is orientation, not legal advice — confirm anything specific with your counsel or the Commission d’accès à l’information (CAI).

Retrieval-Augmented Generation has moved from research experiment to production tool faster than most compliance frameworks expected. Firms that would never paste a client file directly into ChatGPT are now feeding those same files into cloud RAG pipelines without realizing the risk profile is identical — and in some ways sharper, because RAG systems are designed to surface the most relevant content automatically.

For Canadian businesses, particularly those operating under Quebec Law 25, the question is not whether to use RAG. It is where the RAG stack runs.

What RAG is and why it changes the compliance picture

Retrieval-Augmented Generation is a two-stage AI architecture. In the first stage, your documents are split into chunks, converted into numerical vectors by an embedding model, and stored in a vector database. When a user asks a question, the same embedding model converts the query into a vector, the database finds the most semantically similar chunks, and those chunks are injected into the language model’s prompt as context. The model then generates an answer grounded in your actual documents rather than relying on what it learned during training.

RAG is powerful precisely because the model sees your documents at inference time. That is also what makes the compliance picture different from using a general AI assistant with no document context.

RAG component	Cloud RAG — what leaves your walls	Local RAG — what stays in-house
Document ingestion	Documents uploaded to a cloud service for chunking and embedding	Chunked and embedded on your own machine; originals never leave
Embeddings	Sent to a cloud embedding API; chunks may be logged	Generated locally (e.g., nomic-embed-text via Ollama); no API call
Vector store	Managed cloud database; jurisdiction depends on vendor	ChromaDB, Qdrant, pgvector, or Weaviate on your server
Retrieved chunks in prompt	Real document excerpts sent to remote model at inference time	Passed only to the local model; never transmitted
LLM inference	Remote API; subject to provider ToS and applicable law	Runs on local hardware; no third-party API call

Why cloud RAG triggers Quebec Law 25 Section 17

Orientation, not legal advice. The Commission d’accès à l’information (CAI) is the authoritative source. Confirm your specific situation with counsel.

Section 17 of Quebec’s Act respecting the protection of personal information in the private sector (Loi sur la protection des renseignements personnels dans le secteur privé, RLRQ c P-39.1), as modernized by Act 25 (2021, c. 25), requires organizations to conduct a Privacy Impact Assessment (PIA) before communicating personal information outside Quebec. The PIA must assess whether the receiving jurisdiction offers protection equivalent to Quebec’s standards; where it does not, additional contractual safeguards are required.

Three elements of cloud RAG activate this provision:

Documents containing personal information are chunks in your RAG corpus. Client contracts, HR files, patient notes, financial statements — any document with a person’s name, identity, or financial data is personal information under the Act. When those documents are ingested by a cloud RAG service, they cross the provincial border.
Embedding API calls transmit document excerpts. Cloud embedding APIs (including those offered by major US providers) receive document chunks to produce vector representations. Those chunks may contain personal information. The API call is a transfer.
Retrieved chunks appear in the inference prompt. At query time, the RAG system injects the most relevant document excerpts into the prompt sent to the remote model. Those excerpts are personal information in motion, often to a US jurisdiction governed by the CLOUD Act.

The US CLOUD Act of 2018 allows US authorities, with appropriate legal process, to compel a US-based provider to produce data in its custody or control, even when that data physically sits in a Canadian facility. A Quebec-resident server operated by a US company does not resolve the Section 17 question — legal control and physical location are separate concepts. See The US CLOUD Act and Canadian AI Data for a full breakdown.

The practical upshot: if your RAG pipeline touches a US cloud at any stage — ingestion, embedding, or inference — you are communicating personal information outside Quebec and Section 17 applies. A local RAG stack eliminates all three transfer points.

Local RAG architecture: the sovereign stack

A fully local RAG system has five components, all of which can run on a single server or workstation inside your network. The open-source projects that make this possible — Open WebUI, Ollama, ChromaDB, Qdrant, pgvector, llama.cpp — deserve full credit; sovereign AI for ordinary businesses exists because of their work.

Local RAG pipeline (data flow)

Document ingestion: Upload PDFs, Word files, or plain text to Open WebUI. The system chunks documents automatically (configurable chunk size and overlap).
Local embedding: Chunks are converted to vectors by a local embedding model served via Ollama (e.g., nomic-embed-text, mxbai-embed-large). No API call; all computation on your hardware.
Vector store: Embeddings are written to your chosen local vector database (ChromaDB by default in Open WebUI; Qdrant, Milvus, pgvector, or Weaviate via configuration).
Query and retrieval: User question → local embedding → similarity search in vector DB → top-k most relevant chunks returned.
Generation: Retrieved chunks + user question assembled into a prompt; local LLM (via Ollama) generates the answer. Nothing sent externally.

The entire pipeline — ingestion, embedding, storage, retrieval, generation — runs on hardware in your facility. Prompts, document content, and answers never cross your network perimeter.

Open WebUI: the recommended local RAG interface

Open WebUI (formerly Ollama WebUI, open-webui/open-webui) is an open-source, self-hosted web interface for running local LLMs. It ships with built-in RAG support — document upload, chunking, embedding, and retrieval are all handled in the UI without additional configuration tools. It is MIT-licensed; we credit it here as the project that made local RAG accessible to non-engineers.

Key Open WebUI RAG capabilities (as of June 2026 — verify against the project’s official documentation for current status):

Document collections: Upload PDFs, DOCX, TXT, and more; group them into named collections per project or client.
Local embedding models: Configurable to use any Ollama-served embedding model; no cloud embedding API required.
Vector store backends: ChromaDB (default, zero-config), Qdrant, Milvus, Weaviate, OpenSearch, pgvector — selectable in the admin settings.
Web search RAG: Optional integration with local search (can be disabled entirely for air-gapped deployments).
Multi-user access control: Users and roles; documents scoped per user or shared; full audit log of who queried what.
OpenAI-compatible API: Existing applications that send queries to an OpenAI-format endpoint can be re-pointed to Open WebUI without code changes.

Vector database comparison for local RAG

Vector DB	Licence	Setup effort	Open WebUI native	Best for	Scalability ceiling
ChromaDB chroma-core/chroma	Apache 2.0	Minimal — built into Open WebUI; no separate service needed for single-node	Default	Getting started; single-user or small team; <1M document chunks	Moderate — suitable for most SMB deployments
Qdrant qdrant/qdrant	Apache 2.0	Low — Docker container; configure URL in Open WebUI settings	Yes (v1.3+)	Production deployments; metadata filtering; multi-collection; 1M+ chunks	High — designed for production-grade vector search
pgvector pgvector/pgvector	PostgreSQL (permissive)	Low if PostgreSQL already deployed; otherwise medium	Yes (via config)	Organizations already running PostgreSQL; SQL joins on metadata; relational data alongside vectors	Good — scales with PostgreSQL; large corpora benefit from HNSW indexing
Weaviate weaviate/weaviate	BSD 3-Clause	Medium — Kubernetes or Docker Compose; schema definition required	Yes (via config)	Semantic search with rich object schemas; multi-modal; larger teams	High — enterprise use; more complex to operate
Milvus milvus-io/milvus	Apache 2.0	High — distributed cluster architecture; etcd + MinIO dependencies	Yes (via config)	Billion-scale vector collections; enterprise hashcenter deployments	Very high — designed for massive scale; overkill for most SMBs

Source: Open WebUI documentation (docs.openwebui.com, June 2026); individual project GitHub repositories. Features and integration status change with each release — verify against current docs before deploying.

The practical recommendation for most Canadian SMBs: start with ChromaDB (zero additional setup) and migrate to Qdrant when your document corpus exceeds roughly 500,000 chunks or when you need advanced metadata filtering. Both are Apache 2.0-licensed and fully self-hostable.

Best local embedding models for sovereign RAG

The embedding model converts your document chunks and user queries into vectors. For a fully sovereign RAG stack, the embedding model must also run locally — using a cloud embedding API reintroduces the same cross-border transfer risk you are trying to avoid. The following models are served via Ollama and are therefore zero-config in Open WebUI. VRAM figures are approximate and as of June 2026; verify against the current Ollama model library.

Embedding model	Approx. memory	Vector dimensions	Best for
nomic-embed-text nomic-ai, Apache 2.0	~275 MB RAM (no GPU required)	768	Default choice; CPU-only hardware; fast on any machine
mxbai-embed-large mixedbread-ai, Apache 2.0	~670 MB RAM	1,024	Higher-quality retrieval; outperforms nomic on MTEB benchmarks; slight CPU overhead
all-minilm sentence-transformers, Apache 2.0	~45 MB RAM	384	Extremely lightweight; limited hardware; lower retrieval quality at higher dimension counts

Memory figures approximate; sourced from Ollama model library (June 2026) and respective HuggingFace model cards. Embedding models run on CPU by default and do not require GPU VRAM in most Ollama configurations. Verify current figures before planning.

Best LLMs for local RAG by hardware tier

RAG changes the model-selection calculus relative to general assistant use. Because the model receives document context at inference time, raw world-knowledge capacity matters less than instruction-following quality, context window size, and the ability to faithfully synthesize retrieved content without hallucinating. Smaller models tuned for instruction-following can outperform larger general models in RAG tasks. All VRAM figures are for model weights only (approximate); plan 10–20% additional headroom for KV cache and runtime overhead. For complete hardware specifications, see the Local AI Hardware Guide and the VRAM Calculator.

Model	Quant	Approx. VRAM (weights)	Context window	RAG verdict	Hardware tier
Gemma 4 E4B QAT Google DeepMind, 2025	QAT/INT4	~5–6 GB	128K tokens	Strong for size. Excellent instruction-following; good synthesis of retrieved chunks. Recommended entry-level RAG model.	8 GB VRAM workstation
Mistral Nemo (12B) Mistral AI, 2024 — Apache 2.0	Q4 (GGUF)	~7 GB	128K tokens	Consistently strong RAG across document types. Trained with retrieval tasks in mind; good at citing sources.	8–12 GB VRAM
Qwen3-27B Alibaba Cloud, 2025 — Apache 2.0	Q4 (GGUF)	~17 GB	128K tokens	Excellent. Strong at long-document RAG, legal and financial text, and multi-document synthesis. Recommended for professional services.	24 GB VRAM workstation
Llama 4 Scout Meta, 2025 — Custom licence (Meta Llama 4); 109B total (MoE)	INT4	~55 GB	10M tokens practical RAG context: varies with hardware	Outstanding context window makes it suited for massive document corpora. Enterprise multi-user RAG deployments.	48–80 GB VRAM (tight at 48)
Qwen3-72B Alibaba Cloud, 2025 — Apache 2.0	Q4 (GGUF)	~43 GB	128K tokens	Top-tier retrieval synthesis quality. Legal, regulatory, and financial RAG at production scale. Multi-user with vLLM.	80 GB VRAM (H100-class)

VRAM figures approximate; sourced from HuggingFace model cards and Ollama library (June 2026). Context window figures per model cards; practical usable context depends on hardware and KV cache configuration. Llama 4 licence terms are distinct from Apache 2.0 — review Meta’s Llama 4 Community Licence before enterprise deployment. All figures are estimates; verify before purchasing.

RAG-specific model-selection guidance

Context window matters more for RAG than for general assistant use. A model with a 128K-token context window can hold far more retrieved chunks per query than one limited to 4K or 8K tokens — directly improving answer quality when your documents are long or numerous.
Instruction-following quality determines synthesis accuracy. A 7B model that follows retrieval prompts faithfully will outperform a 70B model that ignores the context and hallucinates. Gemma 4 E4B and Mistral Nemo punch above their weight class in RAG evaluations.
For professional services (legal, notary, accounting, HR): Qwen3-27B at Q4 on a 24 GB workstation is the practical sweet spot — strong enough for complex document synthesis, small enough to run on accessible hardware. See the 8-phase setup checklist for the full deployment path.
For multi-user production deployments (10+ concurrent users): Move to vLLM rather than Ollama, and size for at least an 80 GB GPU (H100 class) or a multi-GPU cluster. See AI Sovereignty Consulting for infrastructure sizing.

Minimum viable local RAG stack for a Canadian SMB

The following stack runs on a single server inside your office network. No external services. No cloud subscriptions. All components are open-source.

Stack overview — verified open-source components

Hardware: GPU workstation with ≥16 GB VRAM (≥24 GB recommended for Qwen3-27B RAG quality). See Local AI Hardware Guide for tier-by-tier specs.
OS: Ubuntu 22.04 LTS or 24.04 LTS (recommended for Ollama and Docker support).
Inference server: Ollama (MIT licence) — serves both the LLM and the embedding model.
RAG interface: Open WebUI (MIT licence) — document upload, chunking, retrieval, and multi-user web UI in a single Docker container.
Vector database: ChromaDB (built-in, zero-config) for ≤500K chunks; upgrade to Qdrant (Apache 2.0) when your corpus grows or you need metadata filtering.
Embedding model: nomic-embed-text (Apache 2.0) via Ollama — CPU-only, no VRAM consumed.
LLM: Qwen3-27B at Q4 (Apache 2.0) via Ollama — 24 GB VRAM; strong RAG synthesis quality for professional services.
Access control: Open WebUI built-in user/role system; place behind your existing VPN or firewall rather than exposing to the public internet.

For the full eight-phase deployment walkthrough — from hardware selection through model download, Open WebUI configuration, network hardening, and governance documentation — see the Local LLM Setup Checklist. For the Canadian-specific legal and compliance orientation, see Quebec Law 25 and AI: Why On-Premise LLMs Win. For hardware specifications and tier matching, see Local AI Hardware Guide.

What a Law 25 Section 17 compliance posture looks like

Orientation only, not legal advice. The CAI is the authoritative source on Law 25 compliance. These are structural observations, not a legal opinion.

A Privacy Impact Assessment under Section 17 is not required if there is no communication of personal information outside Quebec. A fully local RAG stack — where ingestion, embedding, vector storage, retrieval, and inference all run on hardware inside Quebec — removes the trigger condition entirely. You may still need:

A policy describing how the RAG system is used and who has access (accountability requirement).
Access logs showing which users queried which document collections (auditability).
Patch and update procedures for the software stack (security obligation).
Data minimization practices — don’t ingest personal information that doesn’t need to be in the RAG corpus.

When cloud RAG is used for non-personal-information tasks, document the decision: what data is involved, why this provider, which jurisdiction applies, and what your assessment concluded. The CAI guidance on PIAs is available at cai.gouv.qc.ca. For the broader intersection of cloud AI and Law 25, see Quebec Law 25 and AI: Why On-Premise LLMs Win and The US CLOUD Act and Canadian AI Data.

Total cost of ownership: local RAG vs cloud RAG

Cloud RAG pricing typically combines per-token costs for the embedding API, per-token costs for LLM inference, and storage costs for the managed vector database. At scale, these three metering lines compound. A team of 20 running a cloud RAG tool for 8 hours a day, five days a week, may find that a one-time hardware purchase amortizes in under 24 months — with zero incremental cost per query thereafter and no per-user seat fees. For a structured side-by-side breakdown, see Cloud vs Local AI: Total Cost of Ownership.

The compliance cost of cloud RAG is harder to meter but real: PIA preparation, legal review of vendor contracts, ongoing monitoring of jurisdictional changes, and the operational risk of a service change or suspension. These costs have no local RAG equivalent.

Frequently asked questions

Does RAG automatically make cloud AI Law 25-compliant?

No — RAG can make cloud AI more accurate, but it does not address the data-transfer risk. With cloud RAG, your documents are sent to a remote embedding API, stored in a cloud vector database, and injected into prompts processed by a remote model. Each step is a potential communication of personal information outside Quebec under Section 17. The compliance question turns on where each component runs and who legally controls that infrastructure, not on whether a retrieval layer is present. This is orientation; confirm your specific situation with counsel or the CAI.

Can I use a Canadian cloud provider for RAG and avoid Law 25 Section 17?

Physical location and legal control are separate questions. A Canadian data centre operated by a US company may still expose data to US CLOUD Act jurisdiction — legal compulsion can reach data held by US entities regardless of where the servers sit. A Privacy Impact Assessment under Section 17 assesses the “legal framework” of the receiving jurisdiction, not just the physical location. A local RAG stack sidesteps this analysis entirely: there is no cross-border communication to assess because there is no cross-border communication. See The US CLOUD Act and Canadian AI Data for the full jurisdictional analysis.

What is the minimum hardware for a useful local RAG system?

For a single user, a machine with 16 GB of system RAM and a CPU-only setup can run nomic-embed-text (embedding, ~275 MB RAM) and a quantized 7B model (approximately 5–6 GB RAM in GGUF format) via Ollama, with ChromaDB for the vector store. Performance will be slow for inference (CPU only), but the stack is functional. For a team of 2–10 users with acceptable response times, a workstation with a 24 GB GPU running Qwen3-27B at Q4 via Ollama + Open WebUI is the recommended entry point. VRAM figures are approximate; verify against the VRAM Calculator for your specific model and context window target.

Does the embedding model also need to run locally?

Yes, for a fully sovereign stack. If you use a cloud embedding API (e.g., OpenAI’s text-embedding-3-small), your document chunks are transmitted to a remote server for conversion — the same cross-border transfer risk as cloud LLM inference. Local embedding models like nomic-embed-text and mxbai-embed-large, served via Ollama, are CPU-only and add negligible load. There is no practical reason to use a cloud embedding API when building a locally-sovereign RAG stack.

Can Open WebUI connect to an existing document management system?

Open WebUI supports document upload through its UI and REST API; documents can be pushed programmatically from other systems via the API. Native connectors to SharePoint, Notion, or similar DMS platforms are not built-in as of June 2026 — check the current Open WebUI documentation for the current integration status. Structured ingest pipelines (e.g., a script that exports files from your DMS nightly and pushes them to Open WebUI’s API) are a common pattern.

Is Open WebUI suitable for a law firm or medical clinic?

Open WebUI provides user accounts, role-based access control, and audit logging — the governance primitives a regulated professional environment needs. The software itself is open-source and self-hostable; your data never leaves your premises. However, Open WebUI is infrastructure, not a compliance certification. You are responsible for configuring access controls correctly, securing the host machine, managing backups, and documenting your AI governance policy. Whether this meets your specific regulatory requirements is a question for your compliance officer and counsel, not a software vendor. The combination of local RAG + Open WebUI + a proper access control policy puts you in a structurally stronger position than cloud RAG — but “in a better position” is not the same as “compliant.”

How does local RAG compare to a cloud provider that offers Quebec data residency?

A cloud provider offering Quebec-resident infrastructure reduces (but may not eliminate) the cross-border transfer risk for stored data. However, Law 25 Section 17’s Privacy Impact Assessment requirement concerns the legal framework of where data is communicated to — including who holds legal control, not only physical location. Additionally, cloud AI service terms can change, services can be suspended, and pricing can increase. Local RAG eliminates all of these variables. For a structured comparison, see Cloud AI Provider Comparison.

What is the difference between RAG and fine-tuning for private document use?

Fine-tuning bakes knowledge from your documents into the model’s weights through additional training. It is compute-intensive, requires technical expertise, and does not handle real-time document updates well — a fine-tuned model is static until you retrain it. RAG retrieves relevant excerpts at inference time without modifying the model; adding a new document to the corpus is as simple as uploading it. For most business RAG use cases — contract review, HR policy Q&A, client file research — RAG is the appropriate architecture. Fine-tuning is better suited to instilling a consistent writing style, domain-specific terminology, or a specialized task format that doesn’t change frequently.

Build your private RAG stack — properly

D-Central designs and ships local AI infrastructure for Canadian organizations that need RAG without the cross-border exposure. From hardware sizing through Open WebUI deployment to access control and governance documentation — we handle the architecture so your data stays where it belongs. Engagements are quoted individually.

Request a local RAG consultation →

Related resources

Related products, repair, and setup paths

Last reviewed June 15, 2026.