Definition
A data lake is a centralized, highly scalable repository that stores vast volumes of raw data — structured, semi-structured, and unstructured — in its native format. Unlike a traditional database, it imposes no schema at write time. Logs, sensor streams, images, JSON, and tabular exports all land side by side, untransformed, and structure is applied only later when the data is read for a specific purpose.
Schema-on-read versus schema-on-write
The defining property is schema-on-read. A data warehouse cleans and structures data before storing it (schema-on-write), which makes queries fast but forces decisions up front about how data will be used. A data lake inverts this: ingest everything cheaply and decide structure at query time. That flexibility is ideal when you do not yet know every future use — exactly the situation in machine-learning research, where today's irrelevant field is tomorrow's critical feature.
Power and risk
The cost of that flexibility is governance. Without discipline — cataloging, lineage, and quality checks — a data lake degrades into a "data swamp" where nobody trusts or can find anything. For AI work, the lake is usually the staging ground that holds raw inputs before they are filtered and labeled. For a sovereignty-minded builder, a self-hosted lake also means your training corpus lives on infrastructure you control rather than a third-party platform.
Raw data typically arrives in a lake through a data pipeline / ETL process, and curated subsets are later turned into engineered inputs in a feature store.
In Simple Terms
A data lake is a centralized, highly scalable repository that stores vast volumes of raw data — structured, semi-structured, and unstructured — in its native…
