Definition
A data pipeline is an automated workflow that moves data from where it is produced to where it is consumed, applying any cleaning, reshaping, and validation along the way. ETL — Extract, Transform, Load — is the classic three-stage pattern: pull data from source systems, convert it into a consistent usable format in a staging area, then write it to a destination such as a warehouse where analysts and models can use it. It has been a standard in data engineering for decades.
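As a rough illustration of the three stages, the sketch below runs a tiny ETL job end to end. Everything in it is an assumption made for the example: the `RAW_ROWS` records, the `orders` table, the validation rules, and the use of an in-memory SQLite database as a stand-in for a warehouse.

```python
import sqlite3

# Hypothetical source records; a real pipeline would read from an API, log, or database.
RAW_ROWS = [
    {"id": "1", "email": " Ada@Example.com ", "amount": "19.90"},
    {"id": "2", "email": "bob@example.com", "amount": "5"},
    {"id": "3", "email": "", "amount": "n/a"},  # fails validation
]


def extract():
    """Extract: pull records from the source system (here, an in-memory list)."""
    return list(RAW_ROWS)


def transform(rows):
    """Transform: clean, reshape, and validate before anything reaches the destination."""
    clean = []
    for row in rows:
        email = row["email"].strip().lower()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount is not numeric
        if not email:
            continue  # drop rows with no email address
        clean.append((int(row["id"]), email, amount))
    return clean


def load(conn, rows):
    """Load: write the already-clean rows to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, email TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run the three stages in order and return what landed in the destination."""
    load(conn, transform(extract()))
    return conn.execute("SELECT id, email, amount FROM orders ORDER BY id").fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` leaves only the two valid rows in `orders`; the invalid record never reaches the destination, which is the defining property of transforming before loading.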
ETL versus ELT
A more recent variant reorders the steps. ELT, short for Extract, Load, Transform, lands raw data directly in the destination first and transforms it later, on demand, inside that store. The core difference is where transformation happens: in a staging area before loading (ETL) or in the target system after loading (ELT). ETL suits compliance-sensitive workflows where data must be cleaned and masked before it lands; ELT pairs naturally with cheap, scalable storage and has become a common default for large-scale analytics.
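To make the reordering concrete, here is a minimal ELT sketch. It assumes an in-memory SQLite database standing in for the destination store, and the `raw_orders` table, `clean_orders` view, and sample rows are hypothetical names chosen for the example. Raw data is loaded untouched, and the transformation is expressed as SQL that runs inside the store.

```python
import sqlite3

# Hypothetical raw records, kept as text exactly as they arrived from the source.
RAW_ORDERS = [
    ("1", " Ada@Example.com ", "19.90"),
    ("2", "bob@example.com", "5"),
    ("3", "", "7.50"),  # missing email
]


def load_raw(conn, rows):
    """Load: land the raw data in the destination without cleaning it."""
    conn.execute("CREATE TABLE raw_orders (id TEXT, email TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def transform_in_store(conn):
    """Transform: define the cleaned shape as a view evaluated inside the store."""
    conn.execute(
        """
        CREATE VIEW clean_orders AS
        SELECT CAST(id AS INTEGER)  AS id,
               lower(trim(email))   AS email,
               CAST(amount AS REAL) AS amount
        FROM raw_orders
        WHERE trim(email) <> ''
        """
    )


def run_elt(conn):
    """Run load then transform; return the raw row count and the cleaned rows."""
    load_raw(conn, RAW_ORDERS)
    transform_in_store(conn)
    raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
    clean = conn.execute("SELECT id, email, amount FROM clean_orders ORDER BY id").fetchall()
    return raw_count, clean
```

Unlike the ETL ordering, all three raw rows remain in the store, including the invalid one. That makes it possible to change the transformation later without re-extracting, but it also means unmasked data has already landed, which is why compliance-sensitive workflows may prefer ETL.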
The backbone of any AI project
For machine learning, the pipeline is where most of the real work lives. Deduplication, filtering, normalization, tokenization, and quality scoring all happen here, and pipeline bugs propagate straight into model behavior. A pipeline that silently drops records or double-counts examples will quietly corrupt training. Treating the pipeline as versioned, testable infrastructure — not a one-off script — is what separates reproducible AI work from fragile experiments.
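One way to guard against silent drops and double-counting is to have each step account for every record it sees. The sketch below is an assumption-laden example, not a standard recipe: `StepReport`, `normalize`, `clean_examples`, and the `min_chars` threshold are hypothetical, and the deduplication is exact-match on normalized text rather than near-duplicate detection.

```python
import hashlib
import unicodedata
from dataclasses import dataclass


@dataclass
class StepReport:
    """Counts that let a test verify no record vanished unaccounted for."""
    seen: int = 0
    kept: int = 0
    dropped_short: int = 0
    dropped_duplicate: int = 0


def normalize(text):
    """Unicode-normalize, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.lower().split())


def clean_examples(texts, min_chars=5):
    """Normalize, filter short texts, and drop exact duplicates, with full accounting."""
    report = StepReport()
    seen_hashes = set()
    kept = []
    for text in texts:
        report.seen += 1
        norm = normalize(text)
        if len(norm) < min_chars:
            report.dropped_short += 1
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            report.dropped_duplicate += 1
            continue
        seen_hashes.add(digest)
        kept.append(norm)
        report.kept += 1
    # Invariant: every input is either kept or dropped for a recorded reason.
    assert report.seen == report.kept + report.dropped_short + report.dropped_duplicate
    return kept, report
```

Because the step returns its counts alongside its output, a unit test or a monitoring check can compare them across runs, so a sudden jump in `dropped_duplicate` or a mismatch in totals surfaces before it reaches training.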
ELT pipelines commonly load into a data lake, and a pipeline's transform stage often computes the inputs that populate a feature store.
