Definition
A train-test split is the practice of dividing a dataset into separate subsets so that a model is evaluated on data it never learned from. It is the primary defense against overfitting — the failure mode where a model memorizes its training examples and then performs poorly on anything new. By holding back a portion of data, you get an honest estimate of how the model will behave in the real world rather than a flattering score on examples it has already seen.
Three subsets, three jobs
Most workflows use three partitions. The training set, the largest share, is what the model learns its parameters from. The validation set guides development decisions — tuning hyperparameters like learning rate and choosing between model variants — without contaminating the final evaluation. The test set is a strict holdout used exactly once, at the end, for an unbiased read on generalization. Common ratios are 70/15/15 or 80/10/10. A model that scores high on training but poorly on validation and test is overfitting.
How splitting goes wrong
The cardinal sin is leakage: letting information from the test set bleed into training, which produces dishonestly high scores that collapse in production. Time-ordered data must be split chronologically, never randomly, or the model effectively peeks at the future. Imbalanced classes need stratified splitting so each subset reflects the real distribution. For anyone training or fine-tuning models on their own hardware, a clean split is what separates a measurable result from wishful thinking.
The data being split is the labeled ground truth / labeled data, which usually arrives through a data pipeline / ETL process before partitioning.
In Simple Terms
A train-test split is the practice of dividing a dataset into separate subsets so that a model is evaluated on data it never learned from.…
