Definition
Pretraining is the first and most computationally expensive stage of building a large language model. The model is trained on a vast corpus of text to predict the next token in a sequence, and through that single objective it internalizes grammar, facts, coding patterns, and reasoning structure. The result is a base model, sometimes called a foundation model.
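To make "predict the next token" concrete, the objective is usually scored as the average negative log-probability the model assigns to the token that actually came next. The sketch below is illustrative only: `next_token_loss` and the toy probability tables are hypothetical stand-ins for a real model's output, not part of any library.

```python
import math

def next_token_loss(predicted_probs, targets):
    """Average negative log-probability given to the tokens that actually followed."""
    total = 0.0
    for probs, target in zip(predicted_probs, targets):
        total += -math.log(probs[target])
    return total / len(targets)

# Hypothetical model outputs: one probability distribution per position.
predicted_probs = [
    {"cat": 0.5, "dog": 0.5},
    {"sat": 0.25, "ran": 0.75},
]
# The tokens that really appeared next in the training text.
targets = ["cat", "sat"]

loss = next_token_loss(predicted_probs, targets)
print(f"average next-token loss: {loss:.4f}")
```

Training adjusts the weights to push this number down, which means assigning higher probability to the tokens that actually appear in the corpus.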
Self-supervised by design
Pretraining is described as self-supervised because the labels come from the data itself: for any position in the text, the "correct answer" is simply the token that actually follows. No human annotation is required, which is what allows training on internet-scale corpora measured in trillions of tokens. To predict those tokens well, the model is forced to compress an enormous amount of linguistic and world knowledge into its weights.
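The way labels fall out of the raw text can be sketched in a few lines. This is a minimal illustration, assuming a toy whitespace split in place of the subword tokenizer a real model would use; `next_token_pairs` is a hypothetical helper name.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs where each target is the token following its context."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Toy whitespace "tokenizer" (an assumption for illustration only).
tokens = "the cat sat on the mat".split()

pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Every position in the text yields a training example for free, with no annotator involved, which is why the approach scales to corpora of any size.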
Base models versus assistants
A freshly pretrained base model is a powerful text predictor but not yet a helpful, instruction-following assistant. Given a prompt, it simply continues the text with whatever is statistically likely, regardless of the user's intent or of safety. Turning it into a usable assistant requires later stages such as instruction fine-tuning and preference alignment. Understanding this split matters for operators: base models offer maximum flexibility and minimal imposed behavior, while aligned models trade some of that openness for usability and guardrails.
Pretraining produces the foundation that later fine-tuning and alignment methods like RLHF refine. It is also where a model's capacity for in-context learning emerges, and it is the stage that determines the knowledge available to a local LLM you run yourself.
