Pretraining

Sovereign AI

Pretraining is the first and most computationally expensive stage of building a large language model. The model is trained on a vast corpus of text to predict the next token in a sequence, and through that single objective it internalizes grammar, facts, coding patterns, and reasoning structure. The result is a base model, sometimes called a foundation model — the raw material from which every chat assistant, coding helper, and local LLM is subsequently shaped.

Self-supervised by design

Pretraining is described as self-supervised because the labels come from the data itself: for any position in the text, the "correct answer" is simply the token that actually follows. No human annotation is required, which is what allows training on internet-scale corpora measured in trillions of tokens — web text, books, code, and reference material, first passed through a tokenizer that carves it into the subword units the model actually predicts. To predict those tokens well, the model is forced to compress an enormous amount of linguistic and world knowledge into its weights: you cannot reliably guess the next word of a physics explanation, a legal clause, or a Python function without implicitly modelling how physics, law, and Python work. That compression is the whole trick — intelligence-like behaviour emerging from a prediction objective at sufficient scale.

Base models versus assistants

A freshly pretrained base model is a powerful text predictor but not yet a helpful, instruction-following assistant. Ask it a question and it may continue with three more questions, because that is a statistically plausible continuation. Turning it into a usable assistant requires later stages: instruction fine-tuning teaches the request-response format, and preference alignment such as RLHF shapes tone and behaviour. Understanding this split matters for operators: base models offer maximum flexibility and minimal imposed behaviour, while aligned models trade some of that openness for usability and guardrails. It also matters for capability: a model's knowledge ceiling, its in-context learning ability, and its training cutoff date are all fixed at pretraining — everything after merely steers what pretraining built.

Why it matters to the self-hosting operator

Almost nobody self-hosts pretraining — frontier runs consume GPU-months at data-center scale and budgets in the millions. What the sovereign operator inherits is the downstream freedom: when a lab releases open weights, the expensive artifact of pretraining becomes public property, and everything after that point — fine-tuning on your own documents, quantization to fit consumer hardware, offline inference on a machine you control — is within reach of a workshop budget. That asymmetry is the entire economic basis of local AI: the costliest step is done once, openly, and amortized across everyone who runs the weights. Knowing where pretraining ends and adaptation begins tells you exactly what you can change about a model (behaviour, domain knowledge at the margins) and what you cannot (its fundamental capability class) without a data center of your own.

Signals to check on any open model

When evaluating open weights for local use, a few pretraining facts predict more than any benchmark chart: the parameter count (capability ceiling and hardware footprint), the training-token count (how thoroughly that capacity was filled), the data cutoff (what the model cannot know), and the tokenizer's handling of your languages (poorly tokenized languages cost more context and lose nuance). Model cards disclose these unevenly, and the gaps are informative — a release that hides its data recipe is asking for trust it has not earned. For a sovereign operator the habit transfers directly from hardware: read the spec sheet, verify what you can, and size your expectations to what the artifact actually is rather than what the launch post says it is.

Pretraining is the first and most computationally expensive stage of building a large language model. The model is trained on a vast corpus of text…

Explore the Full Glossary

Browse all Bitcoin mining terms from A to Z. Whether you are a beginner or expert, deepen your understanding of the mining ecosystem.

Mining Glossary

ASIC Miner Database

Compare 500+ miners with real-time profitability data, home mining scores, and detailed specs.

Compare Miners