Definition
The context window is the maximum amount of tokenized text a large language model can consider at one time. It functions as the model's working memory: anything outside the window is invisible to the model when it generates a response. The window is measured in tokens (sub-word units produced by the model's tokenizer), not in words or characters.
Shared Between Prompt and Response
The window is shared between the input you provide and the output the model produces. A model advertised with a 128K-token context can hold that many tokens across the prompt and the generated reply combined, so a very long prompt leaves less room for a long answer. When a conversation or document exceeds the limit, earlier material must be summarized, truncated, or dropped from the prompt and retrieved again only when it is needed.
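The shared budget can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any provider's API: `count_tokens` is a hypothetical stand-in that approximates tokens by counting whitespace-separated words (a real tokenizer would give different counts), and `output_room` and `trim_history` are names invented for this example.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def output_room(context_window: int, prompt_tokens: int) -> int:
    """Tokens left for the reply once the prompt occupies part of the window."""
    return max(context_window - prompt_tokens, 0)


def trim_history(messages: List[str], context_window: int, reserved_for_reply: int) -> List[str]:
    """Keep the most recent messages that fit in the window, dropping the oldest first."""
    budget = context_window - reserved_for_reply
    kept: List[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break  # this message and everything older is left out
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

With a 128,000-token window, `output_room(128_000, 100_000)` leaves 28,000 tokens for the reply. Dropping the oldest messages is only one possible policy. Summarizing them, or storing them elsewhere and retrieving them on demand, are the alternatives mentioned above.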
Why Size Matters
A larger context window lets a model work with longer documents, codebases, or chat histories without compressing them first, and it gives retrieval-augmented systems more room to inject reference passages. As of 2026, frontier models range from around 128K tokens to multi-million-token windows, though longer windows cost more memory and compute per request.
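Context windows are the constraint that makes retrieval-augmented generation (RAG) valuable: instead of loading an entire corpus into the prompt, a RAG system fetches only the passages relevant to the current query. Window size also shapes what runs on self-hosted hardware, because the memory a model needs during inference grows with the number of tokens it holds in context.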
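A rough way to see the hardware effect is to estimate the key-value (KV) cache, which is often the main part of inference memory that scales with context length. The sketch below assumes a standard decoder-only transformer that caches one key and one value vector per token, per layer, per KV head, with no cache quantization, paging, or sliding-window attention. The model shape used here (32 layers, 8 KV heads, head dimension 128, 2-byte values) is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(
    context_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
    batch_size: int = 1,
) -> int:
    """Estimate KV-cache size in bytes for a given number of tokens in context."""
    # The leading 2 accounts for storing both a key and a value vector.
    return (
        2 * num_layers * num_kv_heads * head_dim
        * context_tokens * bytes_per_value * batch_size
    )


GIB = 1024 ** 3

# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for tokens in (8_192, 131_072):
    size_gib = kv_cache_bytes(tokens, **shape) / GIB
    print(f"{tokens} tokens -> {size_gib:.1f} GiB of KV cache")
```

Under these assumptions the cache needs about 1 GiB at 8,192 tokens and 16 GiB at 131,072 tokens, on top of the model weights. The estimate grows linearly with context length, so filling a window 16 times larger takes 16 times the cache memory.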
See VRAM headroom in the GPU–LLM fit dataset.
