Context Window

The context window is the maximum amount of tokenized text a large language model can consider at one time. It functions as the model's working memory: anything that falls outside the window simply cannot be seen by the model when it generates its response. The window is measured in tokens — sub-word units produced by the model's tokenizer — not words or characters.

Shared Between Prompt and Response

The window is shared between the input you provide and the output the model produces. A model advertised with a 128K-token context can hold that many tokens across the prompt and the generated reply combined, so a very long prompt leaves less room for a long answer. When a conversation or document exceeds the limit, earlier material must be summarized, truncated, or moved out of context and retrieved only when needed.
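The shared budget comes down to simple subtraction. The sketch below is illustrative only: the function name, the optional `reserve` parameter, and the reading of "128K" as 128,000 tokens are assumptions, not part of any particular runtime.

```python
def max_reply_tokens(context_window: int, prompt_tokens: int, reserve: int = 0) -> int:
    """Tokens left for the reply once the prompt (and an optional reserve) is counted."""
    if context_window < 0 or prompt_tokens < 0 or reserve < 0:
        raise ValueError("token counts must be non-negative")
    # Prompt and reply share one window, so the reply gets whatever remains.
    return max(0, context_window - prompt_tokens - reserve)


WINDOW = 128_000  # assumption: "128K" taken as 128,000 tokens

print(max_reply_tokens(WINDOW, 2_000))    # 126000: short prompt, long answer possible
print(max_reply_tokens(WINDOW, 120_000))  # 8000: long prompt squeezes the reply
print(max_reply_tokens(WINDOW, 130_000))  # 0: the prompt alone already overflows
```

A result of zero is the signal that earlier material has to be shortened or moved out of context before the model can answer at all.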

Why Size Matters

A larger context window lets a model work with longer documents, codebases, or chat histories without compressing them first, and it gives retrieval-augmented systems more room to inject reference passages. As of 2026, frontier models range from around 128K tokens to multi-million-token windows, though longer windows cost more memory and compute per request.

Context windows are the constraint that makes retrieval-augmented generation (RAG) valuable, and window size directly shapes what runs on self-hosted hardware during inference.

Where the Memory Actually Goes

The context window is not free storage — every token held in context has attention keys and values cached for every layer of the model. That KV cache grows linearly with context length and can rival the model weights themselves at long contexts: a model that fits comfortably in VRAM at 4K tokens can exhaust the same card at 64K. This is why runtimes let you set a context limit below the model's advertised maximum, and why quantizing the KV cache is a common lever when serving long conversations on consumer hardware.
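The linear growth can be estimated with a back-of-the-envelope formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes a single sequence and a hypothetical model shape (32 layers, 8 KV heads, head dimension 128); real runtimes add their own overhead, and quantized cache formats carry a little extra for scale factors.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: float = 2.0) -> float:
    """Approximate KV cache size for one sequence, ignoring runtime overhead."""
    # Factor 2: one key tensor plus one value tensor in every layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


GIB = 1024 ** 3
# Hypothetical model shape, chosen only to make the arithmetic concrete.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4_096, 65_536):
    fp16 = kv_cache_bytes(ctx, **shape)                      # 2 bytes per value
    q8 = kv_cache_bytes(ctx, **shape, bytes_per_value=1.0)   # roughly 8-bit cache
    print(f"{ctx:>6} tokens: {fp16 / GIB:.2f} GiB at 16-bit, {q8 / GIB:.2f} GiB at 8-bit")
```

Under these assumptions the cache goes from 0.5 GiB at 4,096 tokens to 8 GiB at 65,536 tokens, and storing one byte per value roughly halves both figures.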

Advertised Versus Effective Context

A model accepting 128K tokens is not the same as a model using 128K tokens well. Long-context evaluations consistently show retrieval quality degrading as relevant material sits deeper in a huge prompt, with information in the middle of the window recalled worse than material near the start or end — the widely reported “lost in the middle” effect. Practical systems therefore keep prompts as short as the task allows, place critical instructions at the edges, and retrieve only what is needed rather than dumping whole documents into context.
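One way to act on this advice is to assemble prompts so that instructions sit at both edges and only a few retrieved passages go in between. The sketch below is one possible layout, not a standard; `build_prompt`, its parameters, and the "Reminder:" wording are hypothetical, and it assumes the passages arrive already sorted from most to least relevant.

```python
def build_prompt(instructions: str, passages: list[str], question: str,
                 max_passages: int = 3) -> str:
    """Put instructions at the start and end; keep only the top passages in between."""
    # Assumes `passages` is ordered most-relevant first, so slicing keeps the best ones.
    selected = passages[:max_passages]
    parts = [
        instructions,                    # leading edge of the window
        *selected,                       # reference material in the middle
        question,
        "Reminder: " + instructions,     # trailing edge repeats the critical rule
    ]
    return "\n\n".join(parts)


prompt = build_prompt(
    instructions="Answer only from the passages below.",
    passages=["Passage A", "Passage B", "Passage C", "Passage D"],
    question="What does passage A say?",
)
print(prompt)
```

Capping the passage count is the "retrieve only what is needed" half of the advice; the right cap depends on the task and is worth tuning against real queries.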

Budgeting Context on Self-Hosted Hardware

For a local deployment, context length is a knob you trade against model size and speed. A larger model at modest context often beats a smaller model with an enormous window, because quality per token matters more than raw capacity for most tasks. Prompt processing also costs real time: before the first output token appears, every input token must be processed, so feeding 30K tokens of logs into a local model has a noticeable warm-up cost on consumer GPUs. Measure your actual working set — a repair-log summarizer may need 8K, a codebase assistant far more — and size the window to the job instead of the spec sheet.
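The warm-up cost can be roughed out by dividing the prompt length by the prompt-processing throughput measured on your own hardware. In the sketch below the 1,000 tokens-per-second figure is a placeholder, not a benchmark, and the model of a constant rate is a simplification: throughput usually falls as the context grows.

```python
def prefill_seconds(prompt_tokens: int, prefill_tokens_per_second: float) -> float:
    """Rough time before the first output token, assuming a constant prefill rate."""
    if prefill_tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    return prompt_tokens / prefill_tokens_per_second


# Placeholder throughput; substitute the number your own runtime reports.
MEASURED_RATE = 1_000.0

for tokens in (8_000, 30_000):
    print(f"{tokens} prompt tokens: about {prefill_seconds(tokens, MEASURED_RATE):.0f} s of prefill")
```

With that placeholder rate, an 8K prompt waits about 8 seconds and a 30K prompt about 30 seconds before any output appears, which is the gap the paragraph above describes.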

When a conversation genuinely must outlive the window, the standard strategies are rolling summarization (compress older turns into a short digest that stays in context), a sliding window that keeps only recent turns verbatim, and retrieval — storing the full history outside the model and pulling back only the passages relevant to the current question. Each trades fidelity for capacity in a different way, and mature local chat frontends implement at least one of them.
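A sliding window with an optional rolling summary can be sketched in a few lines. Everything here is illustrative: `fit_history` and `rough_token_count` are hypothetical names, the word-count tokenizer is a crude stand-in for a real one, and the digest is not counted against the budget, so a real implementation would reserve room for it.

```python
from typing import Callable, Optional


def rough_token_count(text: str) -> int:
    # Crude stand-in: counts whitespace-separated words, not real tokens.
    return len(text.split())


def fit_history(turns: list[str], budget: int,
                count: Callable[[str], int] = rough_token_count,
                summarize: Optional[Callable[[list[str]], str]] = None) -> list[str]:
    """Keep the most recent turns that fit in `budget`; optionally digest the rest."""
    kept: list[str] = []
    used = 0
    # Walk backwards from the newest turn and stop at the first one that overflows.
    for turn in reversed(turns):
        cost = count(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    dropped = turns[:len(turns) - len(kept)]
    if dropped and summarize is not None:
        # Rolling summary: older turns survive only as one short digest line.
        kept.insert(0, summarize(dropped))
    return kept


history = ["a b c", "d e", "f g h i", "j"]
print(fit_history(history, budget=5))
print(fit_history(history, budget=5,
                  summarize=lambda old: f"[summary of {len(old)} earlier turns]"))
```

The first call shows a pure sliding window; the second prepends a digest of the dropped turns. Retrieval, the third strategy, would replace `summarize` with a search over the stored history.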

See VRAM headroom in the GPU–LLM fit dataset.
