Definition
Context compression packs a long input into a much smaller set of learned representations, often called memory slots or soft tokens, which the model conditions on in place of the original text. Unlike retrieval, which fetches only the relevant chunks from an external store, context compression aims to preserve the meaning of the whole input while sharply reducing the number of vectors the model has to process. This lowers both latency and the memory cost of attention.
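To make the memory saving concrete, the sketch below estimates the size of the attention key/value cache before and after compression. It assumes a standard transformer decoder that stores one key and one value vector per token, per layer, per KV head. The function name and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures from any particular model.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate key/value cache size for a context of num_tokens vectors.

    Assumes one key and one value vector per token, per layer, per KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

full = kv_cache_bytes(4096, LAYERS, KV_HEADS, HEAD_DIM)        # raw 4096-token context
compressed = kv_cache_bytes(1024, LAYERS, KV_HEADS, HEAD_DIM)  # same context as 1024 slots

print(f"full context:    {full / 2**20:.0f} MiB")
print(f"compressed (4x): {compressed / 2**20:.0f} MiB")
```

Because the cache grows linearly with the number of vectors, a 4x reduction in vectors gives a 4x reduction in cache size under these assumptions.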
How it works
Methods such as the In-Context Autoencoder (ICAE) train an encoder, frequently the target model itself adapted with a lightweight method such as LoRA, to read a long context and emit a small number of dense slots that the decoder can reconstruct the text from or reason over. Compression ratios of around 4x are commonly reported, and the target model can consume the slots without being retrained from scratch. Recursive variants chain this process across segments to fold very long inputs into a fixed-size memory.
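The recursive idea can be sketched with a toy compressor. The code below is not the In-Context Autoencoder or any published method: `toy_compress` replaces the learned encoder with plain mean pooling, and the names `toy_compress` and `fold` are hypothetical. It only illustrates the control flow, in which each new chunk is compressed together with the existing memory so that the memory never exceeds a fixed number of slots.

```python
from typing import List

Vector = List[float]


def toy_compress(vectors: List[Vector], num_slots: int) -> List[Vector]:
    """Stand-in for a learned encoder: mean-pool contiguous groups into slots."""
    n = len(vectors)
    if n <= num_slots:
        return [list(v) for v in vectors]
    dim = len(vectors[0])
    slots = []
    for s in range(num_slots):
        # Split the sequence into num_slots contiguous, non-empty groups.
        group = vectors[s * n // num_slots:(s + 1) * n // num_slots]
        slots.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return slots


def fold(token_vectors: List[Vector], chunk_size: int, num_slots: int) -> List[Vector]:
    """Fold a long sequence into at most num_slots vectors, one chunk at a time."""
    memory: List[Vector] = []
    for start in range(0, len(token_vectors), chunk_size):
        chunk = token_vectors[start:start + chunk_size]
        # Compress the previous memory and the new chunk into a fresh memory.
        memory = toy_compress(memory + chunk, num_slots)
    return memory


# 1000 two-dimensional "token embeddings" folded into a 32-slot memory.
tokens = [[float(i), 1.0] for i in range(1000)]
memory = fold(tokens, chunk_size=128, num_slots=32)
print(len(memory), len(memory[0]))
```

With a chunk size of 128 and 32 slots, the first chunk is compressed at the 4x ratio mentioned above. A real system would replace the pooling with a trained encoder, since averaging discards the ordering and detail that learned slots are meant to retain.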
Where it fits
Context compression is the embedding-space cousin of prompt compression, which instead drops or rewrites actual text tokens. Compression into soft tokens can be denser and more faithful, but it produces representations that are not human-readable and usually requires a training step, whereas text-level pruning stays interpretable and model-agnostic. For a self-hosting operator, compression is attractive when the same long document is reused across many queries, since the dense memory can be computed once and cached.
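The reuse case can be sketched as a small cache keyed by a hash of the document text. This is a minimal illustration under stated assumptions: `SlotCache` and `placeholder_compress` are hypothetical names, the cache lives in process memory, and the compressor is a stub standing in for whatever encoder the deployment actually uses.

```python
import hashlib


class SlotCache:
    """Compute compressed slots once per distinct document, then reuse them."""

    def __init__(self, compress_fn):
        self._compress_fn = compress_fn  # hypothetical: document text -> slots
        self._store = {}
        self.misses = 0

    def get(self, document: str):
        # Key on the content hash so identical documents share one entry.
        key = hashlib.sha256(document.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compress_fn(document)
        return self._store[key]


def placeholder_compress(document: str):
    """Stub for a real encoder; returns a dummy value in place of slots."""
    return ("slots-for", len(document))


cache = SlotCache(placeholder_compress)
manual = "a long reference manual " * 1000

for _query in ["q1", "q2", "q3"]:
    slots = cache.get(manual)  # compressed on the first query only

print(cache.misses)
```

Since soft tokens are only meaningful to the model they were trained for, a production cache would likely also include a model or encoder version in the key.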
Compare with prompt compression and see the long context window it helps you afford.
