Lost in the Middle


Lost in the middle is a documented failure pattern in which a language model retrieves and reasons over information most reliably when it sits near the beginning or end of a long input, and noticeably worse when the relevant fact is buried in the middle. The term comes from a 2023 study by Liu and colleagues that tested multi-document question answering and key-value retrieval while systematically moving the answer's position through the prompt. The result was a characteristic U-shaped accuracy curve: strong at the edges, sagging in the center — a finding that reshaped how practitioners think about long prompts.

The finding landed with force because it contradicted the era's simplest scaling story. Context windows were growing from thousands of tokens toward hundreds of thousands, and the implicit promise was that you could soon paste an entire codebase or document archive into a prompt and let the model handle the rest. Liu and colleagues' experiments punctured that promise: on tasks where the model demonstrably had every needed fact in context, accuracy still depended sharply on where the fact sat. The paper became a widely cited practical result in the prompt-engineering literature because it converted a vague suspicion ("long prompts feel unreliable") into a measured, reproducible curve that anyone could test on their own models.

What the research showed

Performance degraded substantially when the model had to access information located in the middle of a long context, and the effect persisted even in models explicitly built and marketed for long inputs. In other words, a large context window tells you how much text a model can accept, not how evenly it attends to it. The two claims are routinely conflated in marketing, and lost-in-the-middle is the gap between them made measurable. Related evaluations in the needle-in-a-haystack family probe the same weakness by hiding a single fact at varying depths and asking the model to find it.
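A needle-in-a-haystack probe of the kind described above can be sketched in a few lines. This is a minimal illustration, not the harness used in any published evaluation: the function name `build_needle_prompt`, the filler sentences, and the "secret code" needle are all hypothetical, and sending the prompts to a model and scoring the answers is left out.

```python
def build_needle_prompt(filler_sentences, needle, question, depth):
    """Hide `needle` at a relative depth in the filler text.

    depth = 0.0 puts the needle first, 1.0 puts it last, 0.5 in the middle.
    All names here are illustrative assumptions, not a standard API.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = int(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical test material: ten filler sentences and one fact to retrieve.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."
question = "What is the secret code?"

# One prompt per depth; each would be sent to the model under test.
depths = [0.0, 0.25, 0.5, 0.75, 1.0]
prompts = {d: build_needle_prompt(filler, needle, question, d) for d in depths}
```

Real probes use far longer filler and many trials per depth, but the structure is the same: hold the fact and the question constant and vary only where the fact sits.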

Why it happens

The bias is positional, most plausibly rooted in how attention distributes weight across a sequence in a transformer. Training data plausibly reinforces it too: in natural documents, the most important information clusters at beginnings and ends (abstracts, introductions, conclusions), so models may learn to treat the middle as lower priority. Position-encoding schemes and long-context training recipes have chipped away at the effect in newer models, but assuming the bias is present remains a sound default, and the only way to know how much a specific model suffers is to test it.
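Testing a specific model comes down to recording, for each trial, where the answer sat and whether the model got it right, then comparing the edges with the interior. The sketch below assumes results arrive as `(position, is_correct)` pairs; the helper names and the `middle_sag` metric are illustrative choices, not a standard measure, and the sample data is made up.

```python
from collections import defaultdict


def accuracy_by_position(results):
    """Aggregate (position, is_correct) pairs into accuracy per position."""
    totals = defaultdict(lambda: [0, 0])  # position -> [hits, trials]
    for position, correct in results:
        totals[position][0] += int(correct)
        totals[position][1] += 1
    return {pos: hits / n for pos, (hits, n) in sorted(totals.items())}


def middle_sag(accuracy):
    """Mean accuracy at the two edge positions minus mean interior accuracy.

    A clearly positive value suggests a U-shaped curve; a value near zero
    suggests the model uses its context roughly evenly.
    """
    positions = sorted(accuracy)
    if len(positions) < 3:
        raise ValueError("need at least three positions to compare")
    edges = (accuracy[positions[0]] + accuracy[positions[-1]]) / 2
    interior = [accuracy[p] for p in positions[1:-1]]
    return edges - sum(interior) / len(interior)


# Made-up trial outcomes for illustration only: positions 0 (start),
# 1 (middle), 2 (end), two trials each.
sample = [(0, True), (0, True), (1, True), (1, False), (2, True), (2, True)]
curve = accuracy_by_position(sample)
sag = middle_sag(curve)
```

With enough trials per position, plotting `curve` shows directly whether the model under test still has a dip in the middle and how deep it is.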

Practical implications

For anyone building retrieval or summarization pipelines on their own hardware, the lesson is blunt: put the material that matters where the model actually reads. Concretely, three habits follow. Order retrieved passages by relevance rather than arbitrarily, and consider placing the strongest evidence first and last rather than in strictly descending order. Put key instructions at the top and restate the question at the bottom, bracketing the context. And stop dumping huge undifferentiated blobs into the prompt on the theory that the model will sort it out. In a retrieval-augmented generation (RAG) pipeline this is nearly free to exploit, since your code already controls chunk ordering; a few lines that rearrange passages can deliver gains that would otherwise be sought by upgrading to a bigger model. The corollary is about retrieval quality: fetching fewer, better chunks tends to outperform stuffing the window, because everything you add pushes something else toward the dead zone.
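The ordering and bracketing advice can be sketched as two small helpers. This is one possible arrangement under stated assumptions: `edges_first` expects passages already sorted most-relevant first, and both function names and the prompt template are hypothetical rather than taken from any particular RAG library.

```python
def edges_first(passages_by_relevance):
    """Reorder passages so the strongest sit at the start and end.

    Input must be sorted most-relevant first. Even-ranked items fill the
    front, odd-ranked items fill the back in reverse, so the weakest
    passages end up in the middle of the context.
    """
    front, back = [], []
    for rank, passage in enumerate(passages_by_relevance):
        (front if rank % 2 == 0 else back).append(passage)
    return front + back[::-1]


def build_prompt(instructions, passages, question):
    """Instructions and question on top, context in between, question restated last."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        f"{instructions}\n\n"
        f"Question: {question}\n\n"
        f"Context:\n{context}\n\n"
        f"Question (restated): {question}\n"
        "Answer:"
    )


# Illustrative input: five passages, "a" most relevant, "e" least.
ranked = ["a", "b", "c", "d", "e"]
ordered = edges_first(ranked)
prompt = build_prompt(
    "Answer using only the context.", ordered, "What changed in v2?"
)
```

Here `ordered` comes out as `["a", "c", "e", "d", "b"]`: the best passage leads, the second best closes, and the weakest lands in the center. Whether this beats plain descending order for a given model is something to measure rather than assume.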

The local-model angle

The effect matters even more when you run models on your own hardware. Long contexts are computationally expensive: memory for the key-value cache grows linearly with every token and prefill time stretches, so a home-lab machine pays real electricity and latency for context the model may then half-ignore. Compact local models also tend to show stronger positional bias than frontier ones, making prompt structure a bigger lever precisely where hardware is most constrained. Treat the long context window as an opportunity that rewards deliberate use, not an automatic win: on your own silicon, a well-ordered 4,000-token prompt can beat a careless 40,000-token one on accuracy, speed, and power draw all at once.
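The memory side of that trade-off can be estimated with simple arithmetic: a standard transformer stores one key and one value vector per token, per layer, per key-value head. The sketch below assumes a hypothetical 7B-class configuration (32 layers, 32 key-value heads, head dimension 128, 16-bit values, no grouped-query attention or cache quantization); substitute the numbers from your own model's config.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate key-value cache size for one sequence.

    The leading 2 accounts for storing both keys and values. Real runtimes
    add overhead and may quantize or page the cache, so treat this as a
    rough lower bound rather than an exact figure.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Hypothetical model shape, chosen for illustration only.
MODEL = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2)

GIB = 2**30
short_gib = kv_cache_bytes(4_000, **MODEL) / GIB   # well-ordered prompt
long_gib = kv_cache_bytes(40_000, **MODEL) / GIB   # careless prompt
```

Under these assumptions the 4,000-token prompt needs about 1.95 GiB of cache and the 40,000-token prompt about 19.5 GiB, which is the difference between fitting comfortably on a consumer GPU and not fitting at all. Models that use grouped-query attention have fewer key-value heads and scale the figure down proportionally.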
