Chunking

Chunking is the step in a retrieval pipeline where long documents are split into smaller passages before they are embedded and stored for search. It is one of the most consequential design choices in a retrieval-augmented generation (RAG) system: chunks that are too large dilute relevance and waste context budget, while chunks that are too small lose the surrounding meaning needed to answer a question. Reports suggest that the gap between a good and a poor chunking strategy can shift retrieval accuracy by tens of percent, which makes chunking one of the highest-leverage and least glamorous knobs in the whole pipeline.

Why splitting is necessary at all

Two constraints force the split. Embedding models can only encode a bounded span of text into a single vector, and a vector that tries to summarise an entire manual becomes a smeared average of many topics — mediocre at matching any specific question. Retrieval works best when each stored vector represents one reasonably focused idea. At the other end, whatever is retrieved must fit, alongside the question and the model's instructions, inside the language model's context window, so passages must be small enough that several can be stacked without overflowing the budget.
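The context-window constraint comes down to simple arithmetic. The sketch below is illustrative only: the function name and every number in it are hypothetical, and the reserve for the model's answer is an added assumption rather than something stated above.

```python
def max_chunks_in_context(
    context_window: int,
    instruction_tokens: int,
    question_tokens: int,
    answer_reserve: int,
    chunk_tokens: int,
) -> int:
    """Return how many retrieved chunks fit in the remaining context budget.

    answer_reserve is an assumed allowance for the generated answer.
    """
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    available = context_window - instruction_tokens - question_tokens - answer_reserve
    # Never report a negative count when the fixed parts already overflow.
    return max(0, available // chunk_tokens)


# Hypothetical budget: 8192-token window, 300 tokens of instructions,
# 100 tokens of question, 1024 tokens reserved for the answer.
print(max_chunks_in_context(8192, 300, 100, 1024, 512))  # prints 13
```

Halving the chunk size in this budget roughly doubles the number of passages that can be stacked, which is the trade-off the paragraph above describes.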

Common strategies

Fixed-size chunking splits text every N tokens — simple and fast, but blind to meaning, happily cutting a sentence or a procedure in half. Recursive chunking splits along natural boundaries (headings, paragraphs, then sentences) and is the usual baseline. Semantic chunking groups text by topic coherence so each chunk represents a single idea, at the cost of extra compute at indexing time. Structure-aware variants respect document formats — keeping a table, code block, or step-by-step procedure intact — which matters enormously for technical material. A common practical starting point is roughly 400–512 tokens per chunk with 10–20% overlap, where the overlapping sliding window carries context across boundaries so a fact split between two chunks is not lost. Many pipelines also attach metadata (source document, section heading) to each chunk so the generator can cite where an answer came from.
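As a rough illustration of fixed-size chunking with a sliding-window overlap, here is a minimal sketch. It assumes whitespace-separated words as a stand-in for real tokens (a production pipeline would count tokens with the embedding model's own tokenizer), and the `Chunk` fields are hypothetical metadata names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str        # hypothetical metadata: originating document
    start_token: int   # offset of the chunk's first token in the source


def chunk_fixed(text: str, source: str, size: int = 512, overlap: int = 64) -> list[Chunk]:
    """Split text into windows of `size` tokens, each sharing `overlap` tokens
    with the previous window."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    tokens = text.split()  # assumption: words approximate tokens
    step = size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(Chunk(" ".join(window), source, start))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Tiny demonstration with size 4 and overlap 1.
demo = " ".join(f"w{i}" for i in range(10))
for c in chunk_fixed(demo, source="demo.txt", size=4, overlap=1):
    print(c.start_token, c.text)
```

The defaults of 512 and 64 give a 12.5% overlap, inside the 10–20% range mentioned above. Note that this splitter is still blind to meaning: it will cut a sentence or a procedure wherever the window happens to end.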

Why it matters for a self-hosted knowledge base

For a self-hosted knowledge tool — say, answering questions over a library of mining manuals and repair notes with a local model served by Ollama or llama.cpp — chunking determines whether the right passage is even retrievable. A troubleshooting procedure chopped mid-step retrieves as two half-answers; a spec table split from its column headers becomes noise. Good chunks become good embeddings, which produce precise hits at query time. The honest workflow is empirical: index, ask the questions you actually care about, inspect which chunks were retrieved, and adjust size and boundaries until the failures stop. Chunking is cheap to redo; a knowledge base that silently can't find its own facts is expensive to trust.
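The empirical loop described above can be approximated with a small test harness. This is a sketch under stated assumptions: `hit_rate` and `make_keyword_retriever` are hypothetical helpers, the word-overlap retriever is only a stand-in for a real embedding search, and the sample chunks and questions are invented for illustration.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]


def hit_rate(cases: list[tuple[str, str]], retrieve: Retriever, k: int = 3) -> tuple[float, list[str]]:
    """Return the fraction of questions whose expected phrase appears in the
    top-k retrieved chunks, plus the questions that failed."""
    misses = []
    for question, must_contain in cases:
        hits = retrieve(question, k)
        if not any(must_contain.lower() in chunk.lower() for chunk in hits):
            misses.append(question)
    rate = 1 - len(misses) / len(cases) if cases else 0.0
    return rate, misses


def make_keyword_retriever(chunks: list[str]) -> Retriever:
    """Toy retriever: rank chunks by the number of words shared with the question."""
    def retrieve(question: str, k: int) -> list[str]:
        q_words = set(question.lower().split())
        ranked = sorted(
            chunks,
            key=lambda c: len(q_words & set(c.lower().split())),
            reverse=True,
        )
        return ranked[:k]
    return retrieve


# Invented sample data for illustration only.
chunks = [
    "Reset the hashboard by holding the button for five seconds",
    "Fan speed limit is 6000 rpm",
    "Clean the heatsink with compressed air",
]
cases = [
    ("how do I reset the hashboard", "holding the button"),
    ("what is the fan speed limit", "6000 rpm"),
    ("what firmware version is required", "firmware 2.1"),
]
rate, misses = hit_rate(cases, make_keyword_retriever(chunks), k=1)
print(f"hit rate: {rate:.2f}, misses: {misses}")
```

Re-running the same cases after each change to chunk size or boundaries shows whether the change helped, and the list of misses points at which passages to inspect.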

Chunking also does not work alone. Retrieval quality improves further when vector search is combined with keyword (hybrid) search, when a reranking step re-orders the top candidates by actual relevance to the question, and when retrieved chunks carry enough metadata for the model to cite its sources — each of which can mask or amplify chunking decisions. The sane order of operations is still chunking first: no amount of reranking can surface a fact that was split into incoherent fragments at indexing time, and every downstream stage inherits the boundaries you chose here. Get the chunks right and the rest of the pipeline is tuning; get them wrong and the rest is compensation.
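One common way to combine vector and keyword results is reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so treat this as one possible sketch rather than the method any particular pipeline uses. The chunk IDs are hypothetical, and `k = 60` is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one list.

    Each appearance contributes 1 / (k + rank), with rank starting at 1,
    so chunks ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a deterministic order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical top-3 results from each retriever.
vector_hits = ["c1", "c2", "c3"]
keyword_hits = ["c3", "c1", "c4"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# prints ['c1', 'c3', 'c2', 'c4']
```

Fusion only reorders chunks that already exist in the index, which is why it cannot repair a fact that was split into incoherent fragments at indexing time.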

Each chunk is converted to a vector and stored in a vector database. In practice, chunk size is bounded by the embedding model's input limit and the language model's context window, so tune it like the engineering parameter it is.
