Definition
Retrieval evaluation metrics quantify how well a search or retrieval system surfaces the right documents for a query. They let you tune a private retrieval pipeline objectively rather than by gut feel, especially when comparing index settings, embedding models, or hybrid search configurations. Most are reported "@K", meaning they consider only the top K returned results. The metrics covered here range from 0 to 1, with higher being better.
Rank-unaware metrics
Precision@K is the fraction of the top K results that are actually relevant; it measures how clean the result list is. Recall@K is the fraction of all relevant documents that appear in the top K; it measures coverage. These two ignore the order within the top K, so a relevant document helps the score equally whether it is ranked first or Kth.
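The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the document IDs, and the choice to divide precision by K even when fewer than K results come back are all assumptions made for the example.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    # Assumed convention: divide by k even if fewer than k results exist.
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        return 0.0  # Assumed convention for queries with no labelled answers.
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical query: five returned documents, three labelled relevant.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5"}

print(precision_at_k(ranked, relevant, 5))         # 0.4
print(round(recall_at_k(ranked, relevant, 5), 3))  # 0.667

# Reordering the top K leaves both scores unchanged (rank-unaware).
shuffled = list(reversed(ranked))
print(precision_at_k(shuffled, relevant, 5) == precision_at_k(ranked, relevant, 5))
```

Two of the five results are relevant, so precision is 2/5. Two of the three relevant documents were found, so recall is 2/3. The missing document, d5, lowers recall without affecting precision.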
Rank-aware metrics
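When position matters, use rank-aware measures. Mean Reciprocal Rank (MRR) rewards placing the first relevant result near the top: each query scores the reciprocal of that result's rank (1.0 at rank 1, 0.5 at rank 2, and so on), and the scores are averaged over all queries. It is useful when one right answer is enough. Normalized Discounted Cumulative Gain (NDCG) rewards placing multiple relevant documents near the top, discounting lower positions logarithmically and normalizing against the best possible ordering. It can also use graded relevance labels rather than a simple relevant/not-relevant split, which is why it is often treated as the most comprehensive single measure of ranking quality.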
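A short Python sketch may make the two rank-aware measures concrete. It assumes the linear-gain form of DCG (gain divided by log2(rank + 1)). Another common variant uses 2^gain - 1 as the numerator, and the two agree for binary labels. All function names and the sample data are hypothetical.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def dcg_at_k(gains, k):
    """Discounted cumulative gain: position i (0-based) is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, gain_by_id, k):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    gains = [gain_by_id.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(gain_by_id.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Three hypothetical queries: first hit at rank 1, at rank 2, and never.
runs = [
    (["a", "b"], {"a"}),
    (["x", "y"], {"y"}),
    (["p", "q"], {"z"}),
]
print(mean_reciprocal_rank(runs))  # 0.5

# Binary gains: "a" and "c" are relevant, "b" is not.
gains = {"a": 1.0, "c": 1.0}
print(ndcg_at_k(["a", "c", "b"], gains, 3))  # ideal ordering
print(ndcg_at_k(["a", "b", "c"], gains, 3))  # relevant doc pushed to rank 3
```

The second NDCG call scores lower than the first even though both lists contain the same documents, which is exactly the ordering sensitivity that Precision@K and Recall@K lack.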
For a sovereign operator, these metrics turn retrieval tuning into something you can measure on your own labelled set, so you can improve the semantic search layer feeding a local LLM with evidence rather than guesswork. Strong retrieval scores are a precondition for a trustworthy RAG answer, since the model can only ground its response in what the retriever surfaces.
