Cosine Similarity


Cosine similarity measures how alike two vectors are by computing the cosine of the angle between them. The result ranges from 1 (identical direction), through 0 (orthogonal, unrelated), to -1 (opposite direction). It is the most common scoring function used to rank embeddings in a vector database, because it compares the direction of meaning encoded in a vector while ignoring its magnitude. When a retrieval pipeline claims two passages are "semantically close," this — an angle between two arrows in a space with hundreds of dimensions — is almost always the number doing the work.
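As a minimal sketch of the definition above, the following plain-Python function computes the cosine of the angle between two vectors and checks the three landmark cases. The function name `cosine_similarity` and the toy two-dimensional vectors are illustrative assumptions, not part of any particular vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle to a zero vector is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors chosen only to illustrate the three landmark values.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])         # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])   # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction
print(same, orthogonal, opposite)
```

The three results land at (approximately, given floating-point rounding) 1, 0, and -1 respectively.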

Relationship to the dot product

Cosine similarity equals the dot product of two vectors divided by the product of their magnitudes. When the vectors are normalised to unit length, the denominator becomes 1, so cosine similarity reduces to a plain dot product. This is why many vector stores normalise embeddings on insert and then use the cheaper dot product internally: the two are mathematically equivalent for unit vectors, and the dot product is faster to compute at scale — a multiply-accumulate per dimension, the kind of arithmetic modern hardware devours. On normalised vectors, Euclidean distance also becomes a simple monotonic function of cosine similarity, so the three standard metrics collapse into one ranking. The practical rule: normalise your embeddings and most metric-choice anxiety evaporates.
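The equivalence described above can be checked numerically. This is a small sketch on two made-up three-dimensional vectors; the helper names `normalise` and `dot` are assumptions for illustration. It confirms that the dot product of the unit vectors matches the full cosine formula, and that squared Euclidean distance between unit vectors equals 2 - 2 * cosine, which is why all three metrics produce the same ranking after normalisation.

```python
import math


def dot(a, b):
    """Plain dot product: one multiply-accumulate per dimension."""
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Full formula: dot product divided by the product of magnitudes."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Illustrative vectors, not real embeddings.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
ua, ub = normalise(a), normalise(b)

cos_full = cosine_similarity(a, b)   # full formula on the raw vectors
cos_unit = dot(ua, ub)               # plain dot product on unit vectors
sq_dist = sum((x - y) ** 2 for x, y in zip(ua, ub))  # squared Euclidean distance

print(cos_full, cos_unit, sq_dist, 2 - 2 * cos_unit)
```

Because squared distance falls as cosine rises, sorting by smallest distance and sorting by largest cosine give the same order on unit vectors.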

Why direction beats distance for text

For text, what matters is the overall content or topic, not how "long" the document's vector is. Two passages can express the same idea at different lengths and intensities; cosine focuses on the shared direction rather than absolute position, making it well suited to semantic search, deduplication, and document classification. Euclidean distance, by contrast, is sensitive to magnitude and is preferred for some clustering tasks where scale genuinely carries information. Intuition helps here: in embedding space, direction encodes what a text is about, so "hashboard voltage fault" and a long forum thread describing the same failure can point the same way even though the texts share few literal words. That is precisely what lexical keyword search misses and semantic retrieval catches.
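To make the magnitude point concrete, here is a toy sketch in which one vector is simply a scaled copy of another. The vectors `short_doc` and `long_doc` are invented stand-ins, not output from a real embedding model (many real models emit vectors that are already close to unit length). Cosine treats the pair as identical in direction, while Euclidean distance reports them as far apart.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical vectors: same direction, five times the magnitude.
short_doc = [0.2, 0.1, 0.4]
long_doc = [x * 5.0 for x in short_doc]

cos = cosine_similarity(short_doc, long_doc)
dist = euclidean_distance(short_doc, long_doc)
print(cos, dist)
```

Cosine comes out at essentially 1 because the direction is unchanged, while the Euclidean distance grows with the scale factor.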

Dimensionality deserves a moment's attention: intuition about angles comes from two or three dimensions, but embeddings live in hundreds, where almost all random vectors are nearly orthogonal. Against that baseline, even a modest positive similarity can be a meaningful signal.
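The near-orthogonality of random high-dimensional vectors can be seen in a quick simulation. This sketch draws random Gaussian directions as a stand-in; real embeddings are not random and often cluster more tightly, so treat it as an illustration of the geometry only. The dimension 768 and the pair count of 200 are arbitrary choices.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Random direction: Gaussian components scaled to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_abs_cosine(dim, pairs, rng):
    """Average |cosine| over randomly drawn pairs of unit vectors."""
    total = 0.0
    for _ in range(pairs):
        a = random_unit_vector(dim, rng)
        b = random_unit_vector(dim, rng)
        total += abs(sum(x * y for x, y in zip(a, b)))
    return total / pairs


rng = random.Random(0)  # fixed seed so the run is repeatable
low_dim = mean_abs_cosine(3, 200, rng)
high_dim = mean_abs_cosine(768, 200, rng)
print(low_dim, high_dim)
```

In three dimensions the average absolute cosine sits around one half; in 768 dimensions it shrinks to a few hundredths, which is the sense in which random directions are "nearly orthogonal".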

At scale, a vector database rarely compares your query against every stored vector; approximate nearest-neighbour indexes such as HNSW graphs narrow the candidate set first, then rank the survivors by cosine score. The approximation trades a sliver of recall for orders-of-magnitude speed, and it is why a self-hosted vector store on modest hardware can search millions of embeddings in milliseconds.
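For contrast, the exhaustive search that an approximate index avoids looks roughly like the sketch below: score every stored vector against the query and keep the top k. This is the exact baseline, not an HNSW implementation. The corpus, document ids, and the `top_k` helper are hypothetical.

```python
import heapq
import math


def normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query, corpus, k):
    """Exact search: score every stored unit vector, keep the k best.

    `corpus` maps a document id to an already-normalised vector, so the
    dot product below is the cosine similarity.
    """
    q = normalise(query)
    scored = (
        (sum(x * y for x, y in zip(q, vec)), doc_id)
        for doc_id, vec in corpus.items()
    )
    return heapq.nlargest(k, scored)


# Hypothetical corpus, normalised on insert.
corpus = {
    "doc_a": normalise([1.0, 0.0, 0.0]),
    "doc_b": normalise([0.9, 0.1, 0.0]),
    "doc_c": normalise([0.0, 1.0, 0.0]),
    "doc_d": normalise([-1.0, 0.0, 0.0]),
}

results = top_k([1.0, 0.0, 0.0], corpus, k=2)
ranked_ids = [doc_id for _, doc_id in results]
print(ranked_ids)
```

The cost of this loop grows linearly with the number of stored vectors, which is the work an approximate index is designed to skip.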

In a self-hosted retrieval pipeline

Cosine similarity is where the quality of a private RAG stack is quietly decided. Three field notes for the operator. First, match the metric to the model: embedding models are trained with a particular similarity objective, and scoring with a different one degrades ranking, so check the model card and, when in doubt, normalise and use cosine. Second, never compare vectors across different embedding models; each model defines its own space, and angles between spaces are meaningless. Third, treat the scores as ordinal, not absolute: a 0.83 is better than a 0.79 from the same model, but thresholds like "keep everything above 0.8" must be calibrated on your own corpus, because raw, untuned similarity scores can mislead. The reward for getting this right is a search-and-retrieval layer that runs entirely on your hardware: your documents embedded, stored, and ranked by a local LLM stack, with nothing sent to a third party to be indexed. Understanding the one formula at the centre of it is a small price for owning the whole pipeline.
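One way to calibrate a cut-off on your own corpus is sketched below. It assumes you have hand-labelled a sample of query/passage pairs as relevant or not, and it picks the threshold that maximises F1 on that sample. The scores and labels are invented for illustration, and F1 is only one possible criterion; a pipeline that cares more about precision or recall would optimise something else.

```python
def f1_at_threshold(scored_pairs, threshold):
    """F1 score when every pair scoring >= threshold is kept."""
    tp = sum(1 for s, rel in scored_pairs if s >= threshold and rel)
    fp = sum(1 for s, rel in scored_pairs if s >= threshold and not rel)
    fn = sum(1 for s, rel in scored_pairs if s < threshold and rel)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate_threshold(scored_pairs):
    """Try each observed score as a cut-off and return the best by F1."""
    candidates = sorted({s for s, _ in scored_pairs})
    return max(candidates, key=lambda t: f1_at_threshold(scored_pairs, t))


# Made-up (cosine score, is_relevant) labels from a hypothetical review
# of retrieval results produced by a single embedding model.
labelled = [
    (0.91, True), (0.86, True), (0.83, True), (0.81, False),
    (0.79, True), (0.74, False), (0.70, False), (0.62, False),
]

best = calibrate_threshold(labelled)
print(best, f1_at_threshold(labelled, best), f1_at_threshold(labelled, 0.80))
```

On this invented sample the best cut-off falls just below the folk value of 0.8, which would have discarded one relevant passage. A different corpus or model would move the number, which is the reason to measure rather than assume.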
