Definition
Binary quantization compresses an embedding by reducing every float32 dimension to a single bit. The rule is trivial: if a (normalized) value is greater than zero, store a 1; otherwise store a 0. A 1,024-dimensional float vector that occupied 4,096 bytes collapses to 128 bytes, a 32x reduction in memory and disk.
Hamming distance: search in two CPU cycles
The payoff is speed. Once vectors are binary, similarity becomes Hamming distance, the count of differing bits between two bit-strings. Modern CPUs compute this with a POPCOUNT instruction in a couple of cycles, so a first-pass search over millions of documents runs an order of magnitude faster than float math, and entirely on commodity hardware you control.
The rescore step recovers accuracy
Crushing 32 bits down to 1 obviously loses precision, so naive binary search alone ranks poorly. The standard fix, introduced by Yamada et al. in 2021, is a rescore (or rerank) pass: retrieve a generous candidate set using fast binary Hamming search, then re-score only those candidates by comparing the original float32 query against the binary document vectors using a dot product. This two-stage approach preserves roughly 96% of full retrieval quality while keeping the 32x storage win and most of the speed-up. It is one of the highest-leverage tricks for fitting a serious retrieval index onto a self-hosted node.
Binary quantization is a member of the broader family of compression methods used in local search. Compare it with product quantization, which trades a little more compute for finer-grained codes, and with Matryoshka embeddings, which shrink the vector length rather than the per-dimension bit depth.
In Simple Terms
Binary quantization compresses an embedding by reducing every float32 dimension to a single bit. The rule is trivial: if a (normalized) value is greater than…
