Locality-sensitive hashing

Locality-sensitive hashing (LSH) is a technique for reducing the dimensionality of high-dimensional data. It hashes input items so that similar items map to the same “buckets” with high probability; the number of buckets is much smaller than the universe of possible input items. Unlike conventional and cryptographic hash functions, which aim to minimize collisions, LSH aims to maximize the probability of a “collision” between similar items. Locality-sensitive hashing has much in common with data clustering and nearest neighbor search.
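The bucket behaviour described above can be illustrated with a minimal sketch. It assumes random-hyperplane hashing (sign random projection), a common LSH scheme for angular similarity that this text does not itself specify. The names `make_hasher` and `collision_rate` are hypothetical and chosen only for illustration.

```python
import random


def make_hasher(dim, n_bits, seed=0):
    """Return a function mapping a dim-dimensional vector to one of 2**n_bits buckets.

    Assumption: random-hyperplane hashing; each bit records on which side
    of a random hyperplane through the origin the vector lies.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def bucket(x):
        key = 0
        for normal in planes:
            side = sum(a * b for a, b in zip(normal, x)) >= 0
            key = (key << 1) | int(side)  # append one bit to the bucket id
        return key

    return bucket


def collision_rate(x, y, dim, n_bits, trials=200):
    """Fraction of independently drawn hashers that put x and y in the same bucket."""
    hits = 0
    for seed in range(trials):
        h = make_hasher(dim, n_bits, seed)
        hits += h(x) == h(y)
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.gauss(0.0, 1.0) for _ in range(50)]
    near = [v + 0.01 * rng.gauss(0.0, 1.0) for v in x]  # small perturbation of x
    far = [rng.gauss(0.0, 1.0) for _ in range(50)]      # unrelated vector
    print("near pair:", collision_rate(x, near, 50, 8))
    print("far pair: ", collision_rate(x, far, 50, 8))
```

With 8 bits there are only 256 buckets for a 50-dimensional input space, yet a slightly perturbed copy of a vector should share its bucket far more often than an unrelated vector does.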

Hashing-based approximate nearest neighbor search algorithms generally use one of two main categories of hashing methods: data-independent methods, such as locality-sensitive hashing (LSH), or data-dependent methods, such as locality-preserving hashing (LPH).

An LSH family $\mathcal{F}$ is defined for a metric space $\mathcal{M}=(M,d)$, a threshold $R>0$, and an approximation factor $c>1$. It is a family of functions $h\colon M\to S$ that map elements of the metric space to buckets $s\in S$. For any two points $p,q\in M$ and a function $h\in\mathcal{F}$ chosen uniformly at random, the family satisfies the following conditions:

...
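As a rough illustration of choosing $h\in\mathcal{F}$ uniformly at random, the sketch below assumes one concrete setting that this text does not name: $M=\{0,1\}^n$ with the Hamming distance as $d$, and the bit-sampling family in which each $h$ returns a single coordinate of its input. For this family the probability that $h(p)=h(q)$ is exactly $1-d(p,q)/n$, so closer points collide more often. The function names are hypothetical.

```python
import random


def hamming(p, q):
    """Hamming distance between two equal-length bit sequences."""
    return sum(a != b for a, b in zip(p, q))


def sample_hash(n, rng):
    """Draw h uniformly at random from the bit-sampling family on {0,1}^n.

    Assumption: the family is {h_i : x -> x[i]} for i in 0..n-1.
    """
    i = rng.randrange(n)
    return lambda x: x[i]


def estimate_collision(p, q, trials=10_000, seed=0):
    """Monte Carlo estimate of Pr[h(p) == h(q)] over the random choice of h."""
    rng = random.Random(seed)
    n = len(p)
    hits = 0
    for _ in range(trials):
        h = sample_hash(n, rng)
        hits += h(p) == h(q)
    return hits / trials


if __name__ == "__main__":
    n = 20
    p = [0] * n
    near = [1] * 2 + [0] * (n - 2)   # Hamming distance 2 from p
    far = [1] * 12 + [0] * (n - 12)  # Hamming distance 12 from p
    for q in (near, far):
        exact = 1 - hamming(p, q) / n
        print(hamming(p, q), exact, estimate_collision(p, q))
```

The estimate should track the exact value $1-d(p,q)/n$ up to sampling noise, which gives a concrete case where the collision probability falls as the distance grows.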
Wikipedia