Min-Hash (Minimum Hash) is a locality-sensitive hashing technique that efficiently estimates Jaccard similarity between sets. It enables fast approximate comparison without computing full set intersections.
Given a random hash function h and two sets A and B:
P(min(h(A)) = min(h(B))) = J(A, B)
This remarkable property allows us to estimate Jaccard similarity by simply comparing hash minima.
Imagine sorting all elements of A ∪ B by their hash values. The element with the smallest hash could come from:
The probability that the minimum comes from the intersection is:
|A ∩ B| / |A ∪ B| = J(A, B)
To get a better estimate, we use multiple hash functions:
The estimate is unbiased with variance decreasing as k increases:
| k (hash functions) | Standard Error | 95% Confidence |
|---|---|---|
| 4 | ±0.50 | ±1.00 |
| 16 | ±0.25 | ±0.50 |
| 64 | ±0.125 | ±0.25 |
| 256 | ±0.0625 | ±0.125 |
In SPHDC, Min-Hash serves a dual purpose:
When binding produces more than k elements, Min-Hash selects which elements to keep:
A BIND B = MinHashSelect({ a xor b | a ∈ A, b ∈ B }, k)
This ensures bounded memory while preserving similarity structure.
After sparsification, the remaining elements serve as a Min-Hash signature. Sets that were similar before sparsification remain similar after, because they share the same hash minima with probability equal to their original similarity.
function minHashSignature(set, k, hashFunctions) {
const signature = [];
for (let i = 0; i < k; i++) {
let minHash = Infinity;
for (const element of set) {
const h = hashFunctions[i](element);
if (h < minHash) minHash = h;
}
signature.push(minHash);
}
return signature;
}
function estimateJaccard(sigA, sigB) {
let matches = 0;
for (let i = 0; i < sigA.length; i++) {
if (sigA[i] === sigB[i]) matches++;
}
return matches / sigA.length;
}
Instead of multiple hash functions, use a single hash and keep k smallest:
function minHashSelect(set, k, hashFunc) {
// Sort by hash value
const sorted = [...set].sort((a, b) =>
hashFunc(a) - hashFunc(b)
);
// Keep k smallest
return new Set(sorted.slice(0, k));
}
function jaccardFromSets(setA, setB) {
const intersection = [...setA].filter(x => setB.has(x));
const union = new Set([...setA, ...setB]);
return intersection.length / union.size;
}
Min-Hash is a form of Locality-Sensitive Hashing: similar items hash to the same bucket with high probability.
This enables sub-linear nearest neighbor search in large databases.
Min-Hash was developed by Andrei Broder in 1997 at Digital Equipment Corporation (DEC) for the AltaVista search engine.
Original application: detecting near-duplicate web pages by comparing shingle sets (contiguous word sequences).
| Technique | Similarity Metric | Use Case |
|---|---|---|
| Min-Hash | Jaccard | Set similarity |
| SimHash | Cosine | Document vectors |
| Random Projection | Angular distance | High-dimensional vectors |
| Bloom Filter | Set membership | Fast membership test |