Min-Hash (Minimum Hash) is a locality-sensitive hashing technique that efficiently estimates Jaccard similarity between sets. It enables fast approximate comparison without computing full set intersections.

The Core Insight

Given a random hash function h and two sets A and B:

P(min(h(A)) = min(h(B))) = J(A, B)

The probability that A and B have the same minimum hash value
equals their Jaccard similarity.

This remarkable property allows us to estimate Jaccard similarity by simply comparing hash minima.

Why Does This Work?

Intuitive Explanation

Imagine sorting all elements of A ∪ B by their hash values. The element with the smallest hash could come from:

The probability that the minimum comes from the intersection is:

|A ∩ B| / |A ∪ B| = J(A, B)

Formal Proof

Let h be a uniform random hash function.
Let x* = argminx ∈ A∪B h(x) be the element with minimum hash.

Since h is uniform, each element in A∪B is equally likely to be x*.

P(x* ∈ A ∩ B) = |A ∩ B| / |A ∪ B| = J(A, B)

If x* ∈ A ∩ B, then min(h(A)) = min(h(B)) = h(x*).
Therefore: P(min(h(A)) = min(h(B))) = J(A, B). ∎

Min-Hash Signature

To get a better estimate, we use multiple hash functions:

Given k hash functions h1, h2, ..., hk:

Signature(A) = [min(h1(A)), min(h2(A)), ..., min(hk(A))]

Estimated Jaccard: J'(A,B) = (# matching components) / k

Estimation Accuracy

The estimate is unbiased with variance decreasing as k increases:

E[J'(A,B)] = J(A,B) (unbiased)

Var[J'(A,B)] = J(A,B)(1 - J(A,B)) / k

Standard error ≈ 1/√k
k (hash functions) Standard Error 95% Confidence
4 ±0.50 ±1.00
16 ±0.25 ±0.50
64 ±0.125 ±0.25
256 ±0.0625 ±0.125

Application in SPHDC

In SPHDC, Min-Hash serves a dual purpose:

1. Sparsification

When binding produces more than k elements, Min-Hash selects which elements to keep:

A BIND B = MinHashSelect({ a xor b | a ∈ A, b ∈ B }, k)

From k² candidates, keep only the k with smallest hash values.

This ensures bounded memory while preserving similarity structure.

2. Similarity Preservation

After sparsification, the remaining elements serve as a Min-Hash signature. Sets that were similar before sparsification remain similar after, because they share the same hash minima with probability equal to their original similarity.

Algorithmic Implementation

Basic Algorithm

function minHashSignature(set, k, hashFunctions) {
  const signature = [];
  for (let i = 0; i < k; i++) {
    let minHash = Infinity;
    for (const element of set) {
      const h = hashFunctions[i](element);
      if (h < minHash) minHash = h;
    }
    signature.push(minHash);
  }
  return signature;
}

function estimateJaccard(sigA, sigB) {
  let matches = 0;
  for (let i = 0; i < sigA.length; i++) {
    if (sigA[i] === sigB[i]) matches++;
  }
  return matches / sigA.length;
}

Single-Hash Variant (SPHDC Style)

Instead of multiple hash functions, use a single hash and keep k smallest:

function minHashSelect(set, k, hashFunc) {
  // Sort by hash value
  const sorted = [...set].sort((a, b) =>
    hashFunc(a) - hashFunc(b)
  );
  // Keep k smallest
  return new Set(sorted.slice(0, k));
}

function jaccardFromSets(setA, setB) {
  const intersection = [...setA].filter(x => setB.has(x));
  const union = new Set([...setA, ...setB]);
  return intersection.length / union.size;
}

Locality-Sensitive Hashing (LSH)

Min-Hash is a form of Locality-Sensitive Hashing: similar items hash to the same bucket with high probability.

LSH Property:
If J(A,B) is high → P(same bucket) is high
If J(A,B) is low → P(same bucket) is low

This enables sub-linear nearest neighbor search in large databases.

Historical Context

Min-Hash was developed by Andrei Broder in 1997 at Digital Equipment Corporation (DEC) for the AltaVista search engine.

Original application: detecting near-duplicate web pages by comparing shingle sets (contiguous word sequences).

Trade-offs

Advantages

Disadvantages

Relationship to Other Techniques

Technique Similarity Metric Use Case
Min-Hash Jaccard Set similarity
SimHash Cosine Document vectors
Random Projection Angular distance High-dimensional vectors
Bloom Filter Set membership Fast membership test

Further Reading