The Jaccard similarity coefficient (also called Jaccard index or Intersection over Union) measures similarity between two finite sets. It is defined as the size of the intersection divided by the size of the union.

Definition

For two sets A and B, the Jaccard similarity is:

J(A, B) = |A ∩ B| / |A ∪ B|

Where:
|A ∩ B| = number of elements in both A and B (intersection)
|A ∪ B| = number of elements in A or B or both (union)
A B A∩B

Properties

Property Value Meaning
Range [0, 1] Always between 0 and 1 inclusive
J(A, A) 1 Identical sets have similarity 1
J(A, B) when A ∩ B = ∅ 0 Disjoint sets have similarity 0
Symmetry J(A, B) = J(B, A) Order doesn't matter

Examples

Example 1: Partial Overlap

A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

A ∩ B = {3, 4} → |A ∩ B| = 2
A ∪ B = {1, 2, 3, 4, 5, 6} → |A ∪ B| = 6

J(A, B) = 2/6 = 0.333

Example 2: High Similarity

A = {1, 2, 3, 4, 5}
B = {1, 2, 3, 4, 6}

A ∩ B = {1, 2, 3, 4} → |A ∩ B| = 4
A ∪ B = {1, 2, 3, 4, 5, 6} → |A ∪ B| = 6

J(A, B) = 4/6 = 0.667

Example 3: No Overlap

A = {1, 2, 3}
B = {4, 5, 6}

A ∩ B = {} → |A ∩ B| = 0
A ∪ B = {1, 2, 3, 4, 5, 6} → |A ∪ B| = 6

J(A, B) = 0/6 = 0

Jaccard Distance

The Jaccard distance is the complement of Jaccard similarity:

dJ(A, B) = 1 - J(A, B) = |A △ B| / |A ∪ B|

Where A △ B is the symmetric difference (elements in exactly one set).

Jaccard distance is a proper metric: it satisfies non-negativity, identity, symmetry, and the triangle inequality.

Why Jaccard for HDC?

Jaccard similarity is ideal for set-based HDC strategies like SPHDC because:

1. Bounded Range

Values always between 0 and 1, making it easy to interpret and compare.

2. Captures Structural Overlap

Measures how much two representations share, which aligns with semantic similarity.

3. Efficient Estimation

Can be efficiently estimated using Min-Hash without computing full set intersection.

4. Scale Invariant

The ratio normalizes for set size, so large and small sets can be fairly compared.

Comparison with Other Similarity Measures

Measure Formula Use Case
Jaccard |A∩B| / |A∪B| Set similarity, SPHDC
Dice 2|A∩B| / (|A|+|B|) More weight to overlap
Cosine A·B / (||A|| ||B||) Vector angle similarity
Hamming 1 - (differences/total) Binary vectors, Dense-Binary HDC
Overlap |A∩B| / min(|A|,|B|) Subset detection

Implementation in AGISystem2

In SPHDC, Jaccard similarity between two SPVectors is computed as:

function jaccard(setA, setB) {
  const intersection = new Set(
    [...setA].filter(x => setB.has(x))
  );
  const union = new Set([...setA, ...setB]);

  if (union.size === 0) return 1.0;  // Both empty
  return intersection.size / union.size;
}

Historical Context

Named after Paul Jaccard (1868-1944), a Swiss botanist who introduced the coefficient in 1901 to compare the similarity of sample sets when studying plant distributions in the Swiss Alps.

Original application: comparing floral regions by the ratio of shared species to total species observed.

Applications Beyond HDC

Further Reading