The Jaccard similarity coefficient (also called Jaccard index or Intersection over Union) measures similarity between two finite sets. It is defined as the size of the intersection divided by the size of the union.
For two sets A and B, the Jaccard similarity is:
J(A, B) = |A ∩ B| / |A ∪ B|
|A ∩ B| = number of elements in both A and B (intersection)
|A ∪ B| = number of elements in A or B or both (union)
| Property | Value | Meaning |
|---|---|---|
| Range | [0, 1] | Always between 0 and 1 inclusive |
| J(A, A) | 1 | Identical sets have similarity 1 |
| J(A, B) when A ∩ B = ∅ | 0 | Disjoint sets have similarity 0 |
| Symmetry | J(A, B) = J(B, A) | Order doesn't matter |
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
A ∩ B = {3, 4} → |A ∩ B| = 2
A ∪ B = {1, 2, 3, 4, 5, 6} → |A ∪ B| = 6
J(A, B) = 2/6 = 0.333
A = {1, 2, 3, 4, 5}
B = {1, 2, 3, 4, 6}
A ∩ B = {1, 2, 3, 4} → |A ∩ B| = 4
A ∪ B = {1, 2, 3, 4, 5, 6} → |A ∪ B| = 6
J(A, B) = 4/6 = 0.667
A = {1, 2, 3}
B = {4, 5, 6}
A ∩ B = {} → |A ∩ B| = 0
A ∪ B = {1, 2, 3, 4, 5, 6} → |A ∪ B| = 6
J(A, B) = 0/6 = 0
The Jaccard distance is the complement of Jaccard similarity:
dJ(A, B) = 1 - J(A, B) = |A △ B| / |A ∪ B|
A △ B is the symmetric difference (elements in exactly one set).
Jaccard distance is a proper metric: it satisfies non-negativity, identity, symmetry, and the triangle inequality.
Jaccard similarity is ideal for set-based HDC strategies like SPHDC because:
Values always between 0 and 1, making it easy to interpret and compare.
Measures how much two representations share, which aligns with semantic similarity.
Can be efficiently estimated using Min-Hash without computing full set intersection.
The ratio normalizes for set size, so large and small sets can be fairly compared.
| Measure | Formula | Use Case |
|---|---|---|
| Jaccard | |A∩B| / |A∪B| | Set similarity, SPHDC |
| Dice | 2|A∩B| / (|A|+|B|) | More weight to overlap |
| Cosine | A·B / (||A|| ||B||) | Vector angle similarity |
| Hamming | 1 - (differences/total) | Binary vectors, Dense-Binary HDC |
| Overlap | |A∩B| / min(|A|,|B|) | Subset detection |
In SPHDC, Jaccard similarity between two SPVectors is computed as:
function jaccard(setA, setB) {
const intersection = new Set(
[...setA].filter(x => setB.has(x))
);
const union = new Set([...setA, ...setB]);
if (union.size === 0) return 1.0; // Both empty
return intersection.size / union.size;
}
Named after Paul Jaccard (1868-1944), a Swiss botanist who introduced the coefficient in 1901 to compare the similarity of sample sets when studying plant distributions in the Swiss Alps.
Original application: comparing floral regions by the ratio of shared species to total species observed.