Sparse Polynomial HDC (SPHDC) represents concepts as small finite sets of integers in a virtually infinite space. Instead of fixed-length bit vectors, each concept is a set of k elements drawn from the integers modulo 2⁶⁴.


1. The Core Idea: Sets as Representations

Fundamental Insight

In SPHDC, a concept is not a binary pattern but a finite set of integers. The "polynomial" in the name comes from viewing each integer as an exponent in a polynomial over GF(2), though the implementation directly uses set operations.

Definition: SPVector

An SPVector V is a finite set of k distinct 64-bit integers:

V = { e0, e1, ..., ek-1 } where each ei is an integer modulo 2⁶⁴

Why "Polynomial"?

The name comes from the mathematical equivalence between sets of exponents and polynomials over GF(2):

# A set of exponents {3, 7, 12} represents the polynomial:
P(x) = x^3 + x^7 + x^12        (coefficients in GF(2), so + is XOR)

# XOR of two sets = XOR of polynomials:
{3, 7, 12} XOR {5, 7} = {3, 5, 12}     (7 cancels: 7 XOR 7 = 0)
                      = x^3 + x^5 + x^12

However, SPHDC uses a different binding operation (Cartesian product) that preserves structure better than pure polynomial XOR.
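The set/polynomial correspondence can be checked directly with Python's set symmetric difference. This is a minimal sketch of the equivalence, not the SPHDC implementation; the function name is ours:

```python
def poly_xor(a: set[int], b: set[int]) -> set[int]:
    """Symmetric difference: an exponent survives only if it appears in
    exactly one set, since coefficients add mod 2 (x^e + x^e = 0 in GF(2))."""
    return a ^ b

p = {3, 7, 12}   # x^3 + x^7 + x^12
q = {5, 7}       # x^5 + x^7

print(sorted(poly_xor(p, q)))  # [3, 5, 12] -- the shared exponent 7 cancels
```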

2. The k Parameter

Definition: k (Set Cardinality)

The parameter k is the number of integers in each SPVector. It controls the trade-off between representational richness and memory/compute cost:

Intuition: Think of k as the "vocabulary size" for each concept. With k=4, each concept is described by exactly 4 unique "words" (64-bit integers). More words = richer description, but more computation.

Understanding k Values

k Value         Memory      Binding Ops   Interpretation
k=1             8 bytes     1             Single hash per concept (minimal)
k=2             16 bytes    4             Pair of identifiers
k=4 (default)   32 bytes    16            Balanced: distinct yet compact
k=8             64 bytes    64            Higher fidelity representations
k=16            128 bytes   256           Rich similarity gradients

3. Binding: Cartesian XOR

The binding operation in SPHDC creates associations between concepts using the Cartesian product with XOR.

Definition: Cartesian XOR Binding

Given sets A and B, their binding A BIND B is:

A BIND B = { a xor b | a ∈ A, b ∈ B }

Every element of A is XORed with every element of B, producing |A| × |B| results.

[Diagram: Cartesian XOR binding. Set A (k=2) = {a0, a1} and Set B (k=2) = {b0, b1} bind to A BIND B (k=4) = {a0 xor b0, a0 xor b1, a1 xor b0, a1 xor b1}: 2 × 2 = 4 elements, all pairs XORed.]
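The binding operation is a one-line set comprehension. A minimal sketch (the function name `bind` is ours):

```python
def bind(a: set[int], b: set[int]) -> set[int]:
    """Cartesian XOR: every element of a XORed with every element of b."""
    return {x ^ y for x in a for y in b}

A = {5, 9}
B = {3, 7}
print(sorted(bind(A, B)))  # [2, 6, 10, 14]
```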

Why Cartesian XOR?

The XOR Cancellation Property

XOR is its own inverse: x xor x = 0. This gives Cartesian XOR binding a crucial property:

(A BIND B) BIND B ≈ A

Binding with B, then binding again with B, approximately recovers A. This enables reversible encoding of structured information.

Example: XOR Cancellation in Action
A = {5, 9}
B = {3, 7}

A BIND B = {5 xor 3, 5 xor 7, 9 xor 3, 9 xor 7}
       = {6, 2, 10, 14}

(A BIND B) BIND B = {6, 2, 10, 14} BIND {3, 7}
              = {6 xor 3, 6 xor 7, 2 xor 3, 2 xor 7,
                 10 xor 3, 10 xor 7, 14 xor 3, 14 xor 7}
              = {5, 1, 1, 5, 9, 13, 13, 9}
              = {1, 5, 9, 13}        (duplicates collapse in a set)

The result contains both elements of the original A = {5, 9}, plus the
noise elements {1, 13}; sparsification and similarity matching against
the vocabulary recover A.
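The worked example above can be replayed in a few lines of Python (a sketch; the helper name is ours):

```python
def bind(a: set[int], b: set[int]) -> set[int]:
    """Cartesian XOR binding."""
    return {x ^ y for x in a for y in b}

A = {5, 9}
B = {3, 7}
roundtrip = bind(bind(A, B), B)

print(sorted(roundtrip))  # [1, 5, 9, 13]
print(A <= roundtrip)     # True: both original elements of A recovered
```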

4. Sparsification: Keeping Sets Small

When binding two k-sets, the result has up to k² elements. To maintain constant memory, we sparsify back to k elements.

Definition: Min-Hash Sparsification

Given a set S with |S| > k, select the k elements with the smallest hash values:

sparsify(S, k) = the k elements e ∈ S with the smallest hash(e)

Intuition: Min-Hash sampling is like taking a "representative sample" of the set. Elements with the smallest hashes act as "fingerprints" that consistently identify the set, even when the full set is too large to store.
[Diagram: Min-Hash sparsification. Before: 16 elements; keep the k=4 smallest hashes; after: k=4 elements.]

Why Min-Hash Works

Min-Hash has a remarkable property: the probability that two sets share the same minimum hash equals their Jaccard similarity. This means sparsification preserves the essential similarity structure.
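A minimal sketch of Min-Hash sparsification. The choice of BLAKE2b as the hash function is an assumption for illustration; SPHDC's actual hash is not specified here:

```python
import hashlib

def hash64(e: int) -> int:
    """Deterministic 64-bit hash of an element (assumed: BLAKE2b)."""
    digest = hashlib.blake2b(e.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def sparsify(s: set[int], k: int) -> set[int]:
    """Keep the k elements with the smallest hash values."""
    if len(s) <= k:
        return set(s)
    return set(sorted(s, key=hash64)[:k])

big = set(range(16))
small = sparsify(big, 4)
print(len(small))  # 4
```

Because the hash is deterministic, the same input set always sparsifies to the same k elements, which is what makes repeated sparsification stable.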

5. Similarity: Jaccard Index

Similarity between SPVectors is measured by how much their sets overlap.

Definition: Jaccard Similarity

sim(A, B) = |A ∩ B| / |A ∪ B|

The ratio of shared elements to total unique elements. Range: [0, 1].

[Diagram: Jaccard index J(A,B) = |A∩B| / |A∪B|. 0.0 = disjoint (no overlap); 0.5 = half of the elements shared; 1.0 = identical sets.]
Example: Computing Jaccard Similarity
A = {1, 5, 9, 12}      (k=4)
B = {5, 9, 15, 20}     (k=4)

Intersection: A ∩ B = {5, 9}         |A∩B| = 2
Union:        A ∪ B = {1, 5, 9, 12, 15, 20}  |A∪B| = 6

Jaccard:      sim(A, B) = 2/6 = 0.333
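The Jaccard computation is a one-liner over Python sets. This sketch mirrors the example above; the empty-set convention is our assumption:

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity: shared elements over total unique elements."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(a | b)

A = {1, 5, 9, 12}
B = {5, 9, 15, 20}
print(round(jaccard(A, B), 3))  # 0.333
```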

6. Creating Concept Vectors

Each concept gets a deterministic SPVector generated from its name.

Algorithm: Deterministic Vector Creation
  1. Hash the concept name to get a seed: seed = DJB2("Dog")
  2. Initialize a PRNG with the seed
  3. Generate k unique random 64-bit integers using the PRNG
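The three steps above can be sketched as follows. The 32-bit DJB2 variant and Python's built-in PRNG are illustrative assumptions; the actual implementation may differ in both:

```python
import random

def djb2(s: str) -> int:
    """Classic DJB2 string hash, masked to 32 bits (assumed variant)."""
    h = 5381
    for ch in s:
        h = ((h << 5) + h + ord(ch)) & 0xFFFFFFFF
    return h

def make_vector(name: str, k: int = 4) -> set[int]:
    """Deterministic SPVector: seed a PRNG from the name, then draw
    k unique 64-bit integers."""
    rng = random.Random(djb2(name))
    v: set[int] = set()
    while len(v) < k:
        v.add(rng.getrandbits(64))
    return v

print(make_vector("Dog") == make_vector("Dog"))  # True: same name, same vector
print(len(make_vector("Dog")))                   # 4
```

Determinism matters: any party that knows the name "Dog" can regenerate exactly the same SPVector without storing it.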


[Diagram: the name "Dog" is hashed with DJB2 to the seed 0x7b8c3d2e, which initializes a PRNG that emits SPVector(Dog) = {0x7a3f9c..., 0x2c8e1f..., 0x5d1a8c..., 0x9f4b2d...} (k=4 exponents).]

7. Encoding Relations

Structured knowledge is encoded by binding concepts with position vectors.

Relation Encoding Formula

fact(loves, John, Mary) = Loves BIND (Pos1 BIND John) BIND (Pos2 BIND Mary)

Position vectors (Pos1, Pos2, ...) are pre-defined constants that distinguish argument positions.

[Diagram: Loves BIND (Pos1 BIND John) BIND (Pos2 BIND Mary) = fact vector, a set of k exponents encoding "loves(John, Mary)".]

Why Position Vectors?

Without positions, loves(John, Mary) would encode identically to loves(Mary, John) (wrong!). With positions, different argument orderings produce different vectors.
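Putting binding and sparsification together, relation encoding might look like the following sketch. The name-based seeding and the BLAKE2b min-hash are illustrative assumptions, not necessarily what SPHDC uses:

```python
import hashlib
import random

def make_vector(name: str, k: int = 4) -> set[int]:
    """Deterministic vector from a name (seeding is illustrative;
    the article derives the seed via DJB2)."""
    rng = random.Random(name)
    v: set[int] = set()
    while len(v) < k:
        v.add(rng.getrandbits(64))
    return v

def hash64(e: int) -> int:
    digest = hashlib.blake2b(e.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def bind(a: set[int], b: set[int], k: int = 4) -> set[int]:
    """Cartesian XOR followed by min-hash sparsification back to k."""
    prod = {x ^ y for x in a for y in b}
    return set(sorted(prod, key=hash64)[:k])

Loves, John, Mary = (make_vector(n) for n in ("Loves", "John", "Mary"))
Pos1, Pos2 = make_vector("Pos1"), make_vector("Pos2")

# fact(loves, John, Mary) = Loves BIND (Pos1 BIND John) BIND (Pos2 BIND Mary)
fact = bind(bind(Loves, bind(Pos1, John)), bind(Pos2, Mary))
print(len(fact))  # 4: memory stays constant no matter how much is bound
```

Sparsifying after every bind is what keeps the fact vector at k elements regardless of how deeply structured the encoded relation is.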

8. Querying: The Unbind Operation

To answer queries, we "unbind" the known parts to recover the unknown.

Query Principle

Since XOR-based binding cancels known parts, to find ?who in "loves(?who, Mary)":

candidate = KB BIND Loves BIND (Pos2 BIND Mary)

Then match candidate against vocabulary to find the answer.

Example: Query Resolution
# KB contains: loves(John, Mary) encoded as fact_vector

# Query: Who loves Mary?  (loves(?who, Mary))
partial_query = Loves BIND (Pos2 BIND Mary)

# Unbind from KB
candidate = KB BIND partial_query
         = fact_vector BIND Loves BIND (Pos2 BIND Mary)
         = (Pos1 BIND John) + noise

# Extract Pos1
raw_answer = candidate BIND Pos1 ≈ John + noise

# Match against vocabulary
best_match = argmax(sim(raw_answer, vocab[name]))
           = "John"
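The full query loop can be sketched end to end. To keep the XOR cancellations exact, this sketch omits intermediate sparsification (so the known parts cancel perfectly); the real system sparsifies after each bind and matches approximately. All helper names and the seeding scheme are ours:

```python
import random

def make_vector(name: str, k: int = 4) -> set[int]:
    rng = random.Random(name)  # illustrative seeding (article uses DJB2)
    v: set[int] = set()
    while len(v) < k:
        v.add(rng.getrandbits(64))
    return v

def bind(a: set[int], b: set[int]) -> set[int]:
    """Cartesian XOR binding, no sparsification in this sketch."""
    return {x ^ y for x in a for y in b}

def jaccard(a: set[int], b: set[int]) -> float:
    return len(a & b) / len(a | b)

vocab = {n: make_vector(n) for n in ("John", "Mary", "Alice")}
Loves, Pos1, Pos2 = (make_vector(n) for n in ("Loves", "Pos1", "Pos2"))

# KB holds one fact: loves(John, Mary)
fact = bind(bind(Loves, bind(Pos1, vocab["John"])), bind(Pos2, vocab["Mary"]))

# Query loves(?who, Mary): unbind the known parts, then Pos1.
candidate = bind(bind(bind(fact, Loves), bind(Pos2, vocab["Mary"])), Pos1)

# candidate contains John's elements (forced cancellations) plus noise;
# matching against the vocabulary picks out the answer.
best_match = max(vocab, key=lambda n: jaccard(candidate, vocab[n]))
print(best_match)  # John
```

The candidate set is much larger than k here because nothing was sparsified; only the elements produced by exact cancellation overlap with any vocabulary vector, which is why the Jaccard match singles out "John".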

9. Mathematical Properties

Key Properties of SPHDC