Locality-Sensitive Hashing

Locality-Sensitive Hashing (LSH) is the retrieval mechanism that makes AGISystem2 fast. Instead of comparing a query against every stored concept, LSH groups similar vectors into buckets, and only concepts in the query's buckets need an exact comparison. This reduces the search cost from O(n) to roughly the size of a few buckets, which is sublinear in n and close to constant when buckets stay small.

For example, instead of comparing the query against all 10,000 stored concepts, the Retriever might compare it only against the 4 concepts in bucket 42. The guarantee is probabilistic: similar vectors hash to the same or nearby buckets with high probability, not with certainty.

How It Works

LSH uses random hyperplane projections to create hash codes. Each bit of the hash is determined by which side of a random hyperplane the vector falls on. Similar vectors tend to fall on the same side of most hyperplanes, producing similar hash codes and landing in the same or adjacent buckets.
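The hyperplane scheme described above can be sketched in a few lines. This is a minimal illustration, not the AGISystem2 code: the function names, the Gaussian sampling of hyperplane normals, and the packing of bits into an integer are all assumptions made for the example.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals (assumed Gaussian components)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_vector(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    code = 0
    for normal in hyperplanes:
        dot = sum(v * n for v, n in zip(vector, normal))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def hamming(a, b):
    """Number of bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")

planes = make_hyperplanes(num_bits=8, dim=16, seed=42)
rng = random.Random(1)
base = [rng.gauss(0.0, 1.0) for _ in range(16)]
near = [x + rng.gauss(0.0, 0.05) for x in base]  # small perturbation of base
far = [-x for x in base]                         # opposite direction

print("base vs near:", hamming(hash_vector(base, planes), hash_vector(near, planes)))
print("base vs far: ", hamming(hash_vector(base, planes), hash_vector(far, planes)))
```

A slightly perturbed vector usually differs from the original in few bits, while the opposite vector lies on the other side of every hyperplane and so differs in all of them. Because only the sign of each projection is used, the hash depends on a vector's direction, not its length.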

The Retriever maintains multiple hash tables with different random projections. This reduces false negatives: if two similar vectors end up in different buckets in one table, they're likely to share a bucket in another table.
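To see why extra tables reduce false negatives, it helps to estimate collision probabilities. The sketch below relies on the standard result for sign-based random hyperplane hashing: a single hyperplane puts two vectors at angle theta on the same side with probability 1 - theta/pi. It further assumes independent hyperplanes and exact bucket matches (no probing of adjacent buckets); the function names are illustrative.

```python
import math

def bit_agreement(theta):
    """Probability that one random hyperplane keeps two vectors on the same side."""
    return 1.0 - theta / math.pi

def table_hit(theta, hash_bits):
    """Probability that all hash_bits bits agree, i.e. same bucket in one table."""
    return bit_agreement(theta) ** hash_bits

def any_table_hit(theta, hash_bits, num_tables):
    """Probability that the two vectors share a bucket in at least one table."""
    miss_one_table = 1.0 - table_hit(theta, hash_bits)
    return 1.0 - miss_one_table ** num_tables

theta = math.radians(15)  # two fairly similar vectors
for num_tables in (1, 4, 16):
    print(num_tables, "table(s):", round(any_table_hit(theta, 8, num_tables), 3))
```

Under these assumptions the chance of missing a similar vector shrinks geometrically with the number of tables, which is the effect the paragraph above describes.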

Two-Stage Retrieval

Retrieval happens in two stages:

Candidate selection – Hash the query, fetch all vectors in matching buckets across all tables
Exact ranking – Compute masked L1 distance for candidates, return top-k

This combines LSH's speed (sublinear candidate selection) with exact distance accuracy (precise ranking of candidates).
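The two stages can be sketched as follows. This is an illustration under stated assumptions: the mask is taken to be a per-dimension 0/1 relevance vector, each table is represented as a pair of hyperplanes and a bucket dictionary, and the names `masked_l1`, `sign_hash`, and `two_stage_query` are hypothetical.

```python
import heapq
import random

def masked_l1(a, b, mask):
    """L1 distance over the dimensions where mask is 1 (assumed 0/1 mask)."""
    return sum(m * abs(x - y) for x, y, m in zip(a, b, mask))

def sign_hash(vector, planes):
    """Bucket key: one boolean per hyperplane (side of the plane)."""
    return tuple(sum(v * n for v, n in zip(vector, p)) >= 0 for p in planes)

def two_stage_query(query, mask, k, tables, vectors):
    # Stage 1: candidate selection, union of the query's bucket in every table.
    candidates = set()
    for planes, buckets in tables:
        candidates |= buckets.get(sign_hash(query, planes), set())
    # Stage 2: exact ranking of the candidates only, by masked L1 distance.
    scored = [(masked_l1(query, vectors[cid], mask), cid) for cid in candidates]
    return heapq.nsmallest(k, scored)

# Build a small index: 200 random vectors, 4 tables of 8 bits each.
rng = random.Random(0)
dim, num_tables, hash_bits = 8, 4, 8
vectors = {f"c{i}": [rng.gauss(0.0, 1.0) for _ in range(dim)] for i in range(200)}
tables = []
for _ in range(num_tables):
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(hash_bits)]
    buckets = {}
    for cid, vec in vectors.items():
        buckets.setdefault(sign_hash(vec, planes), set()).add(cid)
    tables.append((planes, buckets))

mask = [1] * dim
top = two_stage_query(vectors["c7"], mask, k=3, tables=tables, vectors=vectors)
print(top)
```

Querying with a stored vector always returns that vector first at distance 0, because it shares a bucket with itself in every table. The remaining results are the closest of the candidates, which may not be the true nearest neighbors if those landed in other buckets.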

Implementation

The Retriever component manages LSH tables:

Operation	Description
`index(vector, id)`	Add vector to all hash tables
`query(vector, k)`	Find k nearest neighbors
`remove(id)`	Remove from all tables
`rehash()`	Rebuild tables (after many changes)
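A compact sketch of a class exposing these four operations is shown below. Only the operation names come from the table; the constructor arguments, the internal data structures, and the use of unmasked L1 distance for ranking are assumptions made to keep the example short.

```python
import random

class Retriever:
    """Illustrative LSH retriever; internals are assumptions, not the real component."""

    def __init__(self, dim, num_tables=4, hash_bits=8, seed=0):
        self.dim, self.num_tables, self.hash_bits = dim, num_tables, hash_bits
        self._rng = random.Random(seed)
        self._vectors = {}  # id -> stored vector
        self._build_tables()

    def _build_tables(self):
        # Fresh random hyperplanes per table, then re-insert every stored vector.
        self._planes = [[[self._rng.gauss(0.0, 1.0) for _ in range(self.dim)]
                         for _ in range(self.hash_bits)]
                        for _ in range(self.num_tables)]
        self._buckets = [{} for _ in range(self.num_tables)]
        for id_, vector in self._vectors.items():
            self._insert(id_, vector)

    def _code(self, vector, planes):
        code = 0
        for p in planes:
            bit = 1 if sum(v * n for v, n in zip(vector, p)) >= 0 else 0
            code = (code << 1) | bit
        return code

    def _insert(self, id_, vector):
        for planes, buckets in zip(self._planes, self._buckets):
            buckets.setdefault(self._code(vector, planes), set()).add(id_)

    def index(self, vector, id_):
        """Add vector to all hash tables (replacing any earlier entry for id_)."""
        self.remove(id_)
        self._vectors[id_] = list(vector)
        self._insert(id_, vector)

    def remove(self, id_):
        """Remove id_ from all tables; unknown ids are ignored."""
        vector = self._vectors.pop(id_, None)
        if vector is None:
            return
        for planes, buckets in zip(self._planes, self._buckets):
            code = self._code(vector, planes)
            buckets[code].discard(id_)
            if not buckets[code]:
                del buckets[code]

    def query(self, vector, k):
        """Return up to k ids, ranked by exact L1 distance among bucket candidates."""
        candidates = set()
        for planes, buckets in zip(self._planes, self._buckets):
            candidates |= buckets.get(self._code(vector, planes), set())
        scored = sorted((sum(abs(a - b) for a, b in zip(vector, self._vectors[c])), c)
                        for c in candidates)
        return [c for _, c in scored[:k]]

    def rehash(self):
        """Rebuild all tables with new random projections."""
        self._build_tables()

r = Retriever(dim=4)
r.index([1.0, 0.0, 0.0, 0.0], "a")
r.index([0.9, 0.1, 0.0, 0.0], "b")
r.index([-1.0, 0.0, 0.0, 0.0], "c")
print(r.query([1.0, 0.0, 0.0, 0.0], k=2))
```

In this sketch `remove` recomputes the vector's hash codes to find its buckets, so it needs the stored vector. A real implementation might instead keep a per-id list of bucket keys to avoid rehashing on removal.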

Configuration

LSH parameters are set via configuration profiles:

Parameter	auto_test	prod	Effect
numTables	4	16	More tables = fewer false negatives
hashBits	8	12	More bits = smaller buckets
probeRadius	1	2	Check adjacent buckets too
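The sketch below shows how these parameters could translate into lookup work per query. The profile values are copied from the table; the dictionary layout and the interpretation of probeRadius as a Hamming radius (the number of hash bits allowed to differ from the query's code) are assumptions for illustration.

```python
from itertools import combinations

# Values from the configuration table; the dict structure itself is hypothetical.
PROFILES = {
    "auto_test": {"numTables": 4, "hashBits": 8, "probeRadius": 1},
    "prod": {"numTables": 16, "hashBits": 12, "probeRadius": 2},
}

def probe_codes(code, hash_bits, radius):
    """Yield the query's bucket code, then every code within `radius` bit flips."""
    yield code
    for r in range(1, radius + 1):
        for positions in combinations(range(hash_bits), r):
            flipped = code
            for pos in positions:
                flipped ^= 1 << pos  # flip one bit of the hash code
            yield flipped

def probes_per_query(profile):
    """Total bucket lookups for one query: probes per table times number of tables."""
    cfg = PROFILES[profile]
    per_table = sum(1 for _ in probe_codes(0, cfg["hashBits"], cfg["probeRadius"]))
    return cfg["numTables"] * per_table

for name in PROFILES:
    print(name, probes_per_query(name))
```

Under this interpretation, auto_test performs 36 bucket lookups per query (9 per table across 4 tables) and prod performs 1264 (79 per table across 16 tables), which illustrates how recall-oriented settings increase query cost.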

Trade-offs

Speed vs accuracy – More tables improve recall but cost memory and time
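Memory vs precision – More hash bits create smaller buckets and fewer candidates to rank, but each table then has more buckets to store and similar vectors are more likely to be split apart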
Build vs query – Indexing adds up-front work that a brute-force scan does not need (every vector is hashed into every table), but each query then touches far fewer vectors
