“Find the 10 most similar items to this one” sounds simple. With millions of items represented as 256-dimensional vectors, exact search is too slow to be useful in production.

## What Embeddings Are

An ML model maps an item (a product, a document, a user’s history) to a dense numeric vector. The geometry of that vector space encodes semantic similarity: similar items land close together. You train the model on interaction data and the embeddings learn to represent “things that users treat similarly.”

When a query embedding and a document embedding are close in vector space, the query and document are likely relevant to each other. This is what powers modern search and recommendation.
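A common closeness measure is cosine similarity. A minimal numpy sketch with made-up 4-dimensional vectors (real embeddings are learned and much higher-dimensional):

```python
import numpy as np

# Toy "embeddings" -- hand-picked values, purely illustrative.
query = np.array([0.9, 0.1, 0.0, 0.3])
doc_a = np.array([0.8, 0.2, 0.1, 0.4])  # points in a similar direction
doc_b = np.array([0.0, 0.9, 0.8, 0.1])  # points elsewhere

def cosine(u, v):
    """Cosine similarity: dot product of the normalized vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(query, doc_a))  # high: likely relevant
print(cosine(query, doc_b))  # low: likely not
```

Cosine similarity ignores vector magnitude, which is why many systems normalize embeddings and use the plain dot product instead.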

## Why Exact k-NN Doesn’t Scale

Finding the exact k nearest neighbors requires computing the distance from your query vector to every vector in the index. With 10 million items and 256 dimensions, that is roughly 2.5 billion floating-point multiply-adds per query. Against a P99 latency budget of under 50 ms, exact search breaks down somewhere in the millions of vectors.
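A brute-force version of that computation, scaled down to 100k random vectors (the sizes are illustrative, not from a real index):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100_000, 256, 10          # scaled-down stand-in for 10M items
items = rng.standard_normal((n, d)).astype(np.float32)
query = rng.standard_normal(d).astype(np.float32)

# Exact k-NN: one squared distance per item -- O(n * d) work per query,
# with no way to skip items without risking a missed neighbor.
dists = ((items - query) ** 2).sum(axis=1)
top_k = np.argsort(dists)[:k]

print(top_k)
```

Even fully vectorized, the cost grows linearly with the index, which is the core scaling problem.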

Approximate Nearest Neighbor (ANN) algorithms trade recall for speed. You find items that are probably the closest, not provably the closest. In practice, 95% recall at 10ms beats 100% recall at 500ms for recommendation use cases.
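Recall here means overlap with the exact top-k. A tiny helper makes that concrete (the IDs are made up):

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Hypothetical results: the ANN index missed one of the true top-5.
exact = [7, 42, 3, 19, 88]
approx = [7, 42, 3, 19, 61]
print(recall_at_k(approx, exact))
```

Measuring this on a held-out sample of queries against a brute-force baseline is the standard way to validate an ANN index before shipping it.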

## How HNSW Works

HNSW (Hierarchical Navigable Small World) builds a graph where each node connects to its nearest neighbors. Search starts in a sparse top layer for fast, coarse navigation, then drills down through progressively denser layers to find more precise neighbors. It is essentially a skip list generalized to vector space.

```mermaid
graph TD
    A[Query Vector] --> B[HNSW Graph Entry Point - Top Layer]
    B --> C[Navigate to Closest Node at Layer 2]
    C --> D[Navigate to Closest Node at Layer 1]
    D --> E[Search Neighborhood at Layer 0]
    E --> F[Return Approximate k Nearest Neighbors]
    style A fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style E fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style F fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
```
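The skip-list-like shape comes from how HNSW assigns each inserted point a top layer drawn from an exponentially decaying distribution, so each layer holds only a small fraction of the layer below it. A minimal sketch of that assignment, using the level multiplier mL = 1/ln(M) recommended in the HNSW paper:

```python
import math
import random
from collections import Counter

random.seed(0)
M = 16                 # max neighbors per node in the upper layers
mL = 1 / math.log(M)   # level multiplier from the HNSW paper

# Each point's top layer: floor(-ln(uniform) * mL). With M = 16, roughly
# 1 in 16 points in any layer also appears in the layer above it.
levels = [math.floor(-math.log(random.random()) * mL) for _ in range(100_000)]
hist = Counter(levels)

for level in sorted(hist, reverse=True):
    print(f"layer {level} is the top layer for {hist[level]} points")
```

The result is a sparse top layer for long jumps and a dense bottom layer (containing every point) for the final precise search.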

## At Oracle

We had a feature similarity search for telecommunications equipment configurations: given a failing configuration, find the top 5 similar configurations that were resolved. With around 800,000 configuration records and feature vectors of ~100 dimensions, exact search was taking 8-10 seconds per query. Switching to an approximate index (we used an open-source HNSW library) brought it to under 200ms with about 92% recall versus the exact results. The engineers using the tool didn’t notice the 8% miss rate because the recommendations were suggestions, not guarantees.

## What I’m Learning

In most ANN indexes, the recall vs. latency trade-off is tunable at query time. In HNSW, for example, the `ef` search parameter controls how much of the graph is explored: more exploration can push recall to 99%, while less drops it toward 85% but returns faster. The right operating point depends on your product requirements.
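The effect of the exploration budget can be seen in a self-contained toy: a brute-force kNN graph stands in for HNSW's bottom layer, and a best-first search keeps an `ef`-sized candidate list, as in the HNSW search routine. This is an illustrative sketch, not any real library's implementation:

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8)).astype(np.float32)
q = rng.standard_normal(8).astype(np.float32)
k, M = 10, 8

# Brute-force kNN graph: each node links to its M nearest neighbors
# (a stand-in for HNSW's layer 0).
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)
graph = np.argsort(d2, axis=1)[:, :M]

def search(q, ef, entry=0):
    """Best-first graph search keeping an ef-sized list of best nodes."""
    dist = lambda i: float(((X[i] - q) ** 2).sum())
    visited = {entry}
    frontier = [(dist(entry), entry)]   # min-heap: closest unexpanded node
    best = [(-dist(entry), entry)]      # max-heap: ef best nodes seen so far
    while frontier:
        d, node = heapq.heappop(frontier)
        if d > -best[0][0] and len(best) >= ef:
            break                       # frontier cannot improve the result
        for nb in map(int, graph[node]):
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)
    return {i for _, i in heapq.nsmallest(k, [(-d, i) for d, i in best])}

exact = set(np.argsort(((X - q) ** 2).sum(-1))[:k].tolist())
for ef in (k, 16 * k):
    found = search(q, ef)
    print(f"ef={ef}: recall@{k} = {len(found & exact) / k:.2f}")
```

A larger `ef` means more nodes expanded per query (higher latency) but fewer missed neighbors, which is exactly the knob production systems tune against their latency budget.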

Have you tuned ANN recall parameters in production, or treated it as a black box?