Inverted Indexes: How Search Actually Works
You have 50,000 documents. User searches for “connection pooling”. You could scan every document for those words. That’s O(n) per query. At scale, it’s unusable.
An inverted index flips the relationship. Instead of asking “what words are in this document?”, you pre-compute “which documents contain this word?”
The Structure
For each term, maintain a sorted list of document IDs (a posting list).
```java
import java.util.*;

public class InvertedIndex {
    private final Map<String, TreeSet<Long>> index = new HashMap<>();

    public void addDocument(long docId, String content) {
        for (String term : tokenize(content)) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Long> search(String query) {
        String[] terms = tokenize(query);
        if (terms.length == 0) {
            return Set.of();
        }
        Set<Long> result = new TreeSet<>(index.getOrDefault(terms[0], new TreeSet<>()));
        for (int i = 1; i < terms.length; i++) {
            result.retainAll(index.getOrDefault(terms[i], new TreeSet<>()));
        }
        return result; // documents containing ALL query terms
    }

    private String[] tokenize(String text) {
        // lowercase, split on non-word characters, drop empty tokens
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(t -> !t.isEmpty())
                .toArray(String[]::new);
    }
}
```
Search “connection pooling”: look up the posting list for “connection” (docs 4, 17, 23, 89), look up “pooling” (docs 17, 42, 89). Intersect. Result: docs 17 and 89. Two hash lookups and a set intersection, regardless of how many total documents exist.
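The intersection step can be sketched directly with Java sets, using the illustrative posting lists and doc IDs from the example above:

```java
import java.util.Set;
import java.util.TreeSet;

public class IntersectDemo {
    public static void main(String[] args) {
        // Posting lists from the example: "connection" and "pooling"
        Set<Long> connection = new TreeSet<>(Set.of(4L, 17L, 23L, 89L));
        Set<Long> pooling = new TreeSet<>(Set.of(17L, 42L, 89L));

        // AND semantics: keep only doc IDs present in both lists
        connection.retainAll(pooling);
        System.out.println(connection); // [17, 89]
    }
}
```

`retainAll` on sorted sets is the in-memory analogue of the merge-style intersection that on-disk search engines perform over compressed posting lists.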
Adding Relevance
A posting list can store more than just document IDs. Add term frequency (how often the word appears in each document) and you can rank results by relevance. A document mentioning “pooling” 12 times is probably more relevant than one mentioning it once.
This is the foundation of TF-IDF: term frequency (how often the term appears in this doc) weighted by inverse document frequency (typically the log of total documents over the number of documents containing the term). Common words like “the” appear everywhere and get low scores. Rare, specific terms score high.
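As a sketch, the textbook weighting is tf × log(N/df); real engines use many variants (smoothing, BM25), but the shape is the same:

```java
public class TfIdf {
    /**
     * Classic TF-IDF weight: term frequency times the log of
     * (total documents / documents containing the term).
     */
    static double tfIdf(int tf, long totalDocs, long docFreq) {
        return tf * Math.log((double) totalDocs / docFreq);
    }

    public static void main(String[] args) {
        // "the" appears in nearly every doc: low weight even at high tf
        System.out.println(tfIdf(30, 50_000, 49_000));
        // "pooling" is rare: high weight even at modest tf
        System.out.println(tfIdf(12, 50_000, 120));
    }
}
```

Note that a term appearing in every document gets weight log(1) = 0, which is exactly the “common words score low” behavior described above.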
Keeping the Index Updated
Building the index is expensive. Updating it incrementally is cheaper. New document? Tokenize and add entries. Deleted document? Remove from posting lists. Modified document? Remove old entries, add new ones.
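A minimal sketch of incremental maintenance, assuming the index also keeps a docId → terms reverse map so deletes don’t have to re-tokenize or scan every posting list (class and method names here are hypothetical):

```java
import java.util.*;

public class UpdatableIndex {
    private final Map<String, TreeSet<Long>> index = new HashMap<>();
    // Reverse map so a delete knows exactly which posting lists to touch
    private final Map<Long, Set<String>> docTerms = new HashMap<>();

    public void addDocument(long docId, String content) {
        Set<String> terms = new HashSet<>(Arrays.asList(content.toLowerCase().split("\\W+")));
        terms.remove("");
        for (String term : terms) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        docTerms.put(docId, terms);
    }

    public void removeDocument(long docId) {
        for (String term : docTerms.getOrDefault(docId, Set.of())) {
            TreeSet<Long> postings = index.get(term);
            if (postings != null) {
                postings.remove(docId);
                if (postings.isEmpty()) {
                    index.remove(term); // drop empty posting lists
                }
            }
        }
        docTerms.remove(docId);
    }

    public void updateDocument(long docId, String newContent) {
        removeDocument(docId);          // remove old entries
        addDocument(docId, newContent); // add new ones
    }

    public Set<Long> postings(String term) {
        return index.getOrDefault(term, new TreeSet<>());
    }
}
```

The reverse map doubles write-side memory, which is part of the write amplification mentioned later: every document mutation fans out to one write per distinct term.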
For large-scale systems, the index is often built offline in batch (the classic MapReduce use case) and swapped in atomically. Real-time updates go to a small secondary index that gets merged periodically.
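The merge step can be as simple as unioning each term’s posting list from the small real-time index into the batch-built one (a sketch; production engines like Lucene merge immutable on-disk segments instead):

```java
import java.util.*;

public class IndexMerge {
    // Union each term's posting list from the delta index into the base index
    static void merge(Map<String, TreeSet<Long>> base,
                      Map<String, TreeSet<Long>> delta) {
        for (Map.Entry<String, TreeSet<Long>> e : delta.entrySet()) {
            base.computeIfAbsent(e.getKey(), k -> new TreeSet<>())
                .addAll(e.getValue());
        }
    }
}
```

Because posting lists are sorted sets, the union stays sorted for free, which keeps subsequent intersections efficient.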
At Oracle, we had an NSSF config search that ran SELECT * FROM configs WHERE description LIKE '%keyword%'. Full table scan across 50K+ rows. Built an inverted index table: tokenized config descriptions, stored (term, config_id) pairs with a composite index. Search dropped from 2+ seconds to under 50ms. The trade-off was extra write cost on config updates, but configs changed rarely. Reads were 100x more frequent.
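The table-based version might look something like this (a hypothetical sketch; the actual NSSF schema, table names, and column types aren’t shown in the post):

```sql
-- Hypothetical inverted-index table; composite PK doubles as the lookup index
CREATE TABLE config_terms (
    term      VARCHAR2(64) NOT NULL,
    config_id NUMBER       NOT NULL,
    PRIMARY KEY (term, config_id)
);

-- AND-search for two terms via a self-join instead of LIKE '%...%'
SELECT a.config_id
FROM config_terms a
JOIN config_terms b ON a.config_id = b.config_id
WHERE a.term = 'connection' AND b.term = 'pooling';
```

Each `WHERE term = ...` predicate becomes an index range scan on the composite key, which is what turns the full table scan into the sub-50ms lookup described above.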
What I’m Learning
Inverted indexes are one of those data structures that seem obvious in hindsight. Of course you’d pre-compute the reverse mapping. But it took me a while to understand why databases don’t just do this automatically for text columns (answer: the write amplification is massive, and most columns don’t need full-text search).
How do you handle text search in your systems? Full-text index or external search engine?