Social Graphs at Scale: Storing Relationships in MySQL
User A follows User B. Store it in a table. Done?
CREATE TABLE follows (
follower_id BIGINT NOT NULL,
followee_id BIGINT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (follower_id, followee_id)
);
Simple. Until you need to answer two different questions at scale.
Two Queries, One Table#
“Who does User A follow?” Easy. Primary key starts with follower_id. MySQL seeks directly.
SELECT followee_id FROM follows WHERE follower_id = 12345;
“Who follows User B?” Expensive. No index leads with followee_id. MySQL scans the table.
SELECT follower_id FROM follows WHERE followee_id = 67890;
-- Full table scan on a large table
The fix: a second index.
CREATE INDEX idx_followee ON follows (followee_id, follower_id);
Now both directions are indexed. The table is stored once, but queryable both ways. This is the bidirectional index trick for relationship data.
The Sharding Problem#
You shard the follows table by follower_id. “Who does User A follow?” hits one shard. Fast.
“Who follows User B?” now hits every shard, because B’s followers are spread across all of them. This is the secondary index problem applied to graphs.
Options: maintain a denormalized reverse table (followee_id as partition key), accept scatter queries for the less-frequent direction, or use a dedicated graph storage layer for traversal-heavy queries.
At Oracle, we had a service dependency graph (which services call which). Similar problem. Querying “who depends on Service X?” required scanning all records. We ended up maintaining both directions explicitly: a calls table and a called_by table, kept in sync via CDC. Doubled the storage, eliminated the scatter queries.
Beyond Direct Relationships#
“Show me mutual connections” requires joining the graph with itself. “Friends of friends” is two hops. These queries get expensive fast in relational databases.
-- Mutual connections between User A and User B
SELECT f1.followee_id
FROM follows f1
JOIN follows f2 ON f1.followee_id = f2.followee_id
WHERE f1.follower_id = :userA AND f2.follower_id = :userB;
Works for small datasets. At millions of relationships, this join explodes. Systems that need frequent multi-hop traversals often move to specialized graph databases or pre-compute the results into materialized views.
What I’m Learning#
Relationship data looks relational but behaves differently at scale. The same table needs to be queryable in two opposite directions, which fights against single-key sharding. The choice between maintaining two representations vs accepting scatter queries depends entirely on your read patterns.
If one direction is queried 100x more than the other, optimize for that. If both are equally common, you’re maintaining two copies of the data. There’s no free lunch.
How do you store relationship data? Single table with indexes, or denormalized in both directions?