Collaborative Filtering

You don’t know what a user wants. But you know what people like them have wanted. That’s the intuition behind collaborative filtering.

The Two Approaches

User-based CF finds users similar to you, then recommends what they liked. Item-based CF finds items similar to what you’ve already liked. Item-based is generally more stable because user behavior shifts rapidly (you might buy a couch once), while item similarity changes slowly (a couch is similar to other furniture regardless of who buys it).

Similarity is usually cosine similarity between vectors. Each user or item is represented as a vector of ratings or interactions. Cosine similarity measures the angle between vectors, not their magnitude. Two users who both rate the same items highly (even if one uses 5 stars and the other 4 stars) come out similar.
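To make the "angle, not magnitude" point concrete, here is a minimal sketch in plain Python. The three users and their ratings are made up for illustration, and treating an unrated item as 0 is a simplifying assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no interactions is treated as "not similar"
    return dot / (norm_a * norm_b)

# Hypothetical ratings for the same four items; 0 means "not rated".
alice = [5, 5, 0, 5]  # rates what she likes 5 stars
bob = [4, 4, 0, 4]    # same taste, but tops out at 4 stars
carol = [0, 0, 5, 0]  # only liked the item the other two skipped

print(cosine_similarity(alice, bob))    # very close to 1: same direction
print(cosine_similarity(alice, carol))  # 0.0: no overlap at all
```

Alice and Bob point in the same direction, so the difference in rating scale disappears. Carol shares no rated items with Alice, so the dot product is zero.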

The Sparsity Problem

The interaction matrix is enormous and mostly empty. Say you have 2 million users and 500,000 items, and most users have interacted with fewer than 100 items. The matrix is then about 99.99% zeros. Computing similarity across that directly is expensive and unreliable.
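The arithmetic behind those numbers is quick to check. This sketch takes 100 items per user as an upper bound, so it gives a floor on sparsity. The 4-byte sizes for values and indices are my assumption, only there to show the order of magnitude.

```python
n_users = 2_000_000
n_items = 500_000
max_items_per_user = 100  # upper bound taken from the text

total_cells = n_users * n_items             # every (user, item) pair
max_nonzero = n_users * max_items_per_user  # cells that can hold an interaction
max_density = max_nonzero / total_cells
min_sparsity = 1 - max_density

print(f"cells: {total_cells:,}")
print(f"non-zero cells (at most): {max_nonzero:,}")
print(f"sparsity (at least): {min_sparsity:.2%}")

# Rough storage comparison, assuming 4 bytes per value and per index.
dense_bytes = total_cells * 4               # store every cell
sparse_bytes = max_nonzero * (4 + 4 + 4)    # store (row, col, value) triples
print(f"dense: {dense_bytes / 1e12:.1f} TB, sparse: {sparse_bytes / 1e9:.1f} GB")
```

Even at the upper bound the matrix is at least 99.98% zeros, and most users sit well below 100 items, which pushes the real figure higher. The dense layout would need terabytes, while storing only the non-zero entries fits in a few gigabytes.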

Matrix factorization (SVD, ALS) compresses the matrix. It finds latent factors: not “this user likes thrillers” explicitly, but hidden dimensions that explain the pattern. The compressed representation makes similarity computation tractable.

graph TD
    A[User Interaction Matrix] --> B[Matrix Factorization - ALS/SVD]
    B --> C[User Embeddings]
    B --> D[Item Embeddings]
    C --> E[Find Similar Users]
    D --> F[Find Similar Items]
    E --> G[Recommend Items Those Users Liked]
    F --> H[Recommend Items Similar to What User Liked]
    style A fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style E fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style F fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style G fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style H fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
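Here is a toy version of that pipeline using NumPy's SVD. The 4x5 matrix and the choice of k = 2 are made up for illustration. Treating missing interactions as literal zeros is a simplification: production systems usually run ALS or a similar method that handles unobserved cells differently.

```python
import numpy as np

# Hypothetical 4 users x 5 items interaction matrix (0 = no interaction).
# Users 0 and 1 favor items 0-2; users 2 and 3 favor items 3-4.
R = np.array([
    [5, 4, 5, 0, 0],
    [4, 5, 4, 0, 0],
    [0, 0, 0, 5, 4],
    [0, 0, 1, 4, 5],
], dtype=float)

k = 2  # number of latent factors to keep (an assumption for this toy case)

# Decompose, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_emb = U[:, :k] * np.sqrt(s[:k])     # one k-dimensional row per user
item_emb = Vt[:k, :].T * np.sqrt(s[:k])  # one k-dimensional row per item

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity is computed on the small embeddings, not on raw matrix rows.
same_group = cosine(item_emb[0], item_emb[1])
cross_group = cosine(item_emb[0], item_emb[3])
print(same_group, cross_group)
```

Items 0 and 1 end up pointing in nearly the same direction, while items 0 and 3 do not. Nothing in the code names the two groups; the factors fall out of the interaction pattern.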

At Salesforce

We had a “related records” widget in the CRM that showed similar cases or opportunities. The first version used naive CF: find other users who viewed this record, see what else they viewed, surface that. Across 2 million records and thousands of concurrent users, the similarity computation at query time was too slow, around 4-5 seconds per widget load.

We precomputed item-item similarity offline on a nightly batch job and stored the top 10 similar items per record in a lookup table. Widget load time dropped to under 100ms. The recommendations were slightly stale (up to 24 hours old) but nobody noticed.
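The shape of that precompute-then-lookup pattern looks roughly like this. It is a sketch, not the job we actually ran: the function name, the record ids, and the use of cosine over binary "viewed" vectors are all assumptions for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations

def build_top_k_table(views_by_user, k=10):
    """Offline step: map each record to its k most similar records.

    Similarity is cosine over binary "user viewed record" vectors:
    shared_viewers(a, b) / sqrt(viewers(a) * viewers(b)).
    """
    view_count = defaultdict(int)  # record -> number of users who viewed it
    co_count = defaultdict(int)    # (record_a, record_b) -> shared viewers

    for records in views_by_user.values():
        unique = sorted(set(records))
        for r in unique:
            view_count[r] += 1
        for a, b in combinations(unique, 2):
            co_count[(a, b)] += 1

    neighbors = defaultdict(list)
    for (a, b), shared in co_count.items():
        sim = shared / math.sqrt(view_count[a] * view_count[b])
        neighbors[a].append((sim, b))
        neighbors[b].append((sim, a))

    # Keep the k best neighbors per record, highest similarity first.
    return {
        r: [other for _, other in sorted(cands, key=lambda t: (-t[0], t[1]))[:k]]
        for r, cands in neighbors.items()
    }

# Hypothetical view log: user id -> record ids viewed.
views = {
    "u1": ["case-1", "case-2", "opp-7"],
    "u2": ["case-1", "case-2"],
    "u3": ["case-2", "opp-7"],
    "u4": ["case-1", "case-2"],
}

table = build_top_k_table(views, k=2)  # run in the nightly batch
related = table.get("case-1", [])      # query time: one dictionary lookup
print(related)
```

All the expensive pairwise work happens in `build_top_k_table`. The widget only does a key lookup, which is why load time stops depending on how many records or users there are.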

What I’m Learning

Cold start is the part collaborative filtering can’t solve. A new user has no history, so you can’t find similar users. You fall back to popularity-based recommendations (what most people like) until you have enough signal.
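A minimal sketch of that fallback, in plain Python. The `min_history` threshold of 5 and the `cf_recommend` callable are placeholders I made up; the right cutoff depends on your data.

```python
from collections import Counter

def recommend(user_id, history_by_user, cf_recommend, n=3, min_history=5):
    """Use CF when the user has enough history, else fall back to popularity."""
    history = history_by_user.get(user_id, [])
    if len(history) >= min_history:
        return cf_recommend(user_id)[:n]

    # Popularity fallback: count distinct users per item, skip items already seen.
    counts = Counter(
        item for items in history_by_user.values() for item in set(items)
    )
    seen = set(history)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked if item not in seen][:n]

# Hypothetical data and a stand-in for the real CF model.
history = {
    "u1": ["a", "b", "c", "d", "e"],
    "u2": ["a", "b"],
    "u3": ["a", "c"],
}

def fake_cf(user_id):
    return ["x", "y", "z"]

print(recommend("new-user", history, fake_cf, n=2))  # no history: popularity
print(recommend("u2", history, fake_cf, n=2))        # thin history: popularity
print(recommend("u1", history, fake_cf))             # enough history: CF
```

The brand-new user gets the globally popular items. A user with thin history gets popular items minus what they have already seen, and only users past the threshold are routed to the CF model.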

Have you shipped a recommendation system in production, and how did you handle cold start?