You train a model using yesterday’s data. You serve it using today’s data. The feature computation logic is slightly different between the two. The model degrades silently and you spend a week figuring out why.

The Training-Serving Skew Problem#

ML models are trained on offline batches: historical data, features computed via Spark jobs, labels aggregated over time. At serving time, features are computed online: live data, lower latency budget, different code path. If those two paths diverge even slightly, you introduce skew. The model was trained on slightly different inputs than it receives at inference.
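To make the divergence concrete, here's a minimal sketch (all names and data are made up): two implementations of "purchases in the last 30 days," one batch-style over whole calendar days and one over a rolling window, returning different counts for the same user at the same moment.

```python
from datetime import datetime, timedelta

# Hypothetical purchase history for one user.
purchases = [
    datetime(2024, 5, 1, 23, 30),
    datetime(2024, 5, 15, 9, 0),
    datetime(2024, 6, 10, 14, 45),
]

def purchase_count_batch(purchase_times, as_of):
    # Offline path: nightly job that works in whole calendar days,
    # so "last 30 days" ends at yesterday's midnight.
    cutoff = as_of.date() - timedelta(days=30)
    return sum(1 for t in purchase_times if cutoff <= t.date() < as_of.date())

def purchase_count_online(purchase_times, as_of):
    # Online path: rolling 30 * 24h window ending right now.
    cutoff = as_of - timedelta(days=30)
    return sum(1 for t in purchase_times if cutoff <= t <= as_of)

now = datetime(2024, 6, 10, 15, 0)
print(purchase_count_batch(purchases, now))   # 1 -- misses today's purchase
print(purchase_count_online(purchases, now))  # 2 -- counts it
```

Neither implementation is wrong on its own; the problem is that the model sees one distribution during training and the other in production.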

Feature stores address this by centralizing the feature computation logic. You define features once. The same logic runs offline for training data and online for serving. The feature store is the single source of truth.
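A minimal sketch of the idea in plain Python, rather than any particular feature store's API (the helper names here are hypothetical): one feature function, called with the label's timestamp when building training rows and with "now" when serving.

```python
from datetime import datetime, timedelta

def purchase_count_30d(purchase_times, as_of):
    # The single feature definition both pipelines share.
    cutoff = as_of - timedelta(days=30)
    return sum(1 for t in purchase_times if cutoff <= t <= as_of)

def build_training_row(purchase_times, label_event_time):
    # Offline path: replay the feature as of the label's timestamp.
    return {"purchase_count_30d": purchase_count_30d(purchase_times, as_of=label_event_time)}

def build_serving_row(purchase_times):
    # Online path: the same function, evaluated as of "now".
    return {"purchase_count_30d": purchase_count_30d(purchase_times, as_of=datetime.now())}

history = [datetime(2024, 6, 1), datetime(2024, 6, 20)]
print(build_training_row(history, label_event_time=datetime(2024, 6, 10)))  # {'purchase_count_30d': 1}
print(build_serving_row(history))
```

In a real system the definition is registered with the feature store and executed by its batch and online runtimes, but the principle is the same: one definition, two execution contexts.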

Offline vs Online Features#

Not all features can be precomputed. “User’s purchase count in the last 30 days” is computed offline on a batch schedule. “Items currently in the user’s session” is computed online in real time. Feature stores handle both, routing requests to the right backend: a data warehouse for offline features, a low-latency store (Redis, DynamoDB) for online features.
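Here's a rough sketch of that routing, with an in-memory dict standing in for Redis and hypothetical names throughout: precomputed features are read from the online store, while session features are derived from the live request.

```python
class OnlineFeatureStore:
    def __init__(self):
        self._kv = {}  # stand-in for Redis/DynamoDB

    def put(self, entity_id, feature_name, value):
        self._kv[(entity_id, feature_name)] = value

    def get(self, entity_id, feature_name):
        return self._kv.get((entity_id, feature_name))

def serve_features(store, user_id, session_items):
    return {
        # Precomputed offline on a batch schedule, pushed into the online store.
        "purchase_count_30d": store.get(user_id, "purchase_count_30d"),
        # Computed online, in real time, from the live request.
        "session_item_count": len(session_items),
    }

store = OnlineFeatureStore()
store.put("user_42", "purchase_count_30d", 7)  # written by the nightly batch job
print(serve_features(store, "user_42", session_items=["sku_1", "sku_9"]))
```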

```mermaid
graph TD
    A[Feature Definition - written once] --> B[Offline Pipeline - Spark/batch]
    A --> C[Online Pipeline - real-time computation]
    B --> D[Training Dataset Store]
    C --> E[Online Feature Store - Redis]
    D --> F[Model Training]
    E --> G[Model Serving]
    F --> H[Same Features Used in Both]
    G --> H
    style A fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style E fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style F fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style G fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style H fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
```

Point-in-Time Correctness#

Training data has a subtle trap: feature leakage. You’re training on historical records. If you compute features using data that wasn’t available at the time of the label event (e.g., you use a user’s total purchase count, which includes purchases made after the event you’re predicting), your model learns from the future. It looks great in offline evaluation and falls apart in production.

Point-in-time correct feature retrieval means: given a label at timestamp T, retrieve only features computed from data available before T. Feature stores that support this are significantly more complex but produce trustworthy training sets.
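One way to get this behavior with pandas is an as-of join: for each label at time T, take the most recent feature row at or before T. This is a sketch with illustrative column names; pass `allow_exact_matches=False` if you need strictly-before-T semantics.

```python
import pandas as pd

features = pd.DataFrame({
    "user_id": ["u1", "u1", "u1"],
    "feature_timestamp": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "purchase_count": [3, 5, 9],
}).sort_values("feature_timestamp")

labels = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "label_timestamp": pd.to_datetime(["2024-01-15", "2024-02-20"]),
    "converted": [0, 1],
}).sort_values("label_timestamp")

training_set = pd.merge_asof(
    labels,
    features,
    left_on="label_timestamp",
    right_on="feature_timestamp",
    by="user_id",
    direction="backward",  # only look back in time, never forward
)
print(training_set[["label_timestamp", "purchase_count", "converted"]])
# The 2024-01-15 label gets purchase_count=3, not the later values 5 or 9.
```

Feature stores that support point-in-time retrieval are doing this same join at scale, across every feature and entity, which is where most of the added complexity comes from.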

At Salesforce#

We had an ML model for predicting which leads would convert. The offline evaluation showed 78% precision. In production it was around 58%. After three weeks of debugging, the issue was a feature: “number of emails sent to this lead.” Offline, we were computing it over all historical emails (including ones sent after conversion, for converted leads). Online, we only had emails up to the current moment. Leakage. Fixing the training pipeline to use only pre-event data brought production precision to 71%.

What I’m Learning#

Feature freshness matters almost as much as feature correctness. A “user activity score” updated daily is a different feature than the same score updated hourly, and the model should be trained on the same freshness it will receive at serving time.

Have you hit training-serving skew in production? How long before you found it?