Real-time results are fast and approximate. Historical results are slow and accurate. The tension between them is where the Lambda and Kappa architectures come from.

Lambda: Two Pipelines

Lambda runs two parallel systems. The batch layer processes all historical data on a schedule (Spark on HDFS, every few hours) and produces ground truth. The speed layer processes the live stream (Kafka Streams or Flink) for low-latency results. The serving layer merges both: “latest batch result plus stream delta since the last batch.”
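A minimal sketch of that merge in Java; the counting use case, the two stores, and the class itself are all invented for illustration:

```java
import java.util.Map;

// Hypothetical serving layer for a counting use case: the batch layer
// periodically writes a full snapshot, the speed layer keeps per-key
// deltas for events that arrived after that snapshot.
public final class ServingLayer {
    private final Map<String, Long> batchStore;   // key -> count as of the last batch run
    private final Map<String, Long> streamStore;  // key -> count since the last batch run

    public ServingLayer(Map<String, Long> batchStore, Map<String, Long> streamStore) {
        this.batchStore = batchStore;
        this.streamStore = streamStore;
    }

    // Query result = latest batch value plus the stream delta since that batch.
    public long query(String key) {
        return batchStore.getOrDefault(key, 0L) + streamStore.getOrDefault(key, 0L);
    }
}
```

The merge itself is trivial; the hard part is everything feeding it.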

The problem is that you write the same business logic twice. Once for Spark, once for Flink. They drift. Someone fixes a bug in one and forgets the other.

```mermaid
graph TD
    A[Kafka Event Stream] --> B[Speed Layer: Flink]
    A --> C[Batch Layer: Spark on HDFS]
    B --> D[Serving Layer]
    C --> D
    D --> E[Query Result]
    style A fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style E fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
```

Kappa: One Pipeline

Jay Kreps (Confluent) proposed eliminating the batch layer entirely. Treat the Kafka log as your source of truth. To reprocess history, replay from offset 0 on a new consumer group. Same code for real-time and historical. No divergence.
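A sketch of that replay using the plain Kafka Java client. The topic and group names are placeholders; the point is that a brand-new group ID has no committed offsets, so `auto.offset.reset=earliest` starts it from offset 0:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public final class ReplayFromZero {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // New group ID => no committed offsets exist for this group.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "reprocess-2024-06");
        // With no committed offsets, "earliest" begins at offset 0.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("pipeline-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // exact same logic the live consumer runs
                }
            }
        }
    }

    // Stand-in for the shared business logic.
    private static void process(String event) { /* ... */ }
}
```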

Kappa depends on retaining enough history in Kafka. That ties back to log compaction: if you’ve compacted or deleted old events, you can’t replay them.
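If full replay is the plan, the topic has to be configured for it up front. A sketch using Kafka's AdminClient, with the topic name and sizing invented: `retention.ms=-1` disables age-based deletion, and `cleanup.policy=delete` (rather than `compact`) keeps every event instead of only the latest per key.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Map;
import java.util.Set;

public final class CreateReplayableTopic {
    public static void main(String[] args) throws Exception {
        try (AdminClient admin = AdminClient.create(
                Map.<String, Object>of("bootstrap.servers", "localhost:9092"))) {
            NewTopic topic = new NewTopic("pipeline-events", 12, (short) 3)
                    .configs(Map.of(
                            // Never expire on age: the full history stays replayable.
                            "retention.ms", "-1",
                            // "delete" (the default), not "compact": keep every
                            // event, not just the latest value per key.
                            "cleanup.policy", "delete"));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```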

Lambda still makes sense when reprocessing the full history in a streaming system is too slow, or when your batch jobs use Spark SQL and your streaming equivalent doesn’t have the same expressive power.

At Salesforce

We had a dashboard showing real-time pipeline stats (last 5 minutes) alongside daily rollups. The original design was Lambda-ish: Spark ran at midnight for daily rollups, a lightweight poller updated the 5-minute view. Two pipelines, slightly different filtering logic.

Daily and real-time numbers differed by 1 to 3%, and we spent hours every quarter investigating discrepancies that were “by design.” Moving both to the same Flink job with configurable window sizes eliminated the divergence entirely. The reconciliation overhead effectively went to zero.
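A sketch of what that looks like in Flink's DataStream API. Everything here is invented for illustration (the event type, key, and aggregation; the Kafka source is stubbed out; exact window imports vary by Flink version), but the shape is the point: one code path, and the window size is just a parameter.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public final class PipelineStatsJob {
    public static void main(String[] args) throws Exception {
        // The only difference between the "real-time" deployment (5 minutes)
        // and the "daily rollup" deployment (1440 minutes) is this flag.
        long windowMinutes = ParameterTool.fromArgs(args).getLong("window-minutes", 5);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stub source; the real job would use a KafkaSource, and event time
        // with watermarks instead of processing time.
        DataStream<Tuple2<String, Long>> events =
                env.fromElements(Tuple2.of("ingest", 1L), Tuple2.of("ingest", 1L));

        events
                .keyBy(e -> e.f0) // key by pipeline stage
                .window(TumblingProcessingTimeWindows.of(Time.minutes(windowMinutes)))
                // One copy of the aggregation (and, crucially, the filtering)
                // logic, shared by both deployments.
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
                .print();

        env.execute("pipeline-stats-" + windowMinutes + "m");
    }
}
```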

The maintenance cost of two pipelines is easy to underestimate until you’re the one getting paged because they disagree.

What I’m Learning

Kappa sounds cleaner, but storage and replay costs for multi-year history get real quickly. At what point does the operational simplicity of Kappa stop being worth it?