CI passed. Staging tests passed. You’ve reviewed the code three times. Then you ship to production and something you never predicted breaks at scale.

What Canary Means#

A canary release sends a small fraction of real traffic to the new version before switching everyone over. 1% of users hit v2, 99% hit v1. You watch your metrics. If v2 behaves well, you expand: 5%, then 20%, then 100%. If metrics degrade, you route that 1% back to v1 and investigate while the other 99% stay unaffected.
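The whole loop fits in a few lines. Here's a minimal sketch; `set_canary_weight` and `metrics_look_healthy` are placeholders for whatever your load balancer and metrics backend actually expose, not a real API.

```python
# Hypothetical staged-rollout loop: expand while healthy, roll back on
# the first bad reading. The two callables are stand-ins for your infra.
import time

STAGES = [1, 5, 20, 100]  # percent of traffic on v2 at each step
SOAK_SECONDS = 30 * 60    # how long to watch each stage before expanding

def run_rollout(set_canary_weight, metrics_look_healthy) -> bool:
    for percent in STAGES:
        set_canary_weight(percent)   # e.g. update load balancer weights
        time.sleep(SOAK_SECONDS)     # let real traffic exercise v2
        if not metrics_look_healthy():
            set_canary_weight(0)     # route everyone back to v1
            return False             # stop and investigate
    return True                      # v2 now serves all traffic
```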

The name comes from the canaries coal miners carried underground: the bird succumbed to toxic gas before the miners did, which made it the early-warning system. A small group of users serves the same role in production.

What to Measure Before You Start#

Decide what “behaves well” means before the rollout begins: error rate, P99 latency, downstream call rates. Set your thresholds. If the error rate on v2 exceeds 0.5%, roll back automatically. If P99 latency exceeds 400ms, stop the rollout. Deciding upfront prevents rationalizing bad metrics in the moment.

```mermaid
graph TD
    A[Incoming Traffic] --> B[Load Balancer]
    B -->|99%| C[v1 - stable]
    B -->|1%| D[v2 - canary]
    D --> E[Metrics: error rate, P99]
    E --> F{Within thresholds?}
    F -->|Yes| G[Expand: 5%, 20%, 100%]
    F -->|No| H[Roll back canary to 0%]
    style A fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style E fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style F fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style G fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style H fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
```
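The check itself should be mechanical. A minimal sketch, assuming your metrics backend can report the canary's error rate and P99 latency; `CanaryMetrics` and its field names are illustrative, not a real client library.

```python
# Threshold check using the numbers above: 0.5% error rate, 400ms P99.
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    error_rate: float       # fraction of failed requests; 0.005 = 0.5%
    p99_latency_ms: float

MAX_ERROR_RATE = 0.005      # above this: automated rollback
MAX_P99_LATENCY_MS = 400.0  # above this: stop the rollout

def within_thresholds(m: CanaryMetrics) -> bool:
    """Pure function of the metrics: no judgment calls mid-rollout."""
    return (m.error_rate <= MAX_ERROR_RATE
            and m.p99_latency_ms <= MAX_P99_LATENCY_MS)
```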

Sticky Routing#

Users shouldn’t flip between v1 and v2 on successive requests. A user who sees the new checkout flow should keep seeing it. Hash the user ID to assign them to a version for the duration of the rollout. Same mechanism as feature flag bucketing.
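A minimal sketch of the bucketing, assuming string user IDs. SHA-256 rather than Python's built-in `hash()`, because `hash()` is salted per process and would reshuffle users on every restart.

```python
# Sticky assignment: hash each user ID into a stable bucket in [0, 100).
import hashlib

def bucket(user_id: str) -> int:
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def version_for(user_id: str, canary_percent: int) -> str:
    return "v2" if bucket(user_id) < canary_percent else "v1"
```

Because the bucket depends only on the ID, expanding from 5% to 20% adds new users to v2 without reassigning anyone who already sees it.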

At Oracle#

We rolled out a refactored notification dispatch service to our telecom platform. The first deploy went to 100% of traffic at once. A queue processing bug caused 50% of notifications to be dropped silently. We only found out when customers started calling. After that we introduced canary deploys: 5% of traffic to the new version, watch for 30 minutes, then proceed. A 50% failure rate would have shown up as a metric anomaly within the first few minutes.

What I’m Learning#

Canary rollouts only work if you have good metrics and actual alerting on them. If you can’t automatically detect when the canary is misbehaving, you’re just hoping someone notices. The automation around the rollout matters as much as the rollout strategy itself.

Have you automated canary rollback, or do you still rely on someone watching dashboards?