Service Discovery
Your service starts. It gets an IP. Three days later it restarts and gets a different IP. Every service that had the old IP hardcoded is now broken. This is why you need service discovery.
The Problem With Static Config
In a small system, hardcoding IPs in config files works. Then you move to containers. Containers restart, scale up, scale down. IPs change constantly. You need a way for services to find each other without knowing addresses in advance.
Service discovery solves this with a registry. Services register themselves on startup: name, IP, port, health endpoint. Other services query the registry by name, not by address. The registry returns current, healthy instances.
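To make that contract concrete, here's a minimal in-memory registry sketch in Go. The shape (Instance, Register, Lookup) is mine, not any real registry's API; production registries like Consul or Eureka add leases, replication, and access control on top of this idea.

```go
// A minimal in-memory registry sketch. Names are illustrative,
// not any real registry's API.
package registry

import "sync"

// Instance is what a service reports about itself on startup.
type Instance struct {
	Name   string // logical service name, e.g. "orders"
	IP     string
	Port   int
	Health string // health-check path, e.g. "/healthz"
}

type Registry struct {
	mu        sync.RWMutex
	instances map[string][]Instance // keyed by service name
}

func New() *Registry {
	return &Registry{instances: make(map[string][]Instance)}
}

// Register is called by each service when it starts.
func (r *Registry) Register(in Instance) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.instances[in.Name] = append(r.instances[in.Name], in)
}

// Lookup is how callers find peers: by name, never by address.
func (r *Registry) Lookup(name string) []Instance {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.instances[name]
}
```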
Client-Side vs Server-Side
In client-side discovery, the calling service queries the registry and picks an instance itself. Netflix Eureka works this way. The client gets a list of healthy instances and load-balances across them. Downside: every client needs discovery logic, in every language.
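Building on the registry sketch above, here's roughly what that client logic looks like; Pick is a hypothetical stand-in for what Eureka-style client libraries do, not Eureka's actual API:

```go
import (
	"fmt"
	"sync/atomic"
)

// Client-side discovery sketch: the caller fetches the instance
// list and load-balances locally.
type Client struct {
	registry *Registry
	next     uint64 // round-robin counter
}

func (c *Client) Pick(service string) (Instance, error) {
	instances := c.registry.Lookup(service)
	if len(instances) == 0 {
		return Instance{}, fmt.Errorf("no healthy instances for %q", service)
	}
	// Simple round-robin. Real clients cache the list and refresh it
	// periodically instead of hitting the registry on every call.
	i := atomic.AddUint64(&c.next, 1)
	return instances[i%uint64(len(instances))], nil
}
```

This is the logic every client has to carry, in every language, which is exactly the downside above.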
In server-side discovery, the client calls a load balancer, which queries the registry internally. The client doesn’t know about discovery at all. AWS ALB with ECS service discovery works this way. Simpler clients, but the load balancer becomes a critical dependency.
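From the caller's side, server-side discovery collapses to an ordinary HTTP call against a stable hostname. A sketch, where "orders.lb.internal" is a made-up load-balancer name:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// The load balancer does the registry lookup on our behalf;
	// this code never sees an instance IP.
	resp, err := http.Get("http://orders.lb.internal/api/orders/42")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```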
Health Checks Matter More Than Registration
Registration is half the problem. Deregistration matters more. If a service crashes without deregistering, the registry still shows it as healthy. Callers route traffic to a dead instance. The registry needs active health checking: poll each registered instance’s health endpoint on an interval, remove ones that stop responding.
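A sketch of that polling loop, again continuing the registry types above. A production checker would poll instances concurrently, avoid holding the lock across network calls, and require several consecutive failures before evicting rather than one:

```go
import (
	"fmt"
	"net/http"
	"time"
)

// healthCheckLoop polls every registered instance's health endpoint
// on an interval and evicts ones that stop responding.
func (r *Registry) healthCheckLoop(interval time.Duration) {
	client := &http.Client{Timeout: 2 * time.Second}
	for range time.Tick(interval) {
		r.mu.Lock()
		for name, instances := range r.instances {
			healthy := instances[:0] // filter in place
			for _, in := range instances {
				url := fmt.Sprintf("http://%s:%d%s", in.IP, in.Port, in.Health)
				resp, err := client.Get(url)
				if err == nil && resp.StatusCode == http.StatusOK {
					healthy = append(healthy, in)
				}
				if resp != nil {
					resp.Body.Close()
				}
			}
			r.instances[name] = healthy // dead instances fall out
		}
		r.mu.Unlock()
	}
}
```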
At Oracle
We had a microservices setup where services exchanged IPs via config files. During a planned maintenance window in which we rotated 8 pods simultaneously, we spent 40 minutes tracking down which services had stale IPs for which other services. Moving to Kubernetes-native service discovery (DNS-based, where each service gets a stable DNS name regardless of pod IP) eliminated this class of problem entirely.
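In Go terms, that stability looks like the sketch below; "orders.prod.svc.cluster.local" is a hypothetical service in the standard Kubernetes service.namespace.svc.cluster.local scheme:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// The service name stays stable across pod restarts; cluster DNS
	// maps it to whatever the current backing IP is.
	addrs, err := net.LookupHost("orders.prod.svc.cluster.local")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved:", addrs)
}
```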
What I’m Learning
DNS-based discovery in Kubernetes feels invisible until it breaks. The TTL on DNS records means after a pod dies, callers may still route to the old IP for a few seconds. Combining DNS discovery with client-side retries covers the gap.
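A sketch of that retry layer, with illustrative names and timings; note this only suits idempotent requests like GET:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// getWithRetry retries a GET with a short linear backoff so a request
// that hits a just-dead IP can succeed once DNS catches up. A dead
// pod's stale entry usually clears within seconds, so short waits
// are enough to bridge the TTL gap.
func getWithRetry(url string, attempts int) (*http.Response, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * 500 * time.Millisecond)
	}
	return nil, fmt.Errorf("after %d attempts: %w", attempts, lastErr)
}

func main() {
	resp, err := getWithRetry("http://orders.prod.svc.cluster.local:8080/healthz", 3)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```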
What discovery mechanism are you running, and have you hit stale registration issues?