Service discovery
In a static world you put the address of a dependency in a config file. In a dynamic one — autoscaling groups, containers that restart on new hosts, rolling deploys that cycle every instance — addresses change constantly, and a hardcoded endpoint is a future outage. Service discovery is the layer that answers "where are the live instances of service X right now?". It sounds simple. The hard parts are keeping the answer fresh, deciding who is healthy, and making sure a stale answer fails safe rather than sending traffic to a dead host.
The registry
At the centre of every discovery system is a registry: a store mapping a logical service name to the set of current network locations behind it. Instances register themselves on startup (self-registration) or are registered by the platform (third-party registration, e.g. Kubernetes registering pods). Consumers look up the name and get back the live set.
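As a rough illustration of the name-to-locations mapping described above, here is a minimal in-memory sketch. The `Registry` class, its method names, and the addresses are hypothetical; a real registry is replicated and persistent, which this toy version is not.

```python
from collections import defaultdict


class Registry:
    """Toy in-memory registry: service name -> set of (host, port) locations."""

    def __init__(self):
        self._instances = defaultdict(set)

    def register(self, service, host, port):
        # Called by the instance itself (self-registration) or by the platform
        # on its behalf (third-party registration).
        self._instances[service].add((host, port))

    def deregister(self, service, host, port):
        self._instances[service].discard((host, port))

    def lookup(self, service):
        # Return a sorted copy so callers cannot mutate registry state.
        return sorted(self._instances.get(service, ()))


registry = Registry()
registry.register("orders", "10.0.0.5", 8080)
registry.register("orders", "10.0.0.6", 8080)
registry.deregister("orders", "10.0.0.5", 8080)
print(registry.lookup("orders"))  # [('10.0.0.6', 8080)]
```

Consumers only ever see the result of `lookup`, so the same interface works whether entries are written by the instance or by the platform.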
The registry is itself a distributed system, and a critical one: if it is wrong or unavailable, nothing can find anything. That is why production registries run on a replicated store, in most cases a strongly consistent coordination kernel:
| Registry | Replication protocol | Consistency |
|---|---|---|
| etcd | Raft | Linearizable; backs Kubernetes |
| Consul | Raft | Strongly consistent, with health checking built in |
| ZooKeeper | Zab | Strongly consistent; ephemeral znodes for liveness |
| Eureka | Peer-to-peer replication | AP; deliberately prefers availability over consistency |
Client-side vs server-side discovery
Once the registry knows who is alive, something has to use that information to route a request. Two patterns:
- Client-side discovery. The caller queries the registry, gets the list of instances, and load-balances across them itself. No extra network hop, but every client needs discovery logic (a library, or a sidecar like Envoy). Netflix Ribbon + Eureka is the classic example.
- Server-side discovery. The caller sends to a stable virtual address (a load balancer or proxy), which consults the registry and forwards. Clients stay dumb, but you add a hop and a component to keep highly available. AWS ALB and Kubernetes Service (via kube-proxy) work this way.
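The client-side pattern can be sketched in a few lines, assuming a simple round-robin policy. `RoundRobinClient` and `static_lookup` are hypothetical names, and the lookup function stands in for a real registry query.

```python
import itertools


class RoundRobinClient:
    """Client-side discovery: look up instances, then balance locally."""

    def __init__(self, lookup):
        # `lookup` is any callable returning the current instance list for a
        # service name; in a real system it would query the registry.
        self._lookup = lookup
        self._counter = itertools.count()

    def pick(self, service):
        instances = self._lookup(service)
        if not instances:
            raise LookupError(f"no live instances of {service!r}")
        # Re-read the list on every call so membership changes are picked up.
        return instances[next(self._counter) % len(instances)]


def static_lookup(service):
    # Stand-in for a registry query (hypothetical addresses).
    return {"orders": ["10.0.0.5:8080", "10.0.0.6:8080"]}.get(service, [])


client = RoundRobinClient(static_lookup)
picks = [client.pick("orders") for _ in range(4)]
print(picks)
```

In server-side discovery the same selection logic exists, but it runs inside the load balancer or proxy instead of in every caller.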
Service meshes (Istio, Linkerd) blur the line: a per-pod sidecar proxy does server-side-style routing, but it lives next to the client and is fed by the mesh control plane, getting most of the latency benefit of client-side discovery without putting logic in application code.
Health checks and liveness
A registry entry is only useful if it reflects reality. An instance that crashed must leave the set quickly, or callers keep routing to a black hole. Three common mechanisms:
- TTL / heartbeat. The instance must renew its registration before a time-to-live expires; miss the renewal and it is dropped. ZooKeeper's ephemeral nodes are the elegant version: the entry is deleted automatically when the owning session ends, which after a crash means once the session timeout expires.
- Active health checks. The registry (or a load balancer) periodically probes an endpoint: `GET /healthz`, a TCP connect, or a script. Consul and most load balancers do this.
- Passive / outlier detection. The data-plane proxy watches real traffic and ejects an instance that starts returning errors, without a dedicated probe. Envoy's outlier detection is an example.
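The TTL / heartbeat mechanism might look like the following sketch. `TtlRegistry` is a hypothetical class, and the clock is injected (an assumption made here) so that expiry can be shown deterministically instead of by sleeping.

```python
class TtlRegistry:
    """Entries expire unless renewed within `ttl` seconds."""

    def __init__(self, ttl, clock):
        self._ttl = ttl
        self._clock = clock          # callable returning the current time in seconds
        self._expiry = {}            # (service, address) -> expiry timestamp

    def heartbeat(self, service, address):
        # Registering and renewing are the same operation: push the expiry out.
        self._expiry[(service, address)] = self._clock() + self._ttl

    def lookup(self, service):
        now = self._clock()
        # Drop every entry whose lease has lapsed before answering.
        self._expiry = {k: t for k, t in self._expiry.items() if t > now}
        return sorted(addr for (svc, addr) in self._expiry if svc == service)


fake_now = [0.0]                     # fake clock so the example is deterministic
reg = TtlRegistry(ttl=10, clock=lambda: fake_now[0])
reg.heartbeat("orders", "10.0.0.5:8080")
reg.heartbeat("orders", "10.0.0.6:8080")

fake_now[0] = 8.0
reg.heartbeat("orders", "10.0.0.5:8080")   # only 10.0.0.5 renews its lease

fake_now[0] = 12.0
live = reg.lookup("orders")
print(live)                          # ['10.0.0.5:8080']
```

The TTL sets the worst-case staleness: an instance that dies right after a heartbeat stays visible for up to `ttl` seconds.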
Tuning matters. Aggressive checks evict healthy-but-slow instances and amplify a brownout into an outage; lax checks leave dead instances in rotation too long. Distinguish liveness (is the process up?) from readiness (can it serve traffic right now?) — Kubernetes splits these for exactly this reason.
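One common way to keep checks from being too aggressive is to require several consecutive results before changing an instance's state. The sketch below assumes that approach; `HealthTracker` and its thresholds are hypothetical and not taken from any particular registry.

```python
class HealthTracker:
    """Flip state only after N consecutive probe results disagree with it."""

    def __init__(self, unhealthy_after=3, healthy_after=2):
        self.healthy = True
        self._unhealthy_after = unhealthy_after
        self._healthy_after = healthy_after
        self._streak = 0             # consecutive results contradicting current state

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self._streak = 0         # result agrees with current state; reset
            return self.healthy
        self._streak += 1
        needed = self._unhealthy_after if self.healthy else self._healthy_after
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
# A single failed probe does not evict; three failures in a row do.
probes = [True, False, True, False, False, False, True, True]
states = [tracker.record(ok) for ok in probes]
print(states)
```

Raising `unhealthy_after` tolerates more slow probes at the cost of leaving a dead instance in rotation for longer, which is the trade-off described above.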
DNS vs a purpose-built control plane
DNS is the oldest service-discovery mechanism and still widely used: a name resolves to a set of A records. It is universal and needs no client library. Its weakness is freshness — DNS caching and TTLs mean clients can hold a dead address for seconds to minutes, and many clients ignore TTLs or cache forever. SRV records add port and weight but do not fix staleness.
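The staleness window can be made concrete with a simulated resolver cache. This is a sketch, not real DNS: `CachingResolver`, the zone data, and the injected clock are all hypothetical stand-ins.

```python
class CachingResolver:
    """Client-side cache that reuses an answer until its TTL expires."""

    def __init__(self, authoritative, ttl, clock):
        self._authoritative = authoritative   # callable: name -> list of addresses
        self._ttl = ttl
        self._clock = clock
        self._cache = {}                      # name -> (expiry, addresses)

    def resolve(self, name):
        entry = self._cache.get(name)
        if entry and entry[0] > self._clock():
            return entry[1]                   # cached, possibly stale
        addresses = list(self._authoritative(name))
        self._cache[name] = (self._clock() + self._ttl, addresses)
        return addresses


records = {"orders.internal": ["10.0.0.5", "10.0.0.6"]}   # hypothetical zone data
dns_now = [0.0]
resolver = CachingResolver(lambda n: records[n], ttl=30, clock=lambda: dns_now[0])

first = resolver.resolve("orders.internal")
records["orders.internal"] = ["10.0.0.6"]     # 10.0.0.5 dies and is withdrawn
dns_now[0] = 10.0
stale = resolver.resolve("orders.internal")   # still includes the dead address
dns_now[0] = 31.0
fresh = resolver.resolve("orders.internal")   # TTL expired, cache refreshed
print(stale, fresh)
```

A client that ignores the TTL behaves like this resolver with an unbounded `ttl`: it never sees the withdrawal at all.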
Purpose-built control planes (Kubernetes' API server feeding kube-proxy/CoreDNS, a service mesh control plane, Consul's catalog) push updates to data planes as soon as membership changes, typically well inside a DNS TTL, and carry richer metadata: version, zone, weight, health. The trade-off is operational complexity: you are now running a control plane that must itself be reliable.
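The push model can be sketched as a registry that notifies subscribers on every membership change. `WatchableRegistry` is a hypothetical, single-process stand-in; real control planes stream updates over the network, which this does not attempt to model.

```python
class WatchableRegistry:
    """Registry that pushes the new instance set to subscribers on every change."""

    def __init__(self):
        self._instances = {}         # service -> set of addresses
        self._watchers = {}          # service -> list of callbacks

    def watch(self, service, callback):
        self._watchers.setdefault(service, []).append(callback)
        callback(self._snapshot(service))    # deliver current state immediately

    def set_instance(self, service, address, present):
        members = self._instances.setdefault(service, set())
        before = set(members)
        (members.add if present else members.discard)(address)
        if members != before:        # notify only on real changes
            for callback in self._watchers.get(service, []):
                callback(self._snapshot(service))

    def _snapshot(self, service):
        return sorted(self._instances.get(service, ()))


updates = []
control_plane = WatchableRegistry()
control_plane.watch("orders", updates.append)
control_plane.set_instance("orders", "10.0.0.5:8080", present=True)
control_plane.set_instance("orders", "10.0.0.6:8080", present=True)
control_plane.set_instance("orders", "10.0.0.5:8080", present=False)
print(updates)
```

Because the subscriber is told about each change, its view lags by delivery time only, not by a cache lifetime.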
Common misunderstandings
- "The registry is always right." It is eventually right. There is always a window between an instance dying and its entry being removed. Callers need timeouts, retries against other instances, and circuit breakers to survive that window.
- "Use the strongest-consistency registry." A CP registry that refuses to serve during a partition can take your whole platform down with it. Eureka chose AP deliberately so callers keep getting a (stale) list. Match the registry's failure mode to your tolerance.
- "DNS is good enough." Sometimes it is. But DNS caching makes fast failover unreliable; if you need sub-second instance churn, you need a control plane that pushes, not a TTL that clients may ignore.
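To survive the window in which the registry still lists a dead instance, a caller can retry against the next instance. The sketch below shows only that retry step, under the assumption that failures surface as `ConnectionError` or `TimeoutError`; `call_with_failover` and `fake_send` are hypothetical, and a circuit breaker is left out for brevity.

```python
def call_with_failover(instances, send, max_attempts=3):
    """Try instances in order, moving to the next one when a call fails."""
    last_error = None
    for address in instances[:max_attempts]:
        try:
            return send(address)
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc         # entry was stale or the instance is sick
    raise RuntimeError("all attempted instances failed") from last_error


def fake_send(address):
    # Stand-in for a real request with a timeout; 10.0.0.5 is a dead instance
    # the registry has not yet removed.
    if address == "10.0.0.5:8080":
        raise ConnectionError("connection refused")
    return f"200 OK from {address}"


result = call_with_failover(["10.0.0.5:8080", "10.0.0.6:8080"], fake_send)
print(result)                        # 200 OK from 10.0.0.6:8080
```

Retrying is only safe for idempotent requests, and `max_attempts` bounds the extra load that retries add during an incident.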
Further reading
- Richardson — Service registry and discovery patterns — the canonical write-up of client-side vs server-side discovery.
- HashiCorp — Consul service discovery — a registry with integrated health checking, documented end to end.
- Kubernetes — Services and discovery — server-side discovery via the API server, kube-proxy, and CoreDNS.
- Hunt et al. — ZooKeeper — the coordination kernel that ephemeral-node discovery is built on.