
Service discovery

In a static world you put the address of a dependency in a config file. In a dynamic one — autoscaling groups, containers that restart on new hosts, rolling deploys that cycle every instance — addresses change constantly, and a hardcoded endpoint is a future outage. Service discovery is the layer that answers "where are the live instances of service X right now?". It sounds simple. The hard parts are keeping the answer fresh, deciding who is healthy, and making sure a stale answer fails safe rather than sending traffic to a dead host.


The registry

At the centre of every discovery system is a registry: a store mapping a logical service name to the set of current network locations behind it. Instances register themselves on startup (self-registration) or are registered by the platform (third-party registration, e.g. Kubernetes registering pods). Consumers look up the name and get back the live set.
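
As a rough illustration of that mapping, the sketch below models a registry as an in-memory dictionary from service name to a set of addresses. The Registry class, the "orders" service name, and the addresses are hypothetical and chosen only for this example; a real registry would be replicated and would also track health.

```python
from collections import defaultdict


class Registry:
    """Toy in-memory registry: service name -> set of "host:port" strings."""

    def __init__(self):
        self._instances = defaultdict(set)

    def register(self, service, address):
        self._instances[service].add(address)

    def deregister(self, service, address):
        # discard() is a no-op if the address was never registered.
        self._instances[service].discard(address)

    def lookup(self, service):
        # Return a sorted copy so callers cannot mutate registry state.
        return sorted(self._instances.get(service, ()))


registry = Registry()
registry.register("orders", "10.0.0.5:8080")      # hypothetical addresses
registry.register("orders", "10.0.0.6:8080")
registry.deregister("orders", "10.0.0.5:8080")
print(registry.lookup("orders"))  # ['10.0.0.6:8080']
```

Whether register is called by the instance itself or by the platform is the only difference between self-registration and third-party registration in this model.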

The registry is itself a distributed system, and a critical one: if it is wrong or unavailable, nothing can find anything. That is why production registries run on a replicated store, most often one built on a consensus protocol:

Registry     Backing store     Consistency
etcd         Raft              Linearizable; backs Kubernetes
Consul       Raft              Strongly consistent, with health checking built in
ZooKeeper    Zab               Strongly consistent; ephemeral znodes for liveness
Eureka       Replicated, AP    Prefers availability over consistency on purpose

The CAP choice shows up here too. etcd, Consul, and ZooKeeper are CP: under a partition, the side that loses quorum refuses writes rather than serve a possibly wrong registry. Eureka is AP: it keeps serving a possibly stale list, betting that a slightly outdated set of endpoints beats no endpoints at all. Neither is wrong; they suit different failure preferences.
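
The difference can be reduced to a toy model, assuming a consensus group that needs a strict majority of members to commit. The function names below are hypothetical and do not correspond to any registry's actual API.

```python
def has_quorum(reachable_members, cluster_size):
    """A consensus group can commit only with a strict majority of members."""
    return reachable_members > cluster_size // 2


def cp_write(reachable_members, cluster_size):
    # CP behaviour (illustrative): reject the write rather than risk divergence.
    if not has_quorum(reachable_members, cluster_size):
        raise RuntimeError("no quorum: write rejected")
    return "committed"


def ap_read(local_copy):
    # AP behaviour (illustrative): serve whatever this node last knew, stale or not.
    return list(local_copy)


print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False
print(ap_read(["10.0.0.5:8080"]))  # ['10.0.0.5:8080']
```

In a five-node cluster split three against two, the three-node side keeps accepting writes and the two-node side does not, which is the "refuse rather than guess" behaviour described above.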

Client-side vs server-side discovery

Once the registry knows who is alive, something has to use that information to route a request. Two patterns:

  • Client-side discovery. The caller queries the registry, gets the list of instances, and load-balances across them itself. No extra network hop, but every client needs discovery logic (a library, or a sidecar like Envoy). Netflix Ribbon + Eureka is the classic example.
  • Server-side discovery. The caller sends to a stable virtual address — a load balancer or proxy — which consults the registry and forwards. Clients stay dumb, but you add a hop and a component to keep highly available. AWS ALB and Kubernetes Service (via kube-proxy) work this way.
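
A minimal sketch of the client-side pattern, assuming the caller has some function that returns the current instance list. RoundRobinClient and fake_lookup are hypothetical names, and the addresses are made up; a real client library would also cache the list and handle failures.

```python
import itertools


class RoundRobinClient:
    """Hypothetical client-side balancer: look up, then pick in rotation."""

    def __init__(self, lookup):
        self._lookup = lookup            # callable: service name -> list of addresses
        self._counter = itertools.count()

    def pick(self, service):
        # Query on every call so the choice reflects the current live set.
        instances = self._lookup(service)
        if not instances:
            raise LookupError(f"no live instances for {service!r}")
        return instances[next(self._counter) % len(instances)]


def fake_lookup(service):
    # Stand-in for a registry query; the addresses are made up.
    return ["10.0.0.5:8080", "10.0.0.6:8080"]


client = RoundRobinClient(fake_lookup)
picks = [client.pick("orders") for _ in range(3)]
print(picks)  # ['10.0.0.5:8080', '10.0.0.6:8080', '10.0.0.5:8080']
```

In the server-side pattern the same lookup-and-pick logic still exists, but it runs inside the load balancer or proxy instead of in each caller.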

Service meshes (Istio, Linkerd) blur the line: a per-pod sidecar proxy does server-side-style routing, but it lives next to the client and is fed by the mesh control plane, getting most of the latency benefit of client-side discovery without putting logic in application code.

Health checks and liveness

A registry entry is only useful if it reflects reality. An instance that crashed must leave the set quickly, or callers keep routing to a black hole. Three common mechanisms:

  • TTL / heartbeat. The instance must renew its registration before a time-to-live expires; miss the renewal and it is dropped. ZooKeeper's ephemeral nodes are the elegant version: the entry is deleted automatically when the client's session closes or times out, with no explicit deregistration step.
  • Active health checks. The registry (or a load balancer) periodically probes an endpoint — GET /healthz, a TCP connect, a script. Consul and most load balancers do this.
  • Passive / outlier detection. The data-plane proxy watches real traffic and ejects an instance that starts returning errors, without a dedicated probe. Envoy's outlier detection.
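
The TTL / heartbeat mechanism is the easiest of the three to sketch. The example below assumes a registry that stores an expiry time per entry and filters expired entries at lookup time; TtlRegistry is a hypothetical class, and the clock is injected so the example can advance time by hand.

```python
class TtlRegistry:
    """Hypothetical registry where entries expire unless renewed."""

    def __init__(self, ttl_seconds, clock):
        self._ttl = ttl_seconds
        self._clock = clock              # callable returning the current time in seconds
        self._expires_at = {}            # (service, address) -> expiry time

    def heartbeat(self, service, address):
        # Registering and renewing are the same operation: push the expiry forward.
        self._expires_at[(service, address)] = self._clock() + self._ttl

    def lookup(self, service):
        now = self._clock()
        return sorted(
            address
            for (name, address), expiry in self._expires_at.items()
            if name == service and expiry > now
        )


ttl_now = [0.0]                          # fake clock, advanced by hand below
ttl_registry = TtlRegistry(ttl_seconds=10, clock=lambda: ttl_now[0])
ttl_registry.heartbeat("orders", "10.0.0.5:8080")
ttl_registry.heartbeat("orders", "10.0.0.6:8080")

ttl_now[0] = 6.0
ttl_registry.heartbeat("orders", "10.0.0.5:8080")   # only this instance renews

ttl_now[0] = 11.0
print(ttl_registry.lookup("orders"))     # ['10.0.0.5:8080']
```

The instance that stopped renewing drops out once its TTL passes, so the TTL is also the upper bound on how long a dead entry can linger.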

Tuning matters. Aggressive checks evict healthy-but-slow instances and amplify a brownout into an outage; lax checks leave dead instances in rotation too long. Distinguish liveness (is the process up?) from readiness (can it serve traffic right now?) — Kubernetes splits these for exactly this reason.
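
One common way to soften that trade-off is to require several consecutive failed probes before marking an instance unhealthy, so a single slow response does not evict it. The sketch below assumes that policy; ConsecutiveFailureTracker and its default threshold are hypothetical, not taken from any particular registry.

```python
class ConsecutiveFailureTracker:
    """Marks an instance unhealthy only after N failed probes in a row."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self._failures = 0
        self.healthy = True

    def record(self, probe_ok):
        if probe_ok:
            # Any success resets the streak and restores the instance.
            self._failures = 0
            self.healthy = True
        else:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


tracker = ConsecutiveFailureTracker(failure_threshold=3)
history = [tracker.record(ok) for ok in (False, False, True, False, False, False)]
print(history)  # [True, True, True, True, True, False]
```

A lower threshold evicts faster but is more likely to eject an instance that was only briefly slow; a higher one does the opposite.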

DNS vs a purpose-built control plane

DNS is the oldest service-discovery mechanism and still widely used: a name resolves to a set of A records. It is universal and needs no client library. Its weakness is freshness — DNS caching and TTLs mean clients can hold a dead address for seconds to minutes, and many clients ignore TTLs or cache forever. SRV records add port and weight but do not fix staleness.
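
The staleness problem can be shown with a toy caching resolver. This is a sketch under stated assumptions: CachingResolver, the "orders.internal" name, and the addresses are hypothetical, and the upstream resolver is a plain dictionary rather than real DNS.

```python
class CachingResolver:
    """Hypothetical client-side cache that honours a fixed TTL."""

    def __init__(self, resolve, ttl_seconds, clock):
        self._resolve = resolve          # callable: name -> list of addresses
        self._ttl = ttl_seconds
        self._clock = clock              # callable returning the current time in seconds
        self._cache = {}                 # name -> (expires_at, addresses)

    def lookup(self, name):
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and cached[0] > now:
            return cached[1]             # possibly stale, but still within TTL
        addresses = self._resolve(name)
        self._cache[name] = (now + self._ttl, addresses)
        return addresses


records = {"orders.internal": ["10.0.0.5"]}       # stand-in for the DNS zone
dns_now = [0.0]                                   # fake clock
resolver = CachingResolver(lambda name: list(records[name]), 30, lambda: dns_now[0])

first = resolver.lookup("orders.internal")
records["orders.internal"] = ["10.0.0.9"]         # the instance moves
dns_now[0] = 10.0
stale = resolver.lookup("orders.internal")
dns_now[0] = 31.0
fresh = resolver.lookup("orders.internal")
print(first, stale, fresh)  # ['10.0.0.5'] ['10.0.0.5'] ['10.0.0.9']
```

Between the change and the TTL expiry the caller keeps receiving the old address, and a client that ignores TTLs never reaches the third lookup's result at all.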

Purpose-built control planes (the Kubernetes API server feeding kube-proxy and CoreDNS, a service mesh control plane, Consul's catalog) push updates to data planes as they happen, typically within milliseconds to a few seconds, and carry richer metadata: version, zone, weight, health. The trade-off is operational complexity: you are now running a control plane that must itself be reliable.
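
The push model can be sketched as a registry that calls back its watchers whenever the instance set changes, rather than waiting for them to poll. WatchableRegistry is a hypothetical in-process stand-in; real control planes stream these updates over the network.

```python
class WatchableRegistry:
    """Hypothetical registry that pushes the new instance set on every change."""

    def __init__(self):
        self._instances = {}             # service -> set of addresses
        self._watchers = {}              # service -> list of callbacks

    def lookup(self, service):
        return sorted(self._instances.get(service, set()))

    def watch(self, service, callback):
        self._watchers.setdefault(service, []).append(callback)
        callback(self.lookup(service))   # deliver an initial snapshot

    def register(self, service, address):
        self._instances.setdefault(service, set()).add(address)
        self._notify(service)

    def deregister(self, service, address):
        self._instances.get(service, set()).discard(address)
        self._notify(service)

    def _notify(self, service):
        snapshot = self.lookup(service)
        for callback in self._watchers.get(service, []):
            callback(snapshot)


seen = []
watchable = WatchableRegistry()
watchable.watch("orders", seen.append)
watchable.register("orders", "10.0.0.5:8080")     # hypothetical address
watchable.deregister("orders", "10.0.0.5:8080")
print(seen)  # [[], ['10.0.0.5:8080'], []]
```

The watcher learns about the removal as part of the deregister call itself, with no cache lifetime in between.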

Common misunderstandings

  • "The registry is always right." It is eventually right. There is always a window between an instance dying and its entry being removed. Callers need timeouts, retries against other instances, and circuit breakers to survive that window.
  • "Use the strongest-consistency registry." A CP registry that refuses to serve during a partition can take your whole platform down with it. Eureka chose AP deliberately so callers keep getting a (stale) list. Match the registry's failure mode to your tolerance.
  • "DNS is good enough." Sometimes it is. But DNS caching makes fast failover unreliable; if you need to react to instance churn in under a second, you need a control plane that pushes updates, not a TTL that clients may ignore.
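
For the first point, surviving the window in which the registry still lists a dead instance, a caller can retry against other instances. The sketch below assumes failures surface as ConnectionError; call_with_failover and fake_send are hypothetical names, and a production client would add timeouts and a circuit breaker on top.

```python
def call_with_failover(instances, send, max_attempts=3):
    """Try up to max_attempts distinct instances, returning the first success."""
    last_error = None
    for address in instances[:max_attempts]:
        try:
            return send(address)
        except ConnectionError as exc:
            last_error = exc             # likely a stale entry; try the next one
    raise ConnectionError("all attempted instances failed") from last_error


def fake_send(address):
    # Stand-in for a real request with a timeout; the first address is "dead".
    if address == "10.0.0.5:8080":
        raise ConnectionError(f"{address} unreachable")
    return f"ok from {address}"


result = call_with_failover(["10.0.0.5:8080", "10.0.0.6:8080"], fake_send)
print(result)  # ok from 10.0.0.6:8080
```

Retrying on a different instance, rather than the same one, is what makes the stale entry harmless.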
