IV · Distributed systems

Site Reliability Engineering

What it does

Free to read online from Google. Covers SLOs, error budgets, incident management, post-mortems, and capacity planning: the operational side of running production systems at scale.

Who should read it

Anyone who runs systems other people use. Chapters 3, 22, and 24 (cascading failures, addressing overload) are essential reading on their own.