IV · Distributed systems
Site Reliability Engineering
What it does
Free to read online from Google. Covers SLOs, error budgets, incident management, postmortems, and capacity planning: the operational side of running production at scale.
Who should read it
Anyone who runs systems other people use. Chapter 3 (embracing risk), Chapter 21 (handling overload), and Chapter 22 (addressing cascading failures) are worth reading even on their own.