15 stages · 163 topics · 91 core

Roadmap

Become a backend engineer.

All fifteen stages. The complete arc — what HTTP is doing, how to pick a database, what makes distributed systems hard, and how to walk into a design interview and not freeze. Start here if you're not sure. Each topic links to a Semicolony deep dive, simulator, or handbook entry where one exists, and to a curated external resource where it doesn't. Follow the arc in order, or jump to wherever you're stuck.

Also available System Design Roadmap → 15 stages, ~80 topics, with an interactive architecture diagram. Also available Frontend Engineer Roadmap → 16 stages, from the browser to frontend system design.

Core (the spine) Recommended (strong upside) Optional (pick if relevant)

Core plus the recommended layer. The optional stops stay hidden — they pay off after you've shipped a couple of production services.

Jump to a stage

01 The internet & HTTP
02 OS & networking foundations
03 Version control
04 Pick a language
05 Design principles & patterns
06 APIs & protocols
07 Databases
08 Caching, layered
09 Web security
10 Testing & quality
11 Containers & orchestration
12 Going horizontal
13 Distributed systems
14 Observability
15 System design: putting it all together

Stage

The internet & HTTP

What actually happens when you type a URL.

Most backend confusion comes from missing this layer. Before you can debug a 502 in production or pick sensibly between gRPC and REST, you need a real picture of what a packet does between your machine and the origin server. DNS, TCP, TLS, then HTTP, and the response coming back the same way.

Core

How the internet works

What a packet actually does between your laptop and a Cloudflare datacenter. Usually four or five autonomous-system hops, sometimes ten when routing gets weird.

Networking curriculum

Internet timeline: how it grew External High Performance Browser Networking (free book) External what is the internet (Cloudflare) External Submarine cable map

Core

DNS

Names become IP addresses through a chain of cached lookups. Runs before every other network thing on the request path, and gets blamed for half the outages it didn't cause.

How DNS works

DNS deep dive Sim DNS resolution simulator External DNS records explained (Cloudflare) External implement DNS in a weekend (Julia Evans) External dig command tutorial

Core

IP addressing

v4 nearly out of space, v6 still half-deployed, CIDR notation everyone fumbles in interviews. The addressing scheme that decides where your packets are even allowed to go.

IP deep dive

BGP & routing Routing protocols External IPv6 articles (Cloudflare blog) External CIDR calculator
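
The CIDR arithmetic mentioned above can be checked with Python's standard `ipaddress` module. This is a small sketch; the `10.0.1.0/24` block and the addresses tested against it are arbitrary examples, not taken from the roadmap.

```python
import ipaddress

# A /24 fixes the first 24 bits and leaves 8 host bits: 2**8 = 256 addresses.
net = ipaddress.ip_network("10.0.1.0/24")
size = net.num_addresses
netmask = str(net.netmask)
broadcast = str(net.broadcast_address)

# Membership: is an address inside the block?
inside = ipaddress.ip_address("10.0.1.42") in net
outside = ipaddress.ip_address("10.0.2.1") in net

# Splitting the /24 into four /26 blocks of 64 addresses each.
subnets = [str(s) for s in net.subnets(new_prefix=26)]

print(size, netmask, broadcast, inside, outside)
print(subnets)
```

Each extra prefix bit halves the block, which is the part people tend to fumble under interview pressure.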

Core

HTTP

The protocol every backend service speaks all day. Knowing methods, status codes, and headers cold saves real time when production breaks at midnight.

How HTTP works

Sim HTTP flow simulator External MDN HTTP reference External HTTP status codes (MDN) External evolution of HTTP (Mozilla)

HTTP/2

Many concurrent streams over a single TCP connection. Mostly solves head-of-line blocking, except when TCP itself blocks all of them at once. Still default on most big sites.

Sim HTTP/2 streams simulator

External http2 explained (free book) External RFC 9113: HTTP/2 External HTTP/2 vs HTTP/1.1 (Cloudflare)

HTTP/3 & QUIC

Same idea as HTTP/2 but over UDP, with 0-RTT resumption and no TCP-level head-of-line blocking. Already serving most of YouTube and Facebook.

Sim HTTP/3 + QUIC simulator

QUIC deep dive External HTTP/3 explained (free book) External what is HTTP/3 (Cloudflare)

Core

HTTPS & TLS

The handshake that gives you a shared secret, the certificate chain that proves who the server actually is, and the reason "encrypted" without identity verification would buy you almost nothing.

How HTTPS works

TLS deep dive External what is TLS (Cloudflare) External Illustrated TLS 1.3 handshake External Test your TLS: SSL Labs

Core

CDN basics

A cache and compute layer at hundreds of points of presence near your users. Usually the first thing in front of any modern origin, and the layer that turns 200ms into 20ms for repeat visitors.

How CDNs work

External what is a CDN (Cloudflare) External VCL & edge logic (Fastly)

Stage

OS & networking foundations

Where requests turn into syscalls.

Processes, threads, file descriptors, memory pages. When a service is misbehaving at two in the morning, the engineers who can fix it usually share the same superpower: they know what the kernel is actually doing underneath the runtime.

Core

Processes

The kernel's basic unit of isolation. Each one gets its own address space, file descriptors, and a slot in the scheduler. The shape of every server you'll ever run.

Processes deep dive

External Operating Systems: Three Easy Pieces (OSTEP, free book) External Beej: Unix IPC guide

Core

Threads

Same address space as the parent process, separate stacks. Cheaper to spin up than processes; the source of the hardest bugs you'll ever debug, since memory is shared by default.

Threads deep dive

Thread pools External Linux Programming Interface (book)

Scheduling

How the kernel decides who runs next on a busy CPU. Linux's CFS, and the EEVDF scheduler that replaced it in kernel 6.6, optimise for fairness; alternative schedulers trade fairness for throughput or latency. The choice starts mattering once you hit core saturation.

Scheduling internals

Sim CPU scheduler simulator External Linux CFS: kernel docs

Core

Memory management

Virtual memory, paging, the allocator. The difference between fixing a memory leak in an afternoon and chasing it through three rewrites is usually how well someone knows this layer.

Memory management

Virtual memory Sim Virtual memory simulator Memory allocation External What every programmer should know about memory (Drepper)

Core

I/O: blocking, non-blocking, async

Three flavours of I/O and the kernel APIs underneath: epoll on Linux, kqueue on BSD, io_uring as the new shiny. Your runtime made one of these choices for you.

I/O internals

External The C10K problem (Kegel) External io_uring by example External What is epoll: explained

File systems

inodes, directories, journaling, fsync. The layer a database's claim to durability actually has to pass through. Read the Postgres fsync-gate story once and it'll stay with you.

File systems

External LWN: kernel filesystems External Postgres fsync gate (postmortem)

Core

IPC: pipes, sockets, shared memory

How two processes on the same machine actually talk to each other. The mechanics that sit underneath every RPC call your service has ever made.

IPC deep dive

External Beej: Unix IPC guide Sockets

Core

Synchronization primitives

Locks, semaphores, atomics, RCU. Each one is a small mistake away from an eight-week debugging session.

Synchronization

Sim Mutex & deadlock simulator External Preshing: lockless programming External Mechanical Sympathy (Martin Thompson)
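
As a minimal sketch of the simplest primitive in that list, here is a lock protecting a shared counter. The `Counter` class and the thread counts are illustrative assumptions; the point is only that the read-modify-write in `increment` has to happen under the lock.

```python
import threading


class Counter:
    """Shared counter whose increment is guarded by a mutex."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # `value += 1` is a read, an add, and a write. Without the lock,
        # two threads can read the same old value and lose an update.
        with self._lock:
            self.value += 1


def run(n_threads=8, per_thread=10_000):
    counter = Counter()

    def work():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


total = run()
print(total)
```

With the lock in place the total is always `n_threads * per_thread`; removing it turns the result into a race that may or may not show up on any given run.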

Core

System calls

The boundary between your program and the kernel. Every read, write, and epoll_wait crosses it. strace will show you the whole conversation in real time.

System calls

Sim Syscall journey simulator External Linux syscall table External strace tutorial

Core

TCP / UDP

TCP buys you ordered, reliable bytes at the cost of a three-way handshake and head-of-line blocking. UDP just sends. Knowing when each fits, and what each costs, is the layer most network bugs happen at.

How TCP works

Sim TCP handshake simulator Sim TCP congestion control Sim TCP vs UDP simulator UDP deep dive External Beej: Network programming guide

Sockets & terminal basics

Berkeley sockets, signals, file descriptors, chmod. The shell-level OS toolkit.

Sockets

External MIT: The Missing Semester chmod calculator External explainshell.com

Stage

Version control

How code moves through time and across people.

Git, in practice. The basic commands are an afternoon. The mental model of the commit DAG, plus the difference between rebase, cherry-pick, and revert, is what you actually need the day a deploy breaks and you have to rewind a release branch with the team watching.

Core

Git fundamentals

Commits, branches, merges, rebase. The mental model of the DAG.

How Git works

External Pro Git (free book) External Atlassian Git tutorials External Learn Git Branching (interactive)

Core

Branching strategies

Trunk-based vs Git Flow vs GitHub Flow. The cultural choice that shapes your CI.

External trunkbaseddevelopment.com

External GitHub Flow External Git Flow: original (Driessen)

Rebase vs merge

Linear history vs preserved-context. The team's aesthetic choice that's actually about reviewability.

External rebase vs merge (Atlassian)

External Pro Git: rewriting history

Conflict resolution

Three-way merges, rerere, the calm 30 seconds before reaching for git reset.

External Git rerere docs

External Oh Shit, Git!?! External Dangit, Git!?!

Core

Code review practices

What good review looks like. Tone, scope, what to leave for a follow-up.

External code review guide (Google)

External Conventional Comments External Thoughtbot: code review guide

Conventional commits & semantic versioning

A vocabulary every CI tool can read. Cheap to adopt, useful forever.

External Conventional Commits

External Semantic Versioning (SemVer)
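
To see how the two conventions fit together, here is a rough sketch that picks the next version from a commit message. It assumes plain `MAJOR.MINOR.PATCH` versions with no pre-release tags, and the `parse` and `bump` helpers are hypothetical names, not part of any tool.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version: str, commit_message: str) -> str:
    """Choose the next version from a Conventional Commits message."""
    major, minor, patch = parse(version)
    header = commit_message.splitlines()[0]
    type_part = header.split(":", 1)[0]  # e.g. 'feat(api)!' or 'fix'

    # A '!' after the type/scope or a BREAKING CHANGE footer means MAJOR.
    if type_part.endswith("!") or "BREAKING CHANGE:" in commit_message:
        return f"{major + 1}.0.0"
    if type_part.startswith("feat"):
        return f"{major}.{minor + 1}.0"
    if type_part.startswith("fix"):
        return f"{major}.{minor}.{patch + 1}"
    return version  # docs, chore, refactor, ...: no release implied


print(bump("1.4.2", "fix: handle empty payload"))
print(bump("1.4.2", "feat(api): add cursor pagination"))
print(bump("1.4.2", "feat(api)!: drop the v1 endpoints"))
```

Parsing into integer tuples matters: compared as strings, "1.10.0" sorts before "1.9.3".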

Branch · pick one

Pick a language

Depth in one beats breadth in five.

Backend roles ask for fluency in one of these, not all five. Go is the pragmatic default. Rust if you want the safest concurrency story and can afford the steeper curve. Node when sharing code with the web tier matters; Java, Python, and C# still run most of the enterprise stack. Pick the one you can defend in an interview. The others can come later.

Core

Go

The pragmatic choice for backend services. Concurrency, tooling, deploys: all simple.

Go curriculum

Go internals Goroutines & channels Sim Goroutine scheduler simulator External A Tour of Go External Effective Go External Go by Example External Go 100 mistakes (book site)

Core

JavaScript / TypeScript

Node services, the event loop, async/await. The lingua franca of the web tier.

JavaScript curriculum

Event loop deep dive Sim Event loop simulator V8 engine internals External javascript.info External Eloquent JavaScript External TypeScript handbook External Total TypeScript (Matt Pocock)

Rust

Ownership, borrowing, fearless concurrency. Steepest curve, biggest payoff for systems work.

Rust curriculum

Borrow checker internals Async runtime External The Rust Book External Rustlings: exercises External Rust by Example External Jon Gjengset: Crust of Rust (YouTube)

Python

Glue language of the industry. Data, scripts, services with FastAPI or Django.

External Python tutorial: official

External Real Python External Fluent Python (book site) External FastAPI docs External Django tutorial

Java / Kotlin

The most mature backend runtime alive. Spring still runs half the enterprise.

External Java: Oracle tutorials

External Effective Java (book) External Baeldung: Spring guides External Kotlin docs

Stage

Design principles & patterns

How to make code that doesn't rot.

The shared vocabulary every code review runs on. SOLID isn't about reciting five letters. It's about hearing "this violates Open-Closed" in a pull request and seeing what the reviewer means without breaking stride. Same with KISS, YAGNI, DRY, the twelve-factor app, and the patterns book on every senior shelf.

Core

SOLID

Single responsibility, Open-closed, Liskov, Interface segregation, Dependency inversion.

External SOLID (Wikipedia)

External Uncle Bob: The Principles of OOD External Stackify: SOLID examples

Core

KISS, YAGNI, DRY

Three short slogans that prevent more bad code than every senior review combined.

External KISS (Wikipedia)

External YAGNI (Wikipedia) External DRY (Wikipedia) External Martin Fowler: YAGNI

Core

Design patterns

Factory, Strategy, Observer, Adapter. The pattern catalog every senior engineer carries.

External patterns (Refactoring Guru)

External Design Patterns (GoF): reference External SourceMaking: patterns

Core

Refactoring

Small, behaviour-preserving changes that make the next change easy.

External refactoring catalog (Martin Fowler)

External Refactoring Guru: refactoring

Clean architecture / hexagonal

Push the framework to the edges. Test the core without standing up a database.

External clean architecture (Uncle Bob)

External Hexagonal: Alistair Cockburn

Domain-driven design

Bounded contexts, ubiquitous language, aggregates. The vocabulary for any non-trivial domain.

External DDD reference (Evans, free PDF)

External Microsoft Learn: DDD basics External Awesome DDD (curated)

Core

Twelve-Factor App

The cleanest statement of "what makes a service deployable." Still load-bearing in 2026.

External 12factor.net

External Beyond 12-factor (Central)

Software architecture patterns

Layered, event-driven, microkernel, CQRS. The catalog above the design-pattern layer.

External Mark Richards: architecture styles (free O'Reilly)

External cloud architecture patterns (Microsoft)

Stage

APIs & protocols

How services talk to each other.

REST is the default; every modern stack also has at least one of gRPC, GraphQL, or WebSockets in it. Each one's shaped for a different problem. Knowing what each costs in latency, bytes on the wire, and operational surface, plus when to reach for it, is the difference between an architecture that ages well and one that bends under its own weight at year three.

Core

REST

Resources, verbs, status codes. The default for public APIs.

REST deep dive

External REST API design (Microsoft) External REST API tutorial External Roy Fielding's thesis (chapter 5)

Core

JSON

The default wire format. Know the spec, the gotchas, the alternative encodings.

JSON deep dive

External JSON spec: ECMA-404 JSON formatter JSON to Go

Core

gRPC

Schema-first, HTTP/2 transport, streaming. The default for internal service-to-service.

gRPC deep dive

Sim gRPC vs REST simulator External grpc.io: official docs External gRPC core concepts

Core

Protocol Buffers

Schema language for gRPC. Binary, fast, evolvable. The alternative to JSON for internal traffic.

Protobuf deep dive

Sim JSON vs Protobuf simulator External protobuf.dev: official External Protobuf language guide

GraphQL

One endpoint, one query language. A good fit when clients over-fetch or under-fetch from fixed REST endpoints and need to ask for exactly the fields they use.

GraphQL deep dive

External graphql.org: learn External Apollo: odyssey course External GraphQL: official spec

Core

WebSockets

Persistent bidirectional connections. For chat, presence, anything pushed from server.

How WebSockets work

Realtime communication WebSockets & SSE: codex External RFC 6455: WebSocket protocol

Server-Sent Events

One-way streaming over HTTP. Often the right answer when WebSockets are overkill.

SSE deep dive

External Server-Sent Events (MDN)

Core

OpenAPI / Swagger

Schema definition for REST APIs. Generates clients, mocks, docs. Adopt it on day one.

External OpenAPI Initiative

External Swagger: getting started External Stoplight: OpenAPI guide

API versioning

Path vs header vs date. The choice that bites once your API has external consumers.

API versioning

External API versioning (Stripe)

Webhooks

When you need the server to call you back. Stripe, GitHub, Slack all built on this.

Webhooks

External webhooks guide (Stripe) External Standard Webhooks

Core

API authentication

API keys, OAuth tokens, mTLS, signed requests. The first thing every API gateway terminates.

Auth in API design

External what is mTLS (Cloudflare) External request signing (SigV4) (AWS)

Core

Idempotency at the API layer

Idempotency keys, exactly-once-from-the-client. Stripe-style retry safety.

Idempotence in distributed systems

External idempotency keys (Stripe) External Brandur: idempotency keys
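
A minimal in-memory sketch of the idempotency-key idea: the server remembers the response it produced for each key and replays it on a retry instead of running the side effect again. `IdempotentHandler` and `charge` are hypothetical names, and the sketch ignores concurrency; a real service would persist the key alongside the side effect.

```python
from typing import Any, Callable, Dict


class IdempotentHandler:
    """Wraps a handler so a repeated idempotency key replays the first response."""

    def __init__(self, handler: Callable[[Dict[str, Any]], Dict[str, Any]]):
        self._handler = handler
        self._responses: Dict[str, Dict[str, Any]] = {}

    def handle(self, idempotency_key: str, request: Dict[str, Any]) -> Dict[str, Any]:
        if idempotency_key in self._responses:
            # Retry of a request we already processed: no second side effect.
            return self._responses[idempotency_key]
        response = self._handler(request)
        self._responses[idempotency_key] = response
        return response


charges = []  # stands in for the external side effect


def charge(request):
    charges.append(request["amount"])
    return {"status": "charged", "charge_id": len(charges)}


api = IdempotentHandler(charge)
first = api.handle("key-123", {"amount": 500})
retry = api.handle("key-123", {"amount": 500})   # client retried after a timeout
other = api.handle("key-456", {"amount": 500})   # a genuinely new request
print(first, retry, other, charges)
```

The client generates the key once per logical operation and reuses it on every retry, which is what makes the retry safe.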

API best practices

Pagination, filtering, error envelopes, rate-limit headers. The patterns that age well.

Best practices

External Zalando RESTful API guidelines External Microsoft REST API guidelines

Stage

Databases

Where state lives, and what makes it hard.

The choice that's hardest to undo. Know the spectrum from a single Postgres on RDS to a sharded distributed store with secondary indexes. More importantly, know the signals that tell you it's time to move up that spectrum. Migrate too early and you've added complexity for nothing. Migrate too late and you're looking at a six-month outage backlog.

Core

Relational fundamentals

Schemas, keys, joins, normalisation. The model that runs most of the internet, mostly on Postgres and MySQL, and probably still will in ten years.

Databases curriculum

Choosing a database External official tutorial (PostgreSQL) External SQLZoo: interactive SQL

Core

SQL & joins

INNER, LEFT, RIGHT, FULL, CROSS. The difference between them is the difference between answering a question correctly and answering a different question that sounded the same.

Sim SQL JOIN simulator

External SQL for Smarties (Celko): book External Mode: SQL tutorial

Core

Indexes

B-trees, covering indexes, partial indexes. EXPLAIN ANALYZE is the most under-used command in the toolbox; the gap between engineers who run it and engineers who don't is usually a factor of ten in production query times.

Database indexing

Sim B-tree simulator B-tree deep dive Database indexing: handbook External Use The Index, Luke!
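
The habit of checking the plan can be practised without a Postgres install. The sketch below uses SQLite from Python's standard library as a stand-in, so the command is SQLite's `EXPLAIN QUERY PLAN` rather than Postgres's `EXPLAIN ANALYZE`; the `users` table and index name are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)


def plan(sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


query = "SELECT name FROM users WHERE email = ?"
before = plan(query, ("user500@example.com",))

conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query, ("user500@example.com",))

print("before:", before)  # a full table scan
print("after: ", after)   # a search that names idx_users_email
```

The exact wording of the plan text varies between SQLite versions, but the shift from a scan to an index search is the thing to look for.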

Core

Transactions & ACID

Atomicity, consistency, isolation, durability. None of them are free; each one costs latency or throughput. Knowing what each costs is what makes the trade-off conversations productive.

Sim ACID simulator

Transactions External Designing Data-Intensive Applications: ch. 7 External concurrency control (PostgreSQL)
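
Atomicity is the easiest of the four to demonstrate. In this sketch, again using SQLite as a stand-in, a transfer credits one account and then debits another; when the debit violates a constraint, the credit is rolled back too. The `accounts` table and `transfer` helper are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer("alice", "bob", 60)      # alice 40, bob 60
failed = transfer("alice", "bob", 60)  # debit would go negative, so nothing applies
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The credit to bob in the second call really did execute before the failure; it is the rollback that makes it disappear.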

Core

Isolation levels

Read-uncommitted through serializable. Each level prevents a specific class of anomaly and admits another. Most databases default to a level that's weaker than you probably want.

Sim Isolation levels simulator

Isolation levels deep dive Paper A critique of ANSI SQL isolation (Berenson et al.) External consistency models (Jepsen)

MVCC

How Postgres lets readers and writers not block each other. Every senior database interview probes it, and most engineers can describe the mechanism without grasping what it costs at vacuum time.

MVCC deep dive

External Postgres: MVCC chapter External Bruce Momjian: MVCC unmasked

Core

WAL & crash recovery

Write the change to a log, fsync, then update the table. The single mechanism every durable database has agreed on, more or less unchanged since the eighties.

WAL deep dive

Sim WAL recovery simulator How WAL works Paper ARIES: recovery (Mohan et al.)
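
The log-then-apply ordering can be sketched in a few lines. This toy key-value store is an illustration under heavy assumptions (one JSON record per line, no checkpoints, no checksums), not how any real database lays out its WAL; `TinyKV` is a made-up name.

```python
import json
import os
import tempfile


class TinyKV:
    """Toy key-value store that appends to a log before updating memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.table = {}
        self._replay()  # crash recovery: rebuild state from the log
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn record from a crash mid-write: stop here
                self.table[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Write the change to the log. 2. fsync. 3. Only then apply it.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.table[key] = value

    def close(self):
        self._log.close()


path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = TinyKV(path)
db.put("a", 1)
db.put("b", 2)
db.put("a", 3)
db.close()

recovered = TinyKV(path)  # simulates a restart with empty memory
state = dict(recovered.table)
recovered.close()
print(state)
```

Because the in-memory table is only ever a replay of the log, losing it costs nothing; that is the whole trick.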

Storage engines (B-tree vs LSM)

B-tree wins reads; LSM wins writes. The layer beneath your SQL plan, and the choice that decides whether your database is fast at the workload you actually have.

Sim Storage engine simulator

LSM-tree deep dive Paper LSM paper: annotated External Database Internals (Petrov): book

Page cache & buffer pool

Where the OS, the DB, and your hot data argue about who owns memory.

Page cache deep dive

External Linux page cache: kernel docs External Postgres shared buffers: wiki

Query planner

EXPLAIN, EXPLAIN ANALYZE. Cost-based vs rule-based. Why your index is being ignored.

Query planner

Sim SQL query execution simulator External Postgres: using EXPLAIN External PEV: explain visualiser

Core

NoSQL: when

Four shapes: key-value, document, wide-column, graph. Each fits one access pattern very well and the others badly. The rule for picking is access pattern first, feature checklist last.

NoSQL databases

External NoSQL design (AWS) External Cassandra: modeling tutorial

Distributed SQL

Spanner, CockroachDB, YugabyteDB, TiDB. ACID transactions across many nodes, paid for with higher write latency than a single-box Postgres can give you. Usually worth it once you can't fit on one box.

Distributed SQL

Paper Spanner paper: annotated Paper F1 paper: annotated External design doc (CockroachDB)

Search engines

Inverted indexes, TF-IDF, BM25. Elasticsearch, OpenSearch, Solr, Meilisearch, Typesense.

External Elasticsearch: guide

External Apache Solr: docs External Meilisearch: quick start External Lucene: fundamentals

Stage

Caching, layered

From a hashmap to a global edge fabric.

Every fast system caches at four or five layers. Picking which layer to cache at, deciding what your TTL actually means, and recovering when a popular key expires and fifty thousand requests hit the origin in the same second: those are the operational skills that turn a snappy product into one that survives a launch.

Core

Caching strategies

Cache-aside, read-through, write-through, write-behind. Pick a pattern; defend it.

Caching strategies

How caching works External caching strategies (AWS) External caching guidance (Microsoft)

Core

Eviction policies

LRU, LFU, ARC, TinyLFU, W-TinyLFU. Each one has a workload it wins on.

Sim Cache eviction simulator

Sim LRU cache simulator External Caffeine (Java): TinyLFU External cache replacement policies (Wikipedia)
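
LRU is the usual starting point, and a minimal sketch fits on one screen using `collections.OrderedDict`. The `LRUCache` class below is illustrative only; it is not thread-safe and has no TTLs.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used

    def keys(self):
        return list(self._items)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" is now more recent than "b"
cache.put("c", 3)     # over capacity: "b" is evicted
print(cache.keys())
```

A scan over many cold keys would flush this cache entirely, which is the workload where the frequency-aware policies in the list tend to do better.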

Core

Redis

The de-facto in-memory store. Strings, sets, sorted sets, streams. And why "single-threaded" is fine.

How Redis works

Sim Redis operations simulator External redis.io: documentation External Try Redis: interactive External data structures (Redis)

Core

CDN

The cache at the edge. PoPs, cache headers, invalidation lag.

How CDNs work

External how a CDN works (Cloudflare) External edge platform (Fastly) External AWS CloudFront: docs

Core

HTTP cache headers

Cache-Control, ETag, Vary. The contract between origin and every cache in front of it.

External HTTP caching (MDN)

External RFC 9111: HTTP caching External web.dev: HTTP cache

Core

Stampedes & invalidation

Coalescing, jittered TTLs, negative caches. The two failure modes behind most incidents.

Advanced caching

Sim Thundering herd simulator External Instagram: solving cache stampedes

Stage

Web security

The threats most backend incidents come from.

You won't become a security expert from this page. Security is its own seven-year apprenticeship. The goal here is more modest: know enough of OWASP, OAuth, CORS, JWT, and the basics of crypto to not be the engineer who shipped the bug that landed the company in the press.

Core

OWASP Top 10

The canonical list of web vulnerabilities, refreshed every few years.

External OWASP Top 10: official

External OWASP Cheat Sheet Series External PortSwigger Web Security Academy

Core

Hashing & password storage

bcrypt, scrypt, Argon2. Never roll your own. Salt, iteration cost, the lot.

External password storage cheat sheet (OWASP)

Sim Password hashing simulator External RFC 9106: Argon2 External bcrypt (Wikipedia)
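
A minimal sketch of salted, cost-parameterised hashing using `hashlib.scrypt` from the standard library (available when Python is built against OpenSSL 1.1 or later). The cost parameters below are illustrative assumptions, not a recommendation; take production values from current guidance such as the OWASP cheat sheet.

```python
import hashlib
import hmac
import os

# Assumed, illustrative scrypt cost parameters.
N, R, P = 2**14, 8, 1


def hash_password(password: str):
    """Return (salt, digest). A fresh random salt is generated per password."""
    salt = os.urandom(16)
    digest = hashlib.scrypt(
        password.encode("utf-8"), salt=salt, n=N, r=R, p=P, dklen=32
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    candidate = hashlib.scrypt(
        password.encode("utf-8"), salt=salt, n=N, r=R, p=P, dklen=32
    )
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate, expected)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))
print(verify_password("wrong guess", salt, stored))
```

Store the salt and the cost parameters next to the digest so the cost can be raised later without invalidating existing hashes.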

Symmetric & asymmetric crypto

AES, RSA, ECC. What you encrypt with vs what you sign with.

External what is encryption (Cloudflare)

External Crypto 101 (Laurens Van Houtven) External Cryptopals challenges

TLS: beyond the handshake

Cipher suites, certificate validation, mutual TLS, TLS 1.3.

TLS deep dive

External SSL Labs: server test External illustrated TLS 1.3 External server-side TLS guidelines (Mozilla)

Core

OAuth 2.0

The framework every "login with X" rides on. The flows, the tokens, the traps.

How OAuth works

Auth in API design External RFC 6749: OAuth 2 External oauth.net: overview External Aaron Parecki: OAuth tutorial

Core

OpenID Connect

Identity layer on top of OAuth. ID tokens, userinfo, the standard "sign in with Google" flow.

How OIDC works

External OIDC: official spec External Auth0: OIDC intro

Core

JWT

Stateless tokens with claims. Useful, footgunny. The lifecycle is the part to know.

Sim JWT lifecycle simulator

JWT encoder tool External jwt.io: debugger External RFC 7519: JWT

Core

CORS

Why your fetch() is sometimes blocked by the browser and sometimes works, even though the same request succeeds from curl.

Sim CORS preflight simulator

External CORS (MDN) External web.dev: CORS

Core

CSRF, XSS, SQL injection

The three classic web vulns. Each has a one-page mitigation that most teams skip.

External CSRF cheat sheet (OWASP)

External OWASP: XSS cheat sheet External OWASP: SQL injection prevention External PortSwigger: XSS

Core

Security headers

CSP, HSTS, X-Frame-Options, Permissions-Policy. The free defence-in-depth layer.

External security headers (MDN)

External securityheaders.com: scanner External web.dev: CSP

mTLS

Mutual TLS. Both client and server prove identity. The internal-service standard.

External what is mTLS (Cloudflare)

External SPIFFE/SPIRE: workload identity

Core

Rate limiting

Token bucket, leaky bucket, fixed/sliding window. The first thing in front of a public API.

Sim Rate limiter simulator

Playbook Rate limiter: playbook External rate limiters (Stripe)
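
A minimal token bucket sketch: tokens refill at a steady rate up to a cap, and each request spends one. The `TokenBucket` class is illustrative, and the injectable `clock` parameter is an assumption added so the behaviour can be shown without real sleeps.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, then a sustained `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


fake_now = [0.0]  # controllable clock for the demo
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(3)]  # two tokens available, third is refused
fake_now[0] = 1.0                           # one second later: one token refilled
after_refill = bucket.allow()
print(burst, after_refill)
```

Capacity sets the burst size and rate sets the long-run limit, which is why the two are tuned separately.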

Stage

Testing & quality

The tests that actually catch bugs.

The test framework matters less than the habits. Two rules go a long way: write the test before the fix, and weight integration tests heavier than unit tests. Unit tests catch the bugs you imagined; integration tests catch the bugs your customers actually find.

Core

Test pyramid

Unit > integration > E2E. The shape that catches the most bugs per second of CI time.

External Vocke (Practical Test Pyramid)

External test pyramid revisited (Honeycomb)

Core

Unit tests

Test one thing. Run them in milliseconds. The bedrock.

Testing in Go (example)

Testing in Rust External Martin Fowler: UnitTest External Kent Beck: Test-Driven Development (book)

Core

Integration tests

Real database, real network, real environment. The tests that catch the real bugs.

External integration test (Martin Fowler)

External Testcontainers: docs External Postgres testing patterns

End-to-end

Slowest, flakiest, most expensive. Necessary for the critical paths.

External official docs (Playwright)

External Cypress: docs External Selenium: docs

Core

Mocking & test doubles

Stubs, spies, mocks, fakes. What each one is for and when each is wrong.

External test doubles (Martin Fowler)

External Martin Fowler: mocks aren't stubs

Property-based testing

Generate the test cases. Catch bugs your hand-written tests never thought of.

External Hypothesis (Python)

External QuickCheck: original (Haskell) External PropEr (Erlang)

Contract testing

Pact, Spring Cloud Contract. Catch API breakages before integration tests run.

External official docs (Pact)

External Martin Fowler: contract test

Core

Load testing

k6, Vegeta, Locust. Find the cliff before production does.

Load testing deep dive

External k6: docs External Locust: docs External Vegeta: repo

Stage

Containers & orchestration

The shape every production deploy ends up in.

Docker is universal; Kubernetes is what most production stacks run on. You can be productive with both at a surface level in a few weeks. Knowing what's happening underneath (namespaces, cgroups, the kubelet, the scheduler) is what saves you the day a pod won't start and the logs aren't helpful.

Core

Containers: under the hood

Namespaces, cgroups, layered filesystems. Why Docker is "not a VM."

How containers work

Sim Container layers simulator External Docker: overview External Liz Rice: containers from scratch (talk)

Core

Dockerfile best practices

Layer caching, multi-stage builds, distroless. The 10-line file that determines a 300 MB image.

External Docker: Dockerfile best practices

External Distroless images (Google) External Snyk: Docker security best practices

Container registries

Docker Hub, GHCR, ECR, Artifact Registry. Anonymous pull rate limits are the most common surprise.

External Docker Hub: docs

External GitHub Container Registry External OCI image spec

Core

Kubernetes basics

Pods, deployments, services, ingress. The minimum to be productive.

Kubernetes curriculum

K8s architecture External Kubernetes: official docs External Kubernetes the Hard Way (Kelsey Hightower)

K8s control plane

API server, etcd, controllers, scheduler, kubelet. The five pieces that keep your declared state real.

Controllers

Scheduler Kubelet API server etcd

K8s networking & ingress

Services, NetworkPolicy, Ingress, Gateway API. CNI plugins do the heavy lifting.

K8s networking

K8s networking primer External Kubernetes Gateway API External learnk8s: networking guides

Core

Helm & Kustomize

Templating vs overlays. Two ways to keep YAML from becoming a 12-thousand-line copy-paste.

External Helm: official

External Kustomize: docs External Helm best practices

Core

CI/CD

GitHub Actions, GitLab CI, ArgoCD, Flux. The pipeline from commit to production.

External GitHub Actions: docs

External GitLab CI: docs External ArgoCD: docs External Flux: docs

Infrastructure as code

Terraform, Pulumi, OpenTofu. The cluster you can recreate from a file.

External Terraform: docs

External OpenTofu (Terraform fork) External Pulumi: docs External Crossplane: docs

Stage

Going horizontal

When one box stops being enough.

The handful of techniques that turn "works on my laptop" into "works at a million requests per second." Load balancing, sharding, replication, autoscaling, edge. Each one solves a real problem and adds three new ones. Knowing the trade-offs is what separates engineers who scale systems from engineers who just add more boxes.

Core

Scaling out vs up

Vertical vs horizontal. When to add a box vs a bigger box.

Scaling out

Monolith limits External scaling patterns (Microsoft)

Core

Load balancing

L4 vs L7, round-robin vs least-connections, sticky sessions, health checks.

How load balancing works

Sim Load balancer simulator Load balancing handbook LB deep dive Paper Maglev paper: annotated

Web servers

Nginx, Apache, Caddy. The 30-year-old layer that still sits in front of most web traffic.

External Nginx: official docs

External Apache HTTPD: docs External Caddy: docs External Envoy proxy: docs

Core

Reverse proxy & API gateway

The traffic-shaping layer in front of your services.

API gateway

Reverse proxy External Kong gateway: docs External what is an API gateway (Cloudflare)

Core

Sharding & partitioning

Splitting a database when one node stops fitting. Hardest to undo.

Sharding deep dive

When to shard Sim Database sharding simulator Sim Consistent hashing simulator External Vitess: docs
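
Part of why sharding is hard to undo is that a naive `hash(key) % n` scheme remaps almost every key when n changes. Consistent hashing limits that movement. The sketch below is a rough illustration with virtual nodes; the `HashRing` class, shard names, and vnode count are all assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node owns many points on the ring so load spreads evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The owner is the first point clockwise from the key's position.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["shard-a", "shard-b", "shard-c"])
after = HashRing(["shard-a", "shard-b", "shard-c", "shard-d"])

moved_keys = [k for k in keys if before.node_for(k) != after.node_for(k)]
destinations = {after.node_for(k) for k in moved_keys}
print(len(moved_keys), destinations)
```

Adding a fourth shard moves only the keys that the new shard takes over, roughly a quarter of them, and none shuffle between the existing shards.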

Core

Replication

Read replicas, primary-replica, multi-leader. Three flavours, three cost profiles.

Replication deep dive

External DDIA: chapter 5 (replication) External Postgres replication: docs

Autoscaling

Reactive vs predictive, scale-out vs scale-in, cold starts.

How autoscaling works

Sim Autoscaling simulator External autoscaling concepts (AWS) External K8s HPA: docs

Service discovery

How services find each other in a dynamic fleet. DNS-based, registry-based, mesh-driven.

Service discovery

External HashiCorp Consul: docs

Core

Capacity planning

Little's Law, queueing primer, back-of-envelope. Turn a request rate into a count of cores.

Capacity planning

Queueing theory How to estimate cost
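
Little's Law says the average number of requests in flight equals arrival rate times average time in the system, L = λW. The back-of-envelope sketch below applies it; the traffic numbers and the 60% utilisation target are assumptions picked for illustration.

```python
import math


def requests_in_flight(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * mean_latency_s


def cores_needed(arrival_rate_per_s: float,
                 cpu_seconds_per_request: float,
                 target_utilisation: float = 0.6) -> int:
    """Cores kept busy by the load, padded so they run below saturation."""
    busy_cores = arrival_rate_per_s * cpu_seconds_per_request
    return math.ceil(busy_cores / target_utilisation)


# Assumed workload: 2,000 req/s, 50 ms mean latency, 5 ms of CPU per request.
concurrency = requests_in_flight(2000, 0.050)
cores = cores_needed(2000, 0.005)
print(concurrency, cores)
```

Here about 100 requests are in flight at any moment and the CPU work alone keeps 10 cores busy, so a 60% utilisation target rounds up to 17 cores.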

Stage

Distributed systems

When your single service becomes a cluster.

The point where backend engineering stops being mostly about correctness and starts being about correctness under failure. Consensus, replication, ordering, idempotence. Every senior loop reaches for at least three of these. The return on investment for time spent here is enormous.

Core

CAP & PACELC

When the network partitions, you have to pick between consistency and availability. There is no third option. PACELC extends the trade-off to the no-partition case, where the choice is latency vs consistency.

CAP & PACELC deep dive

Sim CAP theorem simulator External Brewer's original keynote External Daniel Abadi: PACELC

Core

Consensus & Raft

How a cluster of replicas agrees on the next entry in a shared log. Paxos was the original; Raft made the same idea legible enough you can implement it from the paper in a weekend.

Consensus deep dive

Sim Raft simulator Paper Raft paper: annotated External raft.github.io: visualisation Leader election

Core

Quorum reads/writes

The R + W > N rule. Why a write acknowledged by two of three replicas, followed by a read from any two of three, is guaranteed to touch at least one replica that has the write. Shows up in Dynamo, Cassandra, etcd, and roughly every distributed store ever built.

Sim Quorum simulator

Quorum deep dive Paper Dynamo: sloppy quorums
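
The R + W > N rule can be checked by brute force on small clusters. The sketch below enumerates every possible write set and read set and confirms they always share a replica exactly when R + W > N. The function name is illustrative, and the model assumes strict quorums, not the sloppy quorums Dynamo describes.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every W-sized write set shares a replica with every R-sized read set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# N=3, W=2, R=2: any read touches at least one replica that took the write.
print(quorums_always_overlap(3, 2, 2))
# N=3, W=2, R=1: a read can land on the one replica that missed the write.
print(quorums_always_overlap(3, 2, 1))
```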

Core

Two-phase commit & sagas

Two-phase commit gives you atomic commits across services, at the price of blocking when the coordinator fails. Sagas give up atomicity and isolation and substitute compensating transactions. Across service boundaries, sagas are the more common choice.

2PC & sagas

External microservices.io: Saga pattern Paper Calvin: deterministic transactions
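
A saga in miniature: run the steps in order, and if one fails, run the compensations for the completed steps in reverse. This is an in-process sketch with hypothetical step names; a real saga persists its progress so that compensation survives a crash.

```python
def run_saga(steps):
    """steps is a list of (action, compensation) callables. Returns True if all succeed."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo the steps that already committed, newest first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log = []


def step(name, fail=False):
    """Build an (action, compensation) pair that records what it did."""
    def action():
        if fail:
            raise RuntimeError(f"{name} failed")
        log.append(f"do {name}")

    def compensate():
        log.append(f"undo {name}")

    return action, compensate


ok = run_saga([step("reserve stock"), step("charge card"),
               step("book courier", fail=True)])
print(ok, log)
```

Between the charge and its refund, other readers can see the intermediate state. That window is the isolation a saga gives up.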

Core

Idempotence

The property that makes a retry safe to send a second time. Reliable APIs commonly enforce it through idempotency keys; Stripe's write-up on this is the usual reference.

Idempotence

External Brandur: idempotency keys External designing idempotent APIs (Stripe)
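
A minimal sketch of the idempotency-key idea: the server stores the response under the client's key and replays it on a retry instead of repeating the side effect. The class and field names are hypothetical, and the store is an in-memory dict; a production version needs a durable store, an atomic check-and-insert, and a policy for a reused key with different parameters.

```python
class PaymentService:
    def __init__(self):
        self._responses = {}   # idempotency key -> stored response
        self.charges_made = 0  # counts real side effects

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a known key replays the first response.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges_made += 1
        response = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._responses[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge("order-42", 1999)
retry = service.charge("order-42", 1999)  # e.g. the client timed out and resent
print(first == retry, service.charges_made)
```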

Time & clocks

Why "now" is hard across machines. Lamport gave us logical clocks; vector clocks extend that to detect concurrent events; Google's TrueTime bounds the clock uncertainty instead, using atomic clocks and GPS receivers in each datacenter.

Time & clocks

Paper Lamport: time, clocks, events External TrueTime: Spanner whitepaper External clocks talk (Martin Kleppmann)
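
How vector clocks detect concurrency, as a short sketch. Each clock is a dict from node name to counter, with missing entries read as zero; the representation and the function name are assumptions made for illustration.

```python
def compare(a: dict, b: dict) -> str:
    """Order two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"   # a happened before b
    if b_le_a:
        return "after"    # b happened before a
    return "concurrent"   # neither could have seen the other


# Node b received a's first event before writing: causally ordered.
print(compare({"a": 1}, {"a": 1, "b": 1}))
# Both nodes wrote without hearing from each other: concurrent.
print(compare({"a": 2}, {"b": 1}))
```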

Gossip protocols

Epidemic-style information spread. The substrate of Cassandra, Consul, SWIM.

Gossip protocols

External SWIM paper: Das et al.

Failure detectors

How a cluster figures out who's alive. Phi-accrual, and the eventually strong (◇S) failure-detector class.

Failure detectors

Sim Split-brain simulator External Phi-accrual paper (Hayashibara) Paper FLP impossibility

CRDTs

Data types that converge regardless of order. Real-time collab without coordination.

Paper CRDT paper: Shapiro et al.

External crdt.tech: index of resources External CRDTs talk (Martin Kleppmann)
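
The smallest useful CRDT is a grow-only counter. Each node increments its own slot, and merge takes the per-node maximum, so replicas converge whatever order the merges arrive in. A sketch with illustrative names:

```python
class GCounter:
    """Grow-only counter: one slot per node, merge by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # max is commutative, associative, and idempotent, so merge order
        # and duplicate deliveries do not change the result.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
b.merge(a)  # a duplicate merge is harmless
print(a.value(), b.value())
```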

Core

Backpressure & retries

The control-loop discipline that prevents one slow consumer from taking down the system.

Backpressure & retries

Sim Circuit breaker simulator Sim Retry strategy simulator External exponential backoff & jitter (AWS)

Core

Message queues

Kafka for the durable log shape, RabbitMQ for routing patterns, SQS for managed simplicity, NATS for low latency, Pulsar for multi-tenancy. The async layer behind nearly every reliable service.

How message queues work

Kafka, as a river When to introduce a queue External official docs (Kafka) External RabbitMQ: tutorials External NATS: docs External Apache Pulsar: docs

Microservices patterns

Saga, outbox, CQRS, event sourcing, service mesh. The shape of every modern stack.

External microservices.io: patterns

External microservices.io: Saga pattern External Outbox pattern (Confluent) External Building Microservices (Sam Newman)

Core

Async architecture

Events, queues, idempotence, at-least-once delivery. The four building blocks of an asynchronous system and what each costs.

Async architecture

Orchestration & resiliency

Distributed tracing

OpenTelemetry, Jaeger, Zipkin. Following a request across services.

External OpenTelemetry: official

External Jaeger: docs Paper Dapper paper (Google): annotated

Stage

Observability

Knowing what your system is doing, in production.

Three signals (logs, metrics, traces) plus two methodologies (USE for resource saturation, RED for request health). Together they cover most of what an on-call rotation actually needs. The harder skill isn't the tooling; it's instrumenting your service ahead of the incident so the data is already there when you go looking.

Core

The three pillars

Logs (what happened), metrics (how much), traces (how it flowed).

External Achieving Observability (free book) (Honeycomb)

External CNCF Observability Whitepaper

Core

RED method

Rate, Errors, Duration. The three numbers every request-driven service tracks.

RED method deep dive

External Tom Wilkie: The RED method
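
The three RED numbers computed from one window of request records. The record shape, the choice to count 5xx responses as errors, and the nearest-rank p99 are all assumptions for this sketch; in practice a metrics library computes these from counters and histograms.

```python
def red_summary(requests: list[dict], window_seconds: float) -> dict:
    """Rate, Errors, Duration for one window of request records."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    durations = sorted(r["duration_ms"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    # Nearest-rank p99 using integer ceiling division.
    rank = (99 * len(durations) + 99) // 100
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p99_ms": durations[rank - 1],
    }


# 100 requests over 10 seconds, two of them failing, durations 1..100 ms.
sample = [{"status": 500 if i < 2 else 200, "duration_ms": i + 1}
          for i in range(100)]
print(red_summary(sample, window_seconds=10))
```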

Core

USE method

Utilisation, Saturation, Errors. The resource-side complement to RED.

USE method deep dive

External USE method (Brendan Gregg)

Core

SLI / SLO / SLA

The numerical contract with your users, and with yourself.

External Google SRE book: SLOs

External Google SRE workbook: SLOs External SLO guide (Datadog)
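
An SLO becomes concrete once it is converted into an error budget. A small sketch of the arithmetic, assuming an availability SLO over a rolling window measured in days; the function name is illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO allows in the window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


# Each extra nine cuts the budget by a factor of ten.
for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% over 30 days: {error_budget_minutes(slo):.1f} minutes")
```

At 99.9% over 30 days the budget is roughly 43 minutes, which is one slow incident response.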

Core

Prometheus & Grafana

The de-facto OSS metrics stack. PromQL, exporters, recording rules.

External Prometheus: official

External Grafana: docs External PromQL tutorial: Robust Perception

Core

OpenTelemetry

The vendor-neutral standard for traces, metrics, and logs. Auto-instrumentation.

External OpenTelemetry: official

External CNCF: OpenTelemetry intro

Log aggregation

ELK, Loki, OpenSearch. Centralised logs are the cheapest debugging upgrade you can buy.

External Elastic Stack: guide

External Grafana Loki: docs External Vector: log pipeline

APM & error tracking

Datadog, New Relic, Sentry. The hosted layer that often saves you a Friday night.

External Sentry: official docs

External APM docs (Datadog) External APM docs (New Relic) External docs (Honeycomb)

Profiling

pprof, perf, flame graphs. Finding the hot path is half the work.

Profiling deep dive

External flame graphs (Brendan Gregg) External pprof: docs

Latency budgets

p50 vs p99, the math that turns "make it faster" into a concrete number.

Latency budgets

Paper Tail at scale: Dean & Barroso External Latency numbers every programmer should know
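
Why p99 matters more than p50 once requests fan out, as worked arithmetic. If one request waits on several backends in parallel and each has an independent 1% chance of a slow response, the chance that at least one is slow grows quickly. Independence is the assumption here; the function name is illustrative.

```python
def p_any_slow(fan_out: int, slow_fraction: float = 0.01) -> float:
    """Probability that at least one of fan_out parallel calls is slow."""
    return 1 - (1 - slow_fraction) ** fan_out


for fan_out in (1, 10, 100):
    print(f"fan-out {fan_out:>3}: {p_any_slow(fan_out):.1%} of requests hit a slow call")
```

With a fan-out of 100, about 63% of user requests wait on at least one backend's slowest 1%, so the backend p99 is close to the user-facing median.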

Stage

System design: putting it all together

The interview, and the day job.

Reading about components and designing with them are different skills. You build the second one the same way you'd learn chess: by working through the canonical problems out loud. Chat, feed, URL shortener, object storage. Defend every choice. The book to read first is Designing Data-Intensive Applications.

Core

The design framework

Six steps in order: scope, estimate, API, data, high-level design, deepen. The repeatable forty-five-minute pass you can do in your sleep after enough practice. That fluency is the goal before the interview, not during.

Design framework

External System Design Primer (donnemartin) External ByteByteGo: system design newsletter

Core

Capacity planning

Little's Law, queueing primer, back-of-envelope. Turn a request rate into a count of cores.

Capacity planning

Queueing theory How to estimate cost

Core

Worked problem: Chat

Designing a Discord-shape system. WebSocket fleet for connections, message store for history, presence, delivery semantics, group fan-out. Every interesting distributed-systems trade-off shows up at least once.

Playbook Chat playbook

How WebSockets work External billions of messages (Discord)

Core

Worked problem: News feed

Fan-out on write is fast to read but expensive to write; fan-out on read is the opposite; the celebrity problem breaks the naive version of either. Many large feeds use a hybrid of the two, as Twitter described in 2013.

Playbook News feed playbook

External Twitter timelines at scale (2013) External TAO paper (Facebook social graph)

Core

Worked problem: URL shortener

The Hello World of distributed systems. Base62 keys, collision resistance, a CDN in front, the redirect path under ten milliseconds. Looks simple; gets interesting once the traffic is real.

Playbook URL shortener playbook

Distributed IDs
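
Base62 keys in a few lines: encode a numeric ID into a short string and decode it back. The alphabet order is an arbitrary choice made for this sketch, and it assumes IDs come from some upstream generator such as a counter or a distributed ID service.

```python
import string

# 0-9, a-z, A-Z: 62 URL-safe characters. The order is an arbitrary choice.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode(number: int) -> str:
    """Encode a non-negative integer ID as a Base62 key."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode(key: str) -> int:
    """Decode a Base62 key back to its integer ID."""
    number = 0
    for char in key:
        number = number * 62 + ALPHABET.index(char)
    return number


key = encode(125_000_000)
print(key, decode(key))
# Seven characters cover 62**7 IDs, about 3.5 trillion.
print(62 ** 7)
```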

Core

Worked problem: Object storage (S3)

What it takes to deliver eleven nines of durability at exabyte scale. Erasure coding instead of replication, metadata and data planes split, multipart upload for big objects, repair scanners running constantly in the background.

Playbook Object storage playbook

External Andy Warfield: Building S3 (re:Invent 2023)

Annotated classics

The dozen or so papers every senior engineer should have read once. Dynamo, Spanner, MapReduce, Bigtable, GFS, Lamport on time and clocks. Each introduced an idea that became the production default a decade later.

Paper Annotated papers

Paper Dynamo Paper Spanner Paper MapReduce Paper Bigtable Paper GFS Paper Time, clocks, events (Lamport)

Engineering blogs to follow

The case studies that drove every architectural pattern. Free.

External case studies (High Scalability)

External Netflix Tech Blog External Uber Engineering External LinkedIn Engineering External Meta Engineering External Cloudflare blog External AWS Builders' Library

Core

The book to read first

If you buy one technical book this year, make it Designing Data-Intensive Applications. Kleppmann distilled twenty years of distributed-systems research into something a working engineer can read on a long-haul flight.

External DDIA (Martin Kleppmann)

External Software Engineering at Google (free) External Google SRE book (free) External Google SRE workbook (free)

Practice mock interviews

Reading is not the same as defending. Pair up; rotate; speak aloud.

External Pramp: free mock interviews

External Interviewing.io: peer mocks External System Design Interview vol. 1 + 2 (Alex Xu)

Practice the worked problems

19 system-design walkthroughs: chat, feed, URL shortener, Twitter, Instagram, Netflix, object storage, rate limiter, more.

Open the playbook

Hands-on

Run the simulators

49 interactive simulators (Raft, CAP, sharding, caching, sorting, container layers), all in the browser.

Browse simulators

Decisions

The handbook

Twelve decision rules: when to shard, when to introduce a queue, how to estimate cost, what to cache.

Read the handbook

Drill

Interview prep

Time-boxed practice rounds: a 45-minute system-design simulator and a hundred concept flashcards across six categories. The endpoint of the roadmap.

Start a round

Sideways

Topics index

A concept index. Pick a topic, see every page that covers it. Useful when you want to drill on one concept across guides, simulators, and papers.

Browse topics