The full arc, from Linux and the network stack up through cloud primitives, VPCs, containers, infrastructure as code, observability, security, cost, and the architecture patterns that survive a region failure. Every stage is on the critical path. Each topic links to a Semicolony deep dive or simulator where one exists, and to a curated external resource where it doesn't. Follow the arc in order, or jump to wherever you're stuck.
Core plus the recommended layer. The optional stops stay hidden until you have shipped a couple of real systems.
The box, the wire, and the protocols on top.
The cloud is other people's Linux behind an API. Before the consoles and the acronyms, get fluent in the machine itself — processes, the shell, and the TCP/DNS/TLS stack that every request you will ever debug rides on.
Filesystems, permissions, processes, and the pipe-and-grep toolchain. Every instance you launch boots into this; every weird production behaviour eventually drops you back here.
Operating systems codex
Handshakes, ports, timeouts, retransmits, and the request/response cycle on top. When a service "is slow", the answer is almost always somewhere in these two layers.
How TCP works
Records, TTLs, resolvers, and caching. Cloud platforms lean on DNS for service discovery, failover, and traffic shifting; a stale record can keep you down for up to the full cache lifetime.
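To make the stale-record point concrete, here is a minimal Python sketch of a TTL-bound resolver cache. The `TtlCache` class, the `lookup` callback, the hostname, and the addresses are all hypothetical; real resolvers do considerably more (negative caching, multiple record types, and so on).

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Toy resolver cache: answers are reused until their TTL runs out."""

    def __init__(self, lookup: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup   # hypothetical upstream: name -> (address, ttl_seconds)
        self._clock = clock     # injectable so the example is deterministic
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (address, expires_at)

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]    # still fresh: upstream is not consulted
        address, ttl = self._lookup(name)
        self._entries[name] = (address, now + ttl)
        return address


# Simulated failover: the upstream answer changes while the old one is cached.
fake_now = [0.0]
records = {"api.example.internal": ("10.0.0.1", 60.0)}
cache = TtlCache(lambda name: records[name], clock=lambda: fake_now[0])

first = cache.resolve("api.example.internal")          # cached until t=60
records["api.example.internal"] = ("10.0.0.2", 60.0)   # failover happens upstream
fake_now[0] = 30.0
during_ttl = cache.resolve("api.example.internal")     # still the old address
fake_now[0] = 61.0
after_ttl = cache.resolve("api.example.internal")      # entry expired, new address
print(first, during_ttl, after_ttl)
```

The client keeps hitting the old address until the entry expires, which is why a short TTL is usually set before a planned cutover.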
How DNS works
Certificates, the handshake, and the chain of trust. Everything in the cloud talks TLS; expired certs and broken chains are a recurring outage genre you should be able to diagnose on sight.
How HTTPS works
Key pairs, agents, jump hosts, and tunnels. Knowing exactly what happens when you SSH into a box is the baseline this whole roadmap builds from.
External OpenSSH manual pages
Compute, storage, and identity: the three you build everything from.
Strip away the hundred-service catalog and the cloud is three things: machines you rent, bytes you store, and an identity system deciding who may touch which. Learn these provider-agnostic, with AWS as the running example — the model ports.
Regions, availability zones, the API behind the console, and what the provider is actually selling you. The shared vocabulary every later stage assumes.
Cloud codex
Instances, machine images, autoscaling groups, and functions that only exist while they run. The spectrum from "a box you manage" to "a handler you upload", and what each end costs you.
Cloud compute
Object stores for blobs, block volumes for disks, file systems for shared mounts. Picking the wrong one is a rewrite; S3-style object storage is the one you will use most and understand least.
Cloud storage
Principals, policies, roles, and the evaluation logic that decides every API call. IAM is the cloud's real perimeter: most breaches are an over-broad policy, not a clever exploit.
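The evaluation logic can be sketched in a few lines of Python. This is a deliberately simplified model of AWS-style evaluation (explicit deny beats allow, and anything not allowed is denied); it ignores conditions, principals, permission boundaries, and resource policies. The `is_allowed` helper, the statement shape, and the bucket names are assumptions for illustration, and `fnmatchcase` only approximates IAM wildcard matching.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def is_allowed(statements: List[Dict[str, str]], action: str, resource: str) -> bool:
    """Simplified evaluation: explicit Deny wins, then Allow, else implicit deny."""
    allowed = False
    for stmt in statements:
        matches = (fnmatchcase(action, stmt["action"])
                   and fnmatchcase(resource, stmt["resource"]))
        if not matches:
            continue
        if stmt["effect"] == "Deny":
            return False        # an explicit deny overrides any allow
        if stmt["effect"] == "Allow":
            allowed = True      # keep scanning in case a deny also matches
    return allowed              # False here is the implicit default deny


# Hypothetical policy: broad allow on one bucket, with deletes carved out.
policy = [
    {"effect": "Allow", "action": "s3:*", "resource": "arn:aws:s3:::app-logs/*"},
    {"effect": "Deny", "action": "s3:DeleteObject", "resource": "arn:aws:s3:::app-logs/*"},
]

can_read = is_allowed(policy, "s3:GetObject", "arn:aws:s3:::app-logs/2024/01.log")
can_delete = is_allowed(policy, "s3:DeleteObject", "arn:aws:s3:::app-logs/2024/01.log")
other_bucket = is_allowed(policy, "s3:GetObject", "arn:aws:s3:::other-bucket/key")
print(can_read, can_delete, other_bucket)
```

The wildcard `s3:*` is what makes the allow broad; the deny statement is the only thing standing between this role and deleted logs.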
Cloud identity & access
RDS-style managed relational, DynamoDB-style managed NoSQL. You trade money and some control for backups, patching, and failover you no longer carry a pager for.
Cloud databases
Your own slice of the network, drawn in route tables.
A VPC is the networking you learned in stage one, rebuilt as API objects: subnets, route tables, gateways, and firewalls. Get the topology right early — re-plumbing a production VPC is the cloud equivalent of moving a house.
Address blocks, public vs private subnets, and the route tables that decide where a packet goes next. The diagram you should be able to draw from memory before anything ships.
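A small Python sketch of how a route table picks the next hop, using the standard `ipaddress` module and longest-prefix matching. The CIDR blocks and the target names (`igw-example`, `nat-example`) are made up for illustration; the only difference between the "public" and "private" subnet here is where the default route points.

```python
import ipaddress
from typing import List, Tuple

Route = Tuple[str, str]  # (destination CIDR, target)


def next_hop(routes: List[Route], destination: str) -> str:
    """Return the target of the most specific route that contains the address."""
    addr = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if addr in ipaddress.ip_network(cidr)
    ]
    if not candidates:
        raise LookupError(f"no route to {destination}")
    _, target = max(candidates, key=lambda item: item[0].prefixlen)
    return target


# Hypothetical VPC: one /16 carved into /24 subnets.
vpc = ipaddress.ip_network("10.0.0.0/16")
first_two_subnets = [str(net) for net in list(vpc.subnets(new_prefix=24))[:2]]

public_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "igw-example")]
private_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-example")]

print(first_two_subnets)
print(next_hop(public_routes, "10.0.1.15"))       # stays inside the VPC
print(next_hop(public_routes, "93.184.216.34"))   # leaves via the internet gateway
print(next_hop(private_routes, "93.184.216.34"))  # leaves via NAT instead
```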
VPC networking
Stateful instance-level firewalls and stateless subnet-level ones. Default-deny, open only what a workload provably needs, and treat the ruleset as reviewed code.
VPC deep dive
NAT gateways for outbound-only traffic, endpoints and peering for staying off the public internet. Also where surprise five-figure egress bills are born, so know what crosses what.
How NAT works
L4 vs L7, health checks, target groups, connection draining. The traffic cop in front of your fleet, and the first component to interrogate when half the requests fail.
How load balancing works
Hosted zones, health-checked failover, weighted and latency-based routing. Route 53-style DNS is the cheapest global traffic-management layer you will ever get.
Route 53
Package the app once, let a control loop run it.
Containers package the app and its world together; Kubernetes keeps a declared number of them running and reschedules around failure. Together they are how most cloud workloads actually ship — learn the mechanics, not just the YAML.
Namespaces, cgroups, and a layered filesystem — a kernel feature, not a small VM. Once that clicks, images, isolation limits, and "it needs privileged mode" all make sense.
How containers work
Each Dockerfile line is a cached layer. Order them wrong and every build re-downloads the world; order them right and builds take seconds and images stay small.
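Why ordering matters can be shown with a toy model of the build cache in Python. This is not Docker's implementation; it only assumes that a step's cache key depends on its parent's key, its instruction text, and (for `COPY`) the content being copied. The helper names and the digest strings are hypothetical.

```python
import hashlib
from typing import Dict, List


def cache_keys(steps: List[str], file_digests: Dict[str, str]) -> List[str]:
    """Each step's key chains the parent key, the instruction, and copied content."""
    keys = []
    parent = ""
    for step in steps:
        material = parent + "\n" + step
        if step.startswith("COPY "):
            source = step.split()[1]
            material += "\n" + file_digests[source]  # content change busts the key
        parent = hashlib.sha256(material.encode()).hexdigest()
        keys.append(parent)
    return keys


def reused_steps(old: List[str], new: List[str]) -> int:
    """Count leading steps whose keys are unchanged, i.e. served from cache."""
    count = 0
    for a, b in zip(old, new):
        if a != b:
            break
        count += 1
    return count


bad_order = ["FROM python:3.12-slim", "COPY . /app",
             "RUN pip install -r /app/requirements.txt"]
good_order = ["FROM python:3.12-slim", "COPY requirements.txt /app/",
              "RUN pip install -r /app/requirements.txt", "COPY . /app"]

before = {".": "src-v1", "requirements.txt": "deps-v1"}
after = {".": "src-v2", "requirements.txt": "deps-v1"}  # only source code changed

bad_reused = reused_steps(cache_keys(bad_order, before), cache_keys(bad_order, after))
good_reused = reused_steps(cache_keys(good_order, before), cache_keys(good_order, after))
print(bad_reused, good_reused)
```

In the first ordering a source edit invalidates the dependency install; in the second, only the final copy reruns.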
Sim Container layers
You declare desired state; controllers chase it forever. Follow one pod from kubectl apply to a running container and the whole system stops being magic.
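The "controllers chase desired state" idea, reduced to one pass of a toy reconcile function in Python. This is an illustration of the pattern only, not Kubernetes controller code; the function name, the pod naming scheme, and the `next_id` parameter are assumptions.

```python
from typing import List


def reconcile(desired_replicas: int, running: List[str], next_id: int) -> List[str]:
    """One pass of a toy controller: return the actions that close the gap."""
    actions = []
    missing = desired_replicas - len(running)
    if missing > 0:
        for offset in range(missing):
            actions.append(f"create pod-{next_id + offset}")
    elif missing < 0:
        for name in running[missing:]:   # drop the surplus from the end
            actions.append(f"delete {name}")
    return actions


scale_up = reconcile(3, ["pod-1"], next_id=2)
steady = reconcile(3, ["pod-1", "pod-2", "pod-3"], next_id=4)
scale_down = reconcile(1, ["pod-1", "pod-2"], next_id=3)
print(scale_up, steady, scale_down)
```

A real controller runs this comparison continuously, so a crashed pod is just another gap that the next pass closes.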
Pod creation, step by step
How a stable virtual IP finds an ephemeral pod, and how outside traffic gets in. The layer most newcomers get burned by, usually at the worst time.
K8s networking
Rolling updates, requests and limits, and what happens when a node fills up. Get requests wrong and you either waste money or watch your pods get evicted.
Sim K8s rollout
Infra in a diff you can review, deploys with no hands on them.
Click-ops cannot be reviewed, repeated, or rolled back. Terraform turns the infrastructure into versioned code with a dry-run; a pipeline turns shipping into a non-event. Together they are the difference between operating and improvising.
Describe the end state, read the computed diff, then apply it. The plan is your safety net before anything mutates — never skip reading it.
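The describe, diff, apply cycle can be illustrated with a tiny Python function that compares a desired configuration with recorded state. This is a sketch of the idea, not Terraform's planner or its exact output; the resource addresses and attributes are invented, and the `+`, `-`, `~` prefixes are only loosely modeled on the symbols a plan prints.

```python
from typing import Dict, List

Resources = Dict[str, Dict[str, str]]  # address -> attributes


def plan(desired: Resources, current: Resources) -> List[str]:
    """Compute the actions needed to move `current` to `desired` without applying them."""
    lines = []
    for address in sorted(desired.keys() | current.keys()):
        if address not in current:
            lines.append(f"+ create {address}")
        elif address not in desired:
            lines.append(f"- destroy {address}")
        elif desired[address] != current[address]:
            lines.append(f"~ update {address}")
    return lines


# Hypothetical state: one bucket to add, one instance to resize, one queue to drop.
current_state = {
    "instance.web": {"size": "small"},
    "queue.legacy": {"retention": "4d"},
}
desired_state = {
    "instance.web": {"size": "medium"},
    "bucket.assets": {"versioning": "on"},
}

planned = plan(desired_state, current_state)
print("\n".join(planned))
```

The `- destroy` line is the one worth reading twice before anything is applied.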
External Terraform docs
Terraform tracks reality in a state file; without locking, two engineers applying at once can overwrite each other's changes and corrupt it. Remote state with locking is non-negotiable on a team.
External Terraform state
Parameterised building blocks instead of copy-pasted resources, promoted through dev, staging, and prod. The difference between a platform and a pile.
External Terraform modules
Stages, artifacts, caching, and quality gates. A green build should mean exactly one thing: this commit is safe to ship, including the infrastructure changes.
External GitHub Actions docs
Blue-green, canary, rolling, feature flags. How you change running production without making users your test suite.
Sim K8s rollout
See the system, define "working", survive the page.
You cannot operate what you cannot see, and you cannot defend "reliable" without a number. Metrics, logs, and traces tell you what is happening; SLOs and error budgets decide what to do about it; incident practice keeps 3am boring.
The three pillars and what each is for: metrics to alert on, logs to investigate with, traces to find where the latency went. Conflating them gets expensive fast.
Observability codex
Pick the signal users feel, set a target, and the gap between that target and 100% becomes your budget for risk. Spend it on shipping until it runs out, then slow down, by agreement rather than argument.
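The budget arithmetic is short enough to write down. A minimal Python sketch, assuming an illustrative 99.9% target over a 30-day window and made-up request counts; the function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full downtime the target tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if blown)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


minutes = error_budget_minutes(0.999, 30)
remaining = budget_remaining(0.999, total_requests=1_000_000, failed_requests=250)
print(round(minutes, 1), round(remaining, 2))
```

With these numbers the budget is roughly 43 minutes a month, and 250 failures out of a million requests has spent about a quarter of it.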
External SRE Workbook: implementing SLOs
Clear roles, a calm channel, mitigation before diagnosis, then a blameless writeup that turns one outage into organisational memory instead of a witch hunt.
External SRE Book: managing incidents
Alert on symptoms users feel, route by severity, and tune relentlessly. A page that fires hourly gets muted, and a muted alert is no alert.
External Prometheus alerting
Follow one request across a dozen services with propagated context. The only honest answer to "which hop is slow" once you have more than three of them.
OpenTelemetry & tracing
Timeouts, retries with backoff, circuit breakers, bulkheads. The defensive moves that stop one slow dependency from cascading into a full outage.
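Of these moves, the circuit breaker is the least obvious, so here is a minimal single-threaded Python sketch. The class, its parameters, and the thresholds are assumptions for illustration; production libraries add thread safety, metrics, and more careful half-open handling.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_after: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after          # seconds to wait before a trial call
        self._clock = clock                     # injectable for deterministic tests
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; dependency presumed down")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0                       # success closes the circuit
        self._opened_at = None
        return result
```

Once the threshold is hit, callers get an immediate `CircuitOpenError` instead of queueing behind a dependency that is already timing out, which is what stops the cascade.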
Sim Circuit breaker
Least privilege on the access, a number on the bill.
The two ways cloud projects quietly fail: an over-broad role that becomes the breach, and a bill nobody can explain. Both are engineering problems with the same fix — make access and spend visible, scoped, and reviewed like code.
Scope policies to what a workload provably needs, prefer roles over long-lived keys, and audit who can assume what. One wildcard policy can be the entire incident.
Cloud identity & access
Envelope encryption, key policies, and a secrets manager instead of env files in a repo. The bar is "encrypted unless there is a written reason not to."
KMS & secrets
The provider secures the infrastructure; everything you configure on top is yours. Public buckets and open security groups leak more data than zero-days do.
Security codex
Compute, storage, and egress make up most of every bill. Tagging, right-sizing, reserved and spot capacity: the bill is a production metric, so treat it like one.
Cloud cost
Designing systems that survive the bad day.
Everything so far was about running one system well. Architecture is about what happens when an AZ disappears, traffic triples, or a whole region goes dark — and about the queues, caches, and CDNs that buy you headroom before any of that.
Spread across availability zones, remove single points of failure, and know your real blast radius. HA is a topology decision first and a product feature second.
System design codex
A queue between two services absorbs bursts, smooths retries, and decouples deploys, at the price of eventual consistency and a dead-letter queue to babysit.
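The retry and dead-letter behaviour can be modeled in memory with a few lines of Python. This is a sketch of the pattern, not any particular broker's API; `drain`, `max_attempts`, and the message bodies are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def drain(messages: List[str], handler: Callable[[str], None],
          max_attempts: int = 3) -> Tuple[List[str], List[str]]:
    """Process every message, retrying failures and parking repeat offenders."""
    queue: Deque[Tuple[str, int]] = deque((m, 0) for m in messages)
    done: List[str] = []
    dead_letter: List[str] = []
    while queue:
        body, attempts = queue.popleft()
        try:
            handler(body)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                dead_letter.append(body)        # give up; a human looks at it later
            else:
                queue.append((body, attempts))  # redeliver behind newer messages
            continue
        done.append(body)
    return done, dead_letter


seen: Dict[str, int] = {}


def handler(body: str) -> None:
    seen[body] = seen.get(body, 0) + 1
    if body == "poison":
        raise ValueError("can never be processed")
    if body == "flaky" and seen[body] == 1:
        raise TimeoutError("transient failure on first delivery")


done, dead_letter = drain(["a", "flaky", "poison", "b"], handler)
print(done, dead_letter)
```

Note that the retried message finishes after a message that was sent later: redelivery trades strict ordering for resilience, which is one face of the consistency cost.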
Message queues
Push static assets to the edge and cache the hot path everywhere else. Latency you remove at the edge is latency you never have to engineer away in the backend.
How a CDN works
Backups you have actually restored, an RTO and RPO someone signed off on, and a strategy (pilot light, warm standby, active-active) matched to what downtime really costs.
External AWS: disaster recovery workloads
Data replication, traffic routing, and the consistency trade-offs that come with two sources of truth. The hardest version of every problem in this roadmap, all at once.
Multi-region patterns