8 stages · 47 topics · 29 core

Roadmap

Become a cloud engineer.

The full arc — from Linux and the network stack up through cloud primitives, VPCs, containers, infrastructure as code, observability, security, cost, and the architecture patterns that survive a region failure. Every stage is on the critical path. Each topic links to a Semicolony deep dive or simulator where one exists, and to a curated external resource where it doesn't. Follow the arc in order, or jump to wherever you're stuck.


Core (the spine) · Recommended (strong upside) · Optional (pick if relevant)

Core plus the recommended layer. The optional stops stay hidden until you have shipped a couple of real systems.

Jump to a stage

01 Foundations — Linux & the network
02 Core cloud primitives
03 VPC & cloud networking
04 Containers & Kubernetes
05 Infrastructure as code & CI/CD
06 Observability & reliability
07 Security & cost
08 Architecture — HA, DR & multi-region

Stage

Foundations — Linux & the network

The box, the wire, and the protocols on top.

The cloud is other people's Linux behind an API. Before the consoles and the acronyms, get fluent in the machine itself — processes, the shell, and the TCP/DNS/TLS stack that every request you will ever debug rides on.

Core

Linux & the shell

Filesystems, permissions, processes, and the pipe-and-grep toolchain. Every instance you launch boots into this; every weird production behaviour eventually drops you back here.

Operating systems codex

Sim Linux boot External GNU Bash manual

Core

TCP/IP & HTTP

Handshakes, ports, timeouts, retransmits, and the request/response cycle on top. When a service "is slow", the answer is almost always somewhere in these two layers.

How TCP works

HTTP, the shape of it Networking codex

Core

DNS

Records, TTLs, resolvers, and caching. Cloud platforms lean on DNS for service discovery, failover, and traffic shifting, so a stale record can keep you down for as long as resolvers still cache it, which is up to the record's TTL.

How DNS works

Sim DNS resolution DNS, explained simply
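
The TTL behaviour described above can be sketched as a tiny cache. This is a minimal illustration, not how any particular resolver is implemented: the class name `TtlCache`, the hostname, and the injectable `clock` parameter are all hypothetical, chosen so the expiry can be shown without waiting.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Toy resolver cache: an answer is reused until its TTL runs out."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # name -> (address, absolute expiry time)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, name: str, address: str, ttl_seconds: float) -> None:
        self._entries[name] = (address, self._clock() + ttl_seconds)

    def get(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: the caller must re-resolve upstream
            return None
        return address


if __name__ == "__main__":
    now = [0.0]  # fake clock so the example runs instantly
    cache = TtlCache(clock=lambda: now[0])
    cache.put("api.example.test", "10.0.0.1", ttl_seconds=300)
    now[0] = 299.0
    print(cache.get("api.example.test"))  # still the old address
    now[0] = 300.0
    print(cache.get("api.example.test"))  # None: the record has aged out
```

Even if the authoritative record changes at second 1, this cache keeps handing out the old address until second 300. That is the argument for lowering TTLs ahead of a planned cutover.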

Core

TLS & HTTPS

Certificates, the handshake, and the chain of trust. Everything in the cloud talks TLS; expired certs and broken chains are a recurring outage genre you should be able to diagnose on sight.

How HTTPS works

Security codex External Let's Encrypt docs
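
Expired certificates are easy to check for mechanically. The sketch below assumes you already have a certificate's `notAfter` string in the format Python's `ssl.SSLSocket.getpeercert()` returns; the helper name `days_until_expiry` is hypothetical. It only covers expiry, not chain validation.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` looks like "Jan 11 00:00:00 2030 GMT", the format used in
    the dict returned by ssl.SSLSocket.getpeercert().
    `now` should be a timezone-aware datetime.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expires_at - now.timestamp()) / 86400


if __name__ == "__main__":
    today = datetime(2030, 1, 1, tzinfo=timezone.utc)
    remaining = days_until_expiry("Jan 11 00:00:00 2030 GMT", today)
    print(f"{remaining:.1f} days left")
```

A negative result means the certificate has already expired. A small positive one is the signal to alert on before it does.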

SSH, keys & remote machines

Key pairs, agents, jump hosts, and tunnels. Knowing exactly what happens when you SSH into a box is the baseline this whole roadmap builds from.

External OpenSSH manual pages

Operating systems codex

Stage

Core cloud primitives

Compute, storage, and identity — the three you build everything from.

Strip away the hundred-service catalog and the cloud is three things: machines you rent, bytes you store, and an identity system deciding who may touch which. Learn these in a provider-agnostic way, with AWS as the running example; the model carries over to other providers.

Core

The cloud mental model

Regions, availability zones, the API behind the console, and what the provider is actually selling you. The shared vocabulary every later stage assumes.

Cloud codex

AWS foundations External AWS documentation

Core

Compute — VMs, autoscaling & serverless

Instances, machine images, autoscaling groups, and functions that only exist while they run. The spectrum from "a box you manage" to "a handler you upload", and what each end costs you.

Cloud compute

EC2, EBS & AMIs Lambda execution model

Core

Storage — object, block & file

Object stores for blobs, block volumes for disks, file systems for shared mounts. Picking the wrong one is a rewrite; S3-style object storage is the one you will use most and understand least.

Cloud storage

S3 internals Sim S3 prefix sharding

Core

IAM — identity & access

Principals, policies, roles, and the evaluation logic that decides every API call. IAM is the cloud's real perimeter — most breaches are an over-broad policy, not a clever exploit.

Cloud identity & access

IAM, the advanced parts External AWS IAM best practices
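
The core of the evaluation logic mentioned above fits in a few lines: an explicit deny wins, otherwise any matching allow permits the call, otherwise the call is denied by default. The sketch below is a deliberate simplification. It ignores conditions, permission boundaries, resource-based policies, and organisation-level controls, and it matches wildcards case-sensitively with `fnmatchcase`. The statement shape and the function name `is_allowed` are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

Statement = Dict[str, str]  # keys: "effect", "action", "resource"


def is_allowed(statements: List[Statement], action: str, resource: str) -> bool:
    """Explicit Deny wins, then any Allow, otherwise implicit deny."""
    allowed = False
    for stmt in statements:
        matches = fnmatchcase(action, stmt["action"]) and fnmatchcase(
            resource, stmt["resource"]
        )
        if not matches:
            continue
        if stmt["effect"] == "Deny":
            return False  # an explicit deny overrides every allow
        if stmt["effect"] == "Allow":
            allowed = True
    return allowed  # False here means nothing allowed it (implicit deny)


if __name__ == "__main__":
    policy = [
        {"effect": "Allow", "action": "s3:*", "resource": "arn:aws:s3:::logs/*"},
        {"effect": "Deny", "action": "s3:DeleteObject", "resource": "arn:aws:s3:::logs/*"},
    ]
    print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::logs/app.log"))
    print(is_allowed(policy, "s3:DeleteObject", "arn:aws:s3:::logs/app.log"))
```

The `s3:*` allow in the example is exactly the kind of over-broad grant the text warns about. The deny statement narrows it, but a tighter allow would be the cleaner fix.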

Managed databases

RDS-style managed relational, DynamoDB-style managed NoSQL. You trade money and some control for backups, patching, and failover you no longer carry a pager for.

Cloud databases

Aurora & RDS DynamoDB internals

Stage

VPC & cloud networking

Your own slice of the network, drawn in route tables.

A VPC is the networking you learned in stage one, rebuilt as API objects: subnets, route tables, gateways, and firewalls. Get the topology right early — re-plumbing a production VPC is the cloud equivalent of moving a house.

Core

VPCs, subnets & route tables

Address blocks, public vs private subnets, and the route tables that decide where a packet goes next. The diagram you should be able to draw from memory before anything ships.

VPC networking

Sim VPC packet flow Cloud networking
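
How a route table "decides where a packet goes next" comes down to longest-prefix matching: the most specific route containing the destination wins. A minimal sketch using Python's standard `ipaddress` module follows. The route targets (`"local"`, `"nat-gateway"`) and the function name `next_hop` are illustrative labels, not real API identifiers.

```python
import ipaddress
from typing import Dict, Optional


def next_hop(route_table: Dict[str, str], destination: str) -> Optional[str]:
    """Return the target of the most specific route that contains `destination`."""
    dest = ipaddress.ip_address(destination)
    best_net = None
    best_target = None
    for cidr, target in route_table.items():
        net = ipaddress.ip_network(cidr)
        # Longest prefix wins: /16 beats /0 when both contain the address.
        if dest in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_target = net, target
    return best_target  # None means no route: the packet is dropped


if __name__ == "__main__":
    private_subnet_routes = {
        "10.0.0.0/16": "local",       # traffic inside the VPC
        "0.0.0.0/0": "nat-gateway",   # everything else leaves via NAT
    }
    print(next_hop(private_subnet_routes, "10.0.3.7"))
    print(next_hop(private_subnet_routes, "8.8.8.8"))
```

In this model, the difference between a public and a private subnet is only where the `0.0.0.0/0` route points.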

Core

Security groups & network ACLs

Stateful instance-level firewalls and stateless subnet-level ones. Default-deny, open only what a workload provably needs, and treat the ruleset as reviewed code.

VPC deep dive

VPC networking External AWS VPC security best practices

Core

NAT & private connectivity

NAT gateways for outbound-only traffic, endpoints and peering for staying off the public internet. Also where surprise five-figure egress bills are born — know what crosses what.

How NAT works

Sim VPC packet flow Cloud networking

Core

Load balancers

L4 vs L7, health checks, target groups, connection draining. The traffic cop in front of your fleet — and the first component to interrogate when half the requests fail.

How load balancing works

Sim Load balancer AWS load balancing
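
The interaction between health checks and target selection can be shown with a toy round-robin balancer. This is a sketch under simple assumptions: health is reported from outside, there is no connection draining, and there are no weights. The class and method names are hypothetical.

```python
from typing import Dict, List, Optional


class RoundRobinBalancer:
    """Rotate through targets, skipping any that failed their health check."""

    def __init__(self, targets: List[str]) -> None:
        self._targets = list(targets)
        self._healthy: Dict[str, bool] = {t: True for t in targets}
        self._next = 0

    def report_health(self, target: str, healthy: bool) -> None:
        self._healthy[target] = healthy

    def pick(self) -> Optional[str]:
        # Try each target at most once per call, starting from the rotation point.
        for _ in range(len(self._targets)):
            candidate = self._targets[self._next]
            self._next = (self._next + 1) % len(self._targets)
            if self._healthy[candidate]:
                return candidate
        return None  # nothing healthy: the caller should fail fast


if __name__ == "__main__":
    lb = RoundRobinBalancer(["a", "b", "c"])
    print([lb.pick() for _ in range(3)])
    lb.report_health("b", False)
    print([lb.pick() for _ in range(4)])
```

If half of all requests fail, one place to look is whether an unhealthy target is still marked healthy and therefore still in rotation.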

Cloud DNS & traffic routing

Hosted zones, health-checked failover, weighted and latency-based routing. Route 53-style DNS is the cheapest global traffic-management layer you will ever get.

Route 53

How DNS works

Stage

Containers & Kubernetes

Package the app once, let a control loop run it.

Containers package the app and its world together; Kubernetes keeps a declared number of them running and reschedules around failure. Together they are how most cloud workloads actually ship — learn the mechanics, not just the YAML.

Core

What a container actually is

Namespaces, cgroups, and a layered filesystem — a kernel feature, not a small VM. Once that clicks, images, isolation limits, and "it needs privileged mode" all make sense.

How containers work

Sim Docker internals Containers, explained simply

Core

Images, layers & the build cache

Each Dockerfile instruction is a cacheable build step, and a change to one step invalidates every step after it. Order them wrong and every build re-downloads the world; order them right and builds take seconds and images stay small.

Sim Container layers

External Dockerfile reference External Multi-stage builds
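
Why instruction order matters can be modelled with chained cache keys: each step's key depends on its parent's key, so one changed step invalidates everything after it. The model below is a toy, not the actual key derivation used by Docker or BuildKit. The fingerprint strings such as `"src-v1"` stand in for a hash of the files a step copies.

```python
import hashlib
from typing import List, Tuple

Step = Tuple[str, str]  # (instruction text, fingerprint of the files it reads)


def layer_keys(steps: List[Step]) -> List[str]:
    """Chain one cache key per step; each key folds in the previous key."""
    keys: List[str] = []
    parent = ""
    for instruction, inputs in steps:
        digest = hashlib.sha256(f"{parent}|{instruction}|{inputs}".encode()).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


def cache_hits(previous: List[Step], current: List[Step]) -> List[bool]:
    """True for each step in `current` whose key matches the previous build."""
    old = layer_keys(previous)
    new = layer_keys(current)
    return [i < len(old) and old[i] == key for i, key in enumerate(new)]


if __name__ == "__main__":
    install = ("RUN pip install -r requirements.txt", "")
    # Source copied first: any code change busts the dependency install.
    bad_v1 = [("FROM python:3.12", ""), ("COPY . .", "src-v1"), install]
    bad_v2 = [("FROM python:3.12", ""), ("COPY . .", "src-v2"), install]
    print(cache_hits(bad_v1, bad_v2))
    # Dependencies first: a code change only rebuilds the final copy.
    good_v1 = [("FROM python:3.12", ""), ("COPY requirements.txt .", "req-v1"),
               install, ("COPY . .", "src-v1")]
    good_v2 = good_v1[:3] + [("COPY . .", "src-v2")]
    print(cache_hits(good_v1, good_v2))
```

In the second ordering a source-only change reuses the dependency install, which is usually the slowest step.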

Core

The Kubernetes model — pods & reconciliation

You declare desired state; controllers chase it forever. Follow one pod from kubectl apply to a running container and the whole system stops being magic.

Pod creation, step by step

Kubernetes codex External Kubernetes concepts
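
"Declare desired state; controllers chase it" reduces to a loop that compares desired with observed and emits whatever actions close the gap. The function below is a single pass of such a loop, written as a pure function for clarity. It is a hypothetical sketch, not Kubernetes code: real controllers watch the API server, and the pod names and action strings here are invented.

```python
from typing import List


def reconcile(desired_replicas: int, running: List[str], next_id: int) -> List[str]:
    """One pass of a control loop: return the actions that close the gap."""
    gap = desired_replicas - len(running)
    if gap > 0:
        # Too few pods: create the missing ones.
        return [f"create pod-{next_id + i}" for i in range(gap)]
    if gap < 0:
        # Too many pods: delete the surplus.
        return [f"delete {name}" for name in running[:-gap]]
    return []  # observed state already matches desired state


if __name__ == "__main__":
    print(reconcile(3, ["pod-1"], next_id=2))
    print(reconcile(1, ["pod-1", "pod-2", "pod-3"], next_id=4))
    print(reconcile(2, ["pod-1", "pod-2"], next_id=3))
```

Because the function looks only at the current gap, running it again after a crash or a partial failure still converges. That convergence is the property the reconciliation model is built around.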

Services, ingress & cluster networking

How a stable virtual IP finds an ephemeral pod, and how outside traffic gets in. The layer most newcomers get burned by, usually at the worst time.

K8s networking

Service discovery External Kubernetes services

Rollouts, scheduling & resource limits

Rolling updates, requests and limits, and what happens when a node fills up. Get requests wrong and you either waste money or watch your pods get evicted.

Sim K8s rollout

Sim Container scheduler Sim Pod eviction

Stage

Infrastructure as code & CI/CD

Infra in a diff you can review, deploys with no hands on them.

Click-ops cannot be reviewed, repeated, or rolled back. Terraform turns the infrastructure into versioned code with a dry-run; a pipeline turns shipping into a non-event. Together they are the difference between operating and improvising.

Core

Terraform — declare, plan, apply

Describe the end state, read the computed diff, then apply it. The plan is your safety net before anything mutates — never skip reading it.

External Terraform docs

External Terraform tutorials Cloud codex
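
The idea of a plan, a computed diff between what you declared and what is recorded as existing, can be illustrated in a few lines. This is a conceptual sketch only and not Terraform's algorithm, which also refreshes real infrastructure, orders changes by dependency, and may replace rather than update a resource. The resource addresses and attributes are made up.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, Dict[str, str]]  # resource address -> attributes


def plan(desired: Resources, state: Resources) -> List[Tuple[str, str]]:
    """Diff desired configuration against recorded state, without changing anything."""
    changes: List[Tuple[str, str]] = []
    for address in sorted(desired.keys() | state.keys()):
        if address not in state:
            changes.append(("create", address))
        elif address not in desired:
            changes.append(("delete", address))
        elif desired[address] != state[address]:
            changes.append(("update", address))
    return changes


if __name__ == "__main__":
    desired = {
        "aws_s3_bucket.logs": {"acl": "private"},
        "aws_instance.web": {"type": "t3.small"},
    }
    state = {
        "aws_instance.web": {"type": "t3.micro"},
        "aws_instance.old": {"type": "t3.micro"},
    }
    for action, address in plan(desired, state):
        print(f"{action:7} {address}")
```

The `delete` line for a resource that was merely removed from the configuration is the kind of entry worth catching when reading a plan before applying it.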

Core

State, backends & locking

Terraform tracks reality in a state file. Without locking, two engineers applying at once can corrupt it, and a corrupted state file fights everyone who touches it afterwards. Remote state with locking is non-negotiable on a team.

External Terraform state

External Remote backends

Core

Modules & multi-environment

Parameterised building blocks instead of copy-pasted resources, promoted through dev, staging, and prod. The difference between a platform and a pile.

External Terraform modules

External Terraform workspaces

Core

CI/CD pipelines

Stages, artifacts, caching, and quality gates. A green build should mean exactly one thing: this commit is safe to ship — including the infrastructure changes.

External GitHub Actions docs

External GitLab CI/CD docs How Git works

Deployment strategies

Blue-green, canary, rolling, feature flags. How you change running production without making users your test suite.

Sim K8s rollout

System design codex External Argo Rollouts docs

Stage

Observability & reliability

See the system, define "working", survive the page.

You cannot operate what you cannot see, and you cannot defend "reliable" without a number. Metrics, logs, and traces tell you what is happening; SLOs and error budgets decide what to do about it; incident practice keeps 3am boring.

Core

Metrics, logs & traces

The three pillars and what each is for: metrics to alert on, logs to investigate with, traces to find where the latency went. Conflating them gets expensive fast.

Observability codex

Logs, metrics & traces External Prometheus docs

Core

SLIs, SLOs & error budgets

Pick the signal users feel, set a target, and the gap becomes your budget for risk. Spend it on shipping until it runs out, then slow down — by agreement, not argument.

External SRE Workbook — implementing SLOs

External Google SRE Book Cloud observability
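
The error-budget arithmetic is simple enough to write down. The sketch assumes an availability-style SLO expressed as a fraction (0.999 for "three nines"). Both function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or zero traffic) leaves no room for any failure.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999, 30):.1f} minutes per 30 days")
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget left")
```

A 99.9% target over 30 days works out to roughly 43 minutes of downtime. With 250 failures out of a million requests, about a quarter of the request budget is spent.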

Core

Incident response & postmortems

Clear roles, a calm channel, mitigation before diagnosis — then a blameless writeup that turns one outage into organisational memory instead of a witch hunt.

External SRE Book — managing incidents

External SRE Book — postmortem culture Handbook

Alerting that people trust

Alert on symptoms users feel, route by severity, and tune relentlessly. A page that fires hourly gets muted, and a muted alert is no alert.

External Prometheus alerting

External SRE Workbook — alerting on SLOs

Distributed tracing & OpenTelemetry

Follow one request across a dozen services with propagated context. The only honest answer to "which hop is slow" once you have more than three of them.

OpenTelemetry & tracing

External OpenTelemetry docs

Resilience patterns

Timeouts, retries with backoff, circuit breakers, bulkheads. The defensive moves that stop one slow dependency from cascading into a full outage.

Sim Circuit breaker

Sim Retry strategy Orchestration & resiliency
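
Of the patterns listed above, retries with backoff are the easiest to get subtly wrong. Below is a minimal sketch of capped exponential backoff with "full jitter", where each wait is a random duration up to the exponential bound. The parameter names and defaults are assumptions for illustration, and `sleep` and `rng` are injectable so the behaviour can be exercised without real delays.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.1, cap: float = 5.0) -> List[float]:
    """Upper bound for the wait before each retry: base * 2^n, capped."""
    return [min(cap, base * (2 ** n)) for n in range(retries)]


def retry(
    operation: Callable[[], T],
    attempts: int = 4,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call `operation` up to `attempts` times, backing off between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    rng = rng or random.Random()
    delays = backoff_delays(attempts - 1, base, cap)
    for attempt in range(attempts - 1):
        try:
            return operation()
        except Exception:
            # Full jitter: wait a random time up to the exponential bound,
            # so many clients do not retry in lockstep.
            sleep(rng.uniform(0, delays[attempt]))
    return operation()  # final attempt: let any error propagate


if __name__ == "__main__":
    print(backoff_delays(5, base=1.0, cap=10.0))
```

Catching every `Exception` keeps the example short. In real code only errors known to be transient should be retried, and the operation needs to be idempotent.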

Stage

Security & cost

Least privilege on the access, a number on the bill.

The two ways cloud projects quietly fail: an over-broad role that becomes the breach, and a bill nobody can explain. Both are engineering problems with the same fix — make access and spend visible, scoped, and reviewed like code.

Core

Least privilege IAM

Scope policies to what a workload provably needs, prefer roles over long-lived keys, and audit who can assume what. One wildcard policy can be the entire incident.

Cloud identity & access

IAM, the advanced parts External AWS IAM best practices

Core

Encryption, KMS & secrets

Envelope encryption, key policies, and a secrets manager instead of env files in a repo. The bar is "encrypted unless there is a written reason not to."

KMS & secrets

External AWS KMS concepts Security codex

Core

The shared responsibility model

The provider secures the infrastructure; everything you configure on top is yours. Public buckets and open security groups leak more data than zero-days do.

Security codex

External AWS shared responsibility model

Core

Cost & FinOps basics

Compute, storage, and egress make up most of a typical bill. Tagging, right-sizing, reserved and spot capacity: the bill is a production metric, so treat it like one.

Cloud cost

How to estimate cost External FinOps Foundation

Stage

Architecture — HA, DR & multi-region

Designing systems that survive the bad day.

Everything so far was about running one system well. Architecture is about what happens when an AZ disappears, traffic triples, or a whole region goes dark — and about the queues, caches, and CDNs that buy you headroom before any of that.

Core

High availability & failure domains

Spread across availability zones, remove single points of failure, and know your real blast radius. HA is a topology decision first and a product feature second.

System design codex

External AWS Well-Architected — reliability pillar Cloud codex

Core

Queues & async work

A queue between two services absorbs bursts, smooths retries, and decouples deploys — at the price of eventual consistency and a dead-letter queue to babysit.

Message queues

When to introduce a queue AWS messaging
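
The dead-letter queue mentioned above exists because some messages will never succeed, and redelivering them forever blocks everything else. A minimal in-memory sketch of that policy follows. The function name `drain` and the `max_receives` parameter are hypothetical, and a real broker would add visibility timeouts and persistence.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def drain(
    messages: List[str],
    handler: Callable[[str], None],
    max_receives: int = 3,
) -> Tuple[List[str], List[str]]:
    """Process every message; return (processed, dead_lettered)."""
    queue: Deque[Tuple[str, int]] = deque((m, 0) for m in messages)
    processed: List[str] = []
    dead_letters: List[str] = []
    while queue:
        message, receives = queue.popleft()
        try:
            handler(message)
            processed.append(message)
        except Exception:
            receives += 1
            if receives >= max_receives:
                dead_letters.append(message)  # give up: park it for a human
            else:
                queue.append((message, receives))  # redeliver after the others
    return processed, dead_letters


if __name__ == "__main__":
    def handler(message: str) -> None:
        if message == "bad":
            raise ValueError("cannot parse")

    print(drain(["a", "bad", "b"], handler))
```

The poison message is attempted `max_receives` times and then set aside, while the healthy messages around it still get processed.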

Core

Caching & CDNs

Push static assets to the edge and cache the hot path everywhere else. Latency you remove at the edge is latency you never have to engineer away in the backend.

How a CDN works

How caching works Sim CloudFront cache

Disaster recovery

Backups you have actually restored, an RTO and RPO someone signed off on, and a strategy — pilot light, warm standby, active-active — matched to what downtime really costs.

External AWS — disaster recovery workloads

Multi-region patterns

Multi-region design

Data replication, traffic routing, and the consistency trade-offs that come with two sources of truth. The hardest version of every problem in this roadmap, all at once.

Multi-region patterns

Sim CAP theorem System design codex
