14 stages · 69 topics · 32 core

Roadmap

Become a DevOps engineer.

The full arc — from the shell up through pipelines, containers, clusters, cloud, and the reliability practices that keep it all standing. Every stage is on the critical path. Each topic links to a Semicolony deep dive or simulator where one exists, and to a curated external resource where it doesn't. Follow the arc in order, or jump to wherever you're stuck.

Also available: System Design Roadmap → 15 stages, with an interactive architecture diagram.

Also available: Backend Engineer Roadmap → 14 stages, from HTTP basics to distributed systems.

Core (the spine) · Recommended (strong upside) · Optional (pick if relevant)

Core plus the recommended layer. The optional stops stay hidden until you have shipped a couple of real systems.

Jump to a stage

01 Linux & the shell
02 Networking for ops
03 Version control & collaboration
04 CI/CD pipelines
05 Containers (Docker)
06 Container orchestration (Kubernetes)
07 Infrastructure as Code (Terraform)
08 Configuration & secrets
09 Cloud platforms
10 Observability
11 Scaling & load management
12 Reliability engineering
13 Security & compliance (DevSecOps)
14 Platform & developer experience

Stage

Linux & the shell

The system everything else runs on top of.

Almost every server you will ever touch is Linux. Get fluent in the filesystem, processes, signals, and the shell — these are the primitives you reach for at 3am when the dashboards are lying.

Core

Filesystem, permissions & processes

Paths, inodes, file descriptors, the process tree, and who is allowed to do what. The mental model that makes everything from Docker layers to systemd units make sense.

Operating systems codex

Sim Filesystem External The Linux man pages External kernel.org docs

Core

Shell scripting & the toolchain

bash, pipes, grep/sed/awk, exit codes, and trap. Glue you will write a hundred times before you reach for a real language.

External GNU Bash manual

External coreutils manual Operating systems codex

System calls & how programs talk to the kernel

Every read, write, and fork is a syscall. Knowing the boundary between user space and kernel is what separates guessing from diagnosing.

Sim Syscall journey

External syscalls(2) Operating systems codex
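
To make the user space / kernel boundary concrete, here is a minimal sketch using Python's `os` module, whose `pipe`, `write`, `read`, and `close` functions are thin wrappers over the system calls of the same names. The helper name `roundtrip_through_pipe` is made up for this example, and it assumes a payload small enough to fit in the kernel's pipe buffer.

```python
import os


def roundtrip_through_pipe(payload: bytes) -> bytes:
    """Send a small payload through a kernel pipe and read it back."""
    read_fd, write_fd = os.pipe()          # pipe(2): the kernel returns two file descriptors
    try:
        os.write(write_fd, payload)        # write(2): copy bytes into the kernel's pipe buffer
        os.close(write_fd)                 # close(2): the reader will now see end-of-file
        write_fd = -1
        chunks = []
        while True:
            chunk = os.read(read_fd, 4096)  # read(2): returns b"" at end-of-file
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        if write_fd != -1:
            os.close(write_fd)
        os.close(read_fd)
```

Running a script like this under a syscall tracer such as `strace` on Linux is one way to watch each call cross the boundary.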

Boot, init & systemd

From firmware (BIOS or UEFI) to a login prompt. Understand units, targets, and journald or you will fight your service manager instead of using it.

Sim Linux boot

External systemd documentation Operating systems codex

Stage

Networking for ops

Packets, ports, and why the request hung.

Distributed systems are networking with extra steps. You do not need a CCNA, but you must be able to reason about TCP, DNS, TLS, and the layers of indirection between a client and your pod.

Core

TCP/IP & the connection lifecycle

Handshakes, the receive window, timeouts, and retransmits. When latency spikes, this is the layer where the truth lives.

How TCP works

Sim TCP handshake Networking codex External RFC 9293 — TCP

Core

DNS & service resolution

The name-to-address layer that is somehow always the cause. Records, TTLs, caching, and why a stale entry takes you down for exactly the cache lifetime.

How DNS works

Sim DNS resolution Networking codex
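
The "down for exactly the cache lifetime" effect can be sketched with a tiny TTL cache. This is an illustration, not how any particular resolver is implemented: `TtlCache`, its `resolve` callback, and the injectable `clock` are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Cache name -> address answers until their TTL expires."""

    def __init__(self, resolve: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve            # upstream lookup (hypothetical callback)
        self._ttl = ttl_seconds
        self._clock = clock                # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, name: str) -> str:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]                # still fresh: upstream is never asked
        address = self._resolve(name)      # expired or missing: ask upstream again
        self._entries[name] = (address, now + self._ttl)
        return address
```

If the record changes upstream one second after it was cached, every caller keeps getting the old address until the TTL runs out. Short TTLs before a planned migration are the usual way to shrink that window.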

Core

TLS & HTTPS

Certificates, the handshake, SNI, and mutual TLS. Encryption in transit is table stakes; expired certs are a recurring outage genre.

How HTTPS works

Security codex External Let's Encrypt docs

NAT, VPCs & private networking

Subnets, route tables, security groups, and address translation. The cloud network model is just these primitives wearing a console.

How NAT works

VPC networking Sim VPC packet flow

Load balancers & reverse proxies

L4 vs L7, health checks, connection draining. The traffic cop in front of your fleet, and the first thing to blame when half the requests fail.

How load balancing works

Reverse proxy Sim Load balancer

Stage

Version control & collaboration

Git as the source of truth for everything you ship.

In DevOps, Git is not just for code — it is the audit log for your infrastructure, your pipelines, and your deploys. GitOps lives or dies on understanding what a commit actually is.

Core

Git internals & the object model

Blobs, trees, commits, refs. Once you see Git as a content-addressed store, rebases and merges stop being scary incantations.

How Git works

External Pro Git book
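
"Content-addressed" is easy to check by hand. The sketch below computes a blob's object ID the way Git does in a default SHA-1 repository: hash a short header plus the file contents. The function name `git_blob_id` is made up for this example; repositories using the SHA-256 object format would produce different IDs.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes "blob <size in bytes>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()
```

The result for a file can be compared against `git hash-object <file>`. Identical content always yields the identical ID, which is why Git can deduplicate files and detect corruption for free.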

Core

Branching strategies & PR workflow

Trunk-based vs GitFlow, short-lived branches, and review as a quality gate. The team contract that keeps main shippable.

External GitHub flow

External Pro Git — branching

GitOps — Git as the deployment trigger

Declare desired state in a repo and let a controller reconcile reality to match. The deploy mechanism that is also your rollback button.

External Argo CD docs

External Flux docs Kubernetes codex

Stage

CI/CD pipelines

From a commit to production, with no hands on it.

Continuous integration catches breakage early; continuous delivery makes shipping boring. The goal is a pipeline so trustworthy that deploying on a Friday afternoon is a non-event.

Core

Pipeline fundamentals

Stages, jobs, artifacts, caching, and fail-fast. A green build should mean exactly one thing: this is safe to ship.

External GitHub Actions docs

External GitLab CI/CD docs Handbook

Core

Build, test & quality gates

Unit, integration, lint, coverage thresholds. Gates that block bad code cheaply, before it costs you a rollback.

External GitHub Actions — building & testing

External GitLab CI testing

Core

Deployment strategies

Blue-green, canary, rolling, feature flags. How you change running production without making users your test suite.

Sim K8s rollout

System design codex External Argo Rollouts docs
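
One common way to run a canary or a percentage-based feature flag is to hash a stable identifier into a bucket. The sketch below is an assumption about how you might do it, not the method of any specific tool; `in_canary`, `user_id`, and the `salt` value are hypothetical.

```python
import hashlib


def in_canary(user_id: str, percent: int, salt: str = "rollout-1") -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in 0..99
    return bucket < percent
```

Because the bucket depends only on the salt and the ID, a user who sees the new version at 10% still sees it at 50%, and changing the salt reshuffles who goes first on the next rollout.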

Artifact registries & supply chain

Versioned images and packages, immutable and signed. Knowing exactly what bits are in production is the foundation of every later security claim.

External Docker registry docs

External SLSA framework Security codex

Stage

Containers (Docker)

Shipping the whole environment, not just the code.

Containers killed "works on my machine" by packaging the app and its world together. Under the hood they are just Linux namespaces and cgroups — once that clicks, the magic becomes mechanics.

Core

What a container actually is

Namespaces, cgroups, and a layered filesystem — not a tiny VM. The isolation is a kernel feature, not a hypervisor.

How containers work

Sim Docker internals Operating systems codex

Core

Images, layers & the build cache

Each Dockerfile line is a cached layer. Order them wrong and every build re-downloads the world; order them right and builds are seconds.

Sim Container layers

External Dockerfile reference External Docker build cache
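
Why ordering matters can be shown with a toy model of the cache. This is a simplification, not Docker's actual implementation: each layer's key is derived from its parent's key plus the instruction text, and a bracketed tag stands in for the contents of any copied files. `layer_keys` and `reusable_layers` are hypothetical names.

```python
import hashlib
from typing import List


def layer_keys(instructions: List[str]) -> List[str]:
    """Chain each instruction onto its parent's key, like a layered cache."""
    keys = []
    parent = ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode("utf-8")).hexdigest()
        keys.append(parent)
    return keys


def reusable_layers(old: List[str], new: List[str]) -> int:
    """Count leading layers the new build can take from the old build's cache."""
    count = 0
    for old_key, new_key in zip(layer_keys(old), layer_keys(new)):
        if old_key != new_key:
            break                      # the first miss invalidates everything after it
        count += 1
    return count
```

In this model, copying all source before installing dependencies means any source edit leaves only the base layer reusable. Copying the dependency manifest first keeps the install step cached across source edits.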

Multi-stage builds & slim images

Build in a fat image, ship in a tiny one. Smaller images mean faster pulls, smaller attack surface, fewer CVEs to triage.

External Multi-stage builds

External Docker best practices

Container networking & volumes

Bridge networks, port publishing, and persistent volumes. Where state lives when the container itself is disposable.

External Docker networking

External Docker volumes Sim Docker internals

Stage

Container orchestration (Kubernetes)

You declare the state; it keeps reality matching.

Kubernetes is a control loop that keeps your declared state and actual state in sync — and reschedules around failure so you do not have to. Steep curve, but it is the lingua franca of modern infra.

Core

Pods, the control plane & reconciliation

A controller watches declared state, compares it with actual state, and acts to close the gap. Everything in k8s is a variation on this theme.

Sim Pod creation

Kubernetes codex External Kubernetes concepts
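
A single pass of a control loop fits in a few lines. The sketch below is a toy, not Kubernetes code: `reconcile` is a hypothetical function that compares a desired replica count with the names of running pods and returns the actions one pass would take.

```python
from typing import List


def reconcile(desired: int, running: List[str]) -> List[str]:
    """Return the actions needed to move `running` toward `desired` replicas."""
    if desired < 0:
        raise ValueError("desired replica count cannot be negative")
    missing = desired - len(running)
    if missing > 0:
        return ["create"] * missing                      # too few: start more
    if missing < 0:
        surplus = sorted(running)[desired:]              # too many: pick a stable subset
        return [f"delete {name}" for name in surplus]
    return []                                            # converged: nothing to do
```

A real controller runs this comparison continuously, so a crashed pod just shows up as "too few" on the next pass.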

Core

Deployments, ReplicaSets & rollouts

Declare how many replicas you want and let the controller chase it. Rollouts and rollbacks become a single field change.

Sim K8s rollout

External Deployments Kubernetes codex

Core

Services, ingress & cluster networking

How a stable virtual IP finds an ephemeral pod, and how traffic gets in from outside. The networking layer most newcomers get burned by.

K8s networking

Service discovery External Kubernetes services

Scheduling, requests & limits

How pods land on nodes, and what happens when you over-commit. Get requests wrong and you either waste money or get evicted.

Sim Container scheduler

Sim Pod eviction External Resource management

Packaging with Helm & operators

Templated, versioned releases instead of a folder of YAML. Operators take it further — encoding the human runbook into a controller.

External Helm docs

External Operator pattern Kubernetes codex

Stage

Infrastructure as Code (Terraform)

Your whole cloud, in a diff you can review.

Click-ops does not scale and cannot be reviewed. IaC turns infrastructure into versioned, planned, peer-reviewed code — so the thing you deploy is the thing you read.

Core

Declarative infra & the plan/apply cycle

Describe the end state, let the tool compute the diff, then apply it. The plan is your dry-run safety net before anything mutates.

External Terraform docs

External Terraform tutorials Cloud codex
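
The "compute the diff" step can be pictured as a comparison of two maps. This is a conceptual sketch only, since Terraform's real planner also resolves dependencies between resources; `plan` and the resource names in the usage note are hypothetical.

```python
from typing import Any, Dict, List


def plan(desired: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, List[str]]:
    """Classify resources into create, update, and delete without changing anything."""
    return {
        "create": sorted(k for k in desired if k not in actual),
        "update": sorted(k for k in desired if k in actual and desired[k] != actual[k]),
        "delete": sorted(k for k in actual if k not in desired),
    }
```

For example, with `vm.web` declared as `small` but running as `medium`, a new `bucket.logs` in code, and a stray `vm.old` in the account, the plan lists one create, one update, and one delete. An empty plan means code and reality agree.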

Core

State, backends & locking

Terraform tracks reality in a state file. If that file is corrupted, or two engineers apply against it at once without a lock, the recorded state and the real infrastructure diverge. Remote state with locking is non-negotiable for teams.

External Terraform state

External Remote backends Sim Distributed lock

Modules & composition

Reusable, parameterised building blocks for your infra. The difference between a maintained platform and a pile of copy-pasted resources.

External Terraform modules

External Module registry

Drift, imports & multi-environment

Reality drifts from code when someone clicks. Detecting drift and importing existing resources keeps the source of truth honest across dev/staging/prod.

External Import existing resources

External Workspaces Cloud codex

Stage

Configuration & secrets

Getting config and secrets where they belong, safely.

Config and secrets are where deploys quietly go wrong. Separate config from code, keep secrets out of Git, and make rotation a routine rather than an emergency.

Core

Config management & 12-factor

Config belongs in the environment, not baked into the image. One artifact, many environments — promoted, not rebuilt.

External The Twelve-Factor App — config

Handbook
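
A minimal sketch of environment-driven config, assuming hypothetical variable names (`DATABASE_URL`, `LOG_LEVEL`, `REQUEST_TIMEOUT_SECONDS`) and a made-up `Settings` type:

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str
    request_timeout_seconds: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from the environment; the same image runs everywhere."""
    return Settings(
        database_url=env["DATABASE_URL"],                      # required: KeyError if missing
        log_level=env.get("LOG_LEVEL", "INFO"),                # optional, with a safe default
        request_timeout_seconds=float(env.get("REQUEST_TIMEOUT_SECONDS", "5")),
    )
```

Failing at startup on a missing required variable is a deliberate choice here: a crash on deploy is cheaper than a misconfigured service that limps along.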

Core

Secrets management

Vaults, encrypted stores, and dynamic credentials. A secret in a repo is a secret leaked: treat it as compromised the moment it lands, and rotate it.

External HashiCorp Vault docs

External AWS Secrets Manager Security codex

Kubernetes ConfigMaps & Secrets

Inject config and credentials into pods without rebuilding images. Just remember that k8s Secrets are only base64-encoded, not encrypted, by default.

External ConfigMaps

External Secrets Kubernetes codex
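
The base64 point is worth seeing once. The helper names below are made up for this example; the encoding itself is the standard library's `base64`.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears in a Secret manifest's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding. No key is needed, which is the whole point."""
    return base64.b64decode(encoded).decode("utf-8")
```

Anyone who can read the manifest or the object can run the second function, so access control and encryption at rest have to come from somewhere else.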

Rotation & least privilege

Short-lived, narrowly-scoped credentials beat long-lived god-keys every time. Make rotation automatic so it actually happens.

External AWS IAM best practices

Security codex

Stage

Cloud platforms

Renting compute, storage, and reliability by the API call.

The big three rent you compute, storage, networking, and a hundred managed services. The skill is not memorising one provider — it is the shared mental model that ports across all of them.

Core

Core primitives — compute, storage, networking

VMs, object storage, block storage, VPCs. The handful of services every higher-level offering is ultimately built on.

Cloud codex

External AWS documentation External Google Cloud docs

Core

IAM & the shared responsibility model

Identity and access is the cloud control plane — and the most common breach vector. Know exactly which half of security is yours.

External AWS IAM docs

External Azure RBAC Security codex

Managed services vs self-hosting

Managed databases, queues, and caches trade money and lock-in for operational toil you no longer carry. Choose where you actually want the pager.

Cloud codex

External AWS Well-Architected System design codex

Edge, CDN & global delivery

Push static assets and caching to the edge, close to users. Latency you remove at the edge is latency you never have to engineer away.

How a CDN works

Sim CloudFront cache API gateway Networking codex

Stage

Observability

Metrics, logs, and traces, so you stop guessing.

You cannot operate what you cannot see. The three pillars — metrics, logs, traces — turn "users say it is slow" into "p99 on the checkout service regressed after the 14:02 deploy."

Core

Metrics & time series

Counters, gauges, histograms, and the RED/USE methods. Metrics are cheap to store and query as long as label cardinality stays low, which makes them the first signal you alert on.

External Prometheus docs

External Grafana docs Performance codex

Core

Structured logging & aggregation

Machine-parseable logs with correlation IDs beat a wall of free text. Centralise them or you will be SSH-ing into nodes during the outage.

External OpenTelemetry logs

External Grafana Loki
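
A minimal sketch of a machine-parseable log line carrying a correlation ID. The function and field names (`log_line`, `new_correlation_id`, `correlation_id`, `service`) are assumptions for illustration, not a standard schema.

```python
import json
import uuid


def new_correlation_id() -> str:
    """Generate an ID once at the edge and pass it along with the request."""
    return uuid.uuid4().hex


def log_line(level: str, message: str, correlation_id: str, **fields: object) -> str:
    """Render one log event as a single JSON object per line."""
    record = {**fields, "level": level, "message": message, "correlation_id": correlation_id}
    return json.dumps(record, sort_keys=True)
```

With every service logging the same `correlation_id`, one query in the aggregator pulls up the full story of a single request.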

Core

Distributed tracing

Follow one request across a dozen services. The only way to find where the latency actually went in a microservice mesh.

External OpenTelemetry tracing

External Jaeger docs System design codex

Dashboards & alerting

Dashboards for humans, alerts for pagers — and never the two confused. Alert on symptoms users feel, not every internal twitch.

External Prometheus alerting

External Grafana alerting

Stage

Scaling & load management

Holding up when the traffic actually shows up.

Scaling is not just adding boxes — it is autoscaling on the right signal, shedding load gracefully, and protecting downstreams from each other. The art is degrading instead of collapsing.

Core

Horizontal vs vertical scaling & autoscaling

Add replicas or grow the box — and let a controller do it on a real signal. Autoscaling on the wrong metric is just an expensive way to thrash.

How autoscaling works

Sim Autoscaling External Kubernetes HPA
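
The core of the Kubernetes Horizontal Pod Autoscaler is one ratio: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that formula with min/max clamping and leaves out the tolerance band and stabilization window the real controller adds; the function name and default bounds are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Scale the replica count by how far the metric is from its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))     # stay inside the configured bounds
```

Four replicas averaging 90% CPU against a 60% target become six. The same formula on a metric that does not move with load is where the thrashing comes from.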

Core

Load balancing strategies

Round-robin, least-connections, consistent hashing. The algorithm decides whether one hot node ruins everyone else's day.

Sim Load balancer

How load balancing works System design codex
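
Of the three strategies, consistent hashing is the least obvious, so here is a small sketch. It is one common construction (a sorted ring with virtual nodes), not the implementation of any particular load balancer; `HashRing` and its parameters are hypothetical.

```python
import bisect
import hashlib
from typing import List


class HashRing:
    """Map keys to nodes so that removing a node only moves that node's keys."""

    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node gets many points on the ring to smooth out the distribution.
        self._ring = sorted((self._hash(f"{node}#{i}"), node)
                            for node in nodes for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # Walk clockwise to the next point, wrapping around at the end of the ring.
        index = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[index][1]
```

Compare that with `hash(key) % node_count`, where losing one node remaps almost every key and empties every cache at once.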

Caching at every layer

Browser, CDN, app, database. The fastest query is the one you never make — but cache invalidation will keep you humble.

How caching works

Sim Distributed cache Performance codex Sim Thundering herd

Rate limiting & load shedding

Protect the system from a stampede by rejecting some requests on purpose. Shedding load is how a service stays up instead of melting down.

Sim Rate limiter

API gateway System design codex
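
A token bucket is one widely used way to reject requests on purpose. The sketch below is a single-process illustration with an injectable clock; `TokenBucket` and its parameters are hypothetical, and a shared limiter across many instances would need external state.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_second
        self._capacity = float(capacity)
        self._tokens = float(capacity)       # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False                         # caller should reject, e.g. with HTTP 429
```

A fast, cheap rejection keeps capacity free for the requests you do accept, which is what staying up under a stampede looks like.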

Stage

Reliability engineering

SLOs, error budgets, and what to do at 3am.

SRE is reliability treated as an engineering discipline, with a budget. Define what "working" means numerically, spend the error budget deliberately, and run incidents like the routine they should be.

Core

SLI, SLO & error budgets

Pick the signal users feel, set a target, and the gap becomes your budget for risk. Spend it on velocity until it runs out, then slow down.

External SRE Workbook — implementing SLOs

External Google SRE Book System design codex
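
The budget arithmetic is short enough to keep in a script. The function names and the 30-day window below are assumptions for illustration; use whatever window your SLO is defined over.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative if blown)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures
```

A 99.9% target over 30 days works out to roughly 43 minutes of downtime. With a million requests and 250 failures, about three quarters of the budget is still available to spend on risky changes.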

Core

Incident response & on-call

Clear roles, a calm comms channel, and a bias toward mitigation over diagnosis. The incident is not the time to be a hero — it is the time to be boring.

External SRE Book — managing incidents

External PagerDuty incident response Handbook

Core

Blameless postmortems

Systems fail; people respond to incentives. A blameless writeup turns one outage into durable organisational learning instead of a witch hunt.

External SRE Book — postmortem culture

External SRE Workbook — postmortems

Resilience patterns

Timeouts, retries with backoff, circuit breakers, bulkheads. The defensive moves that stop one slow dependency from cascading into a full outage.

Sim Circuit breaker

Sim Retry strategy System design codex
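
Retries with backoff can be sketched in a dozen lines. This version uses exponential backoff with full jitter, which is one common choice rather than the only one; `retry_with_backoff` and its parameters are hypothetical, and `sleep` and `jitter` are injectable so the behaviour can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], attempts: int = 4,
                       base_delay: float = 0.1, max_delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep,
                       jitter: Callable[[], float] = random.random) -> T:
    """Call `operation`, retrying on failure with capped, jittered exponential delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                                    # out of attempts: surface the error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(ceiling * jitter())                    # random delay in [0, ceiling)
    raise AssertionError("unreachable")
```

Only retry operations that are safe to repeat, and always pair retries with a timeout. Unbounded, synchronized retries are how a slow dependency becomes a self-inflicted stampede.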

Chaos engineering

Break things on purpose, in controlled blast radii, before they break themselves at 3am. You only really know your failover works once you have tripped it.

Sim Chaos playground

External Principles of Chaos

Stage

Security & compliance (DevSecOps)

Building security into the pipeline, not bolting it on.

Security is not a gate at the end — it is a property you build in and continuously verify. Scan the supply chain, lock down the runtime, and assume breach so detection actually matters.

Core

Shift-left scanning

SAST, dependency, and image scanning in the pipeline catch issues while they are cheap. The earlier the finding, the smaller the blast radius.

Security codex

External OWASP Top Ten External Trivy docs

Core

Container & supply-chain security

Minimal base images, signed artifacts, and an SBOM you can audit. Know exactly what is running, and prove it.

External Docker security

External Sigstore docs External SLSA framework

Kubernetes security & policy

RBAC, network policies, pod security standards, and admission control. A cluster default-open is a cluster default-breached.

External Kubernetes security

External Pod Security Standards Kubernetes codex

Network security & zero trust

mTLS, segmentation, and never trusting the network. Identity at every hop replaces the soft-chewy-center perimeter model.

Sim Service mesh

How HTTPS works Security codex

Stage

Platform & developer experience

Turning ops into a product other engineers self-serve.

Platform engineering treats your internal tooling as a product whose users are other engineers. The win is a golden path so smooth that doing the right thing is also the easy thing.

Core

Internal developer platforms & golden paths

Paved roads, sane defaults, and self-service that removes tickets. The goal is shipping without a human in the loop for the common case.

Handbook

External CNCF Platforms white paper System design codex

Service catalogs & developer portals

One place to find every service, its owner, docs, and health. Backstage-style portals fight the entropy of a growing org.

External Backstage docs

Handbook

Self-service infrastructure

Templated modules and guardrails that let teams provision safely without filing a ticket to the platform team. Autonomy with a safety net.

External Crossplane docs

External Terraform modules Cloud codex
