Filesystem, permissions & processes
Paths, inodes, file descriptors, the process tree, and who is allowed to do what. The mental model that makes everything from Docker layers to systemd units make sense.
Operating systems codex
The full arc, from the shell up through pipelines, containers, clusters, cloud, and the reliability practices that keep it all standing. Every stage is on the critical path. Each topic links to a Semicolony deep dive or simulator where one exists, and to a curated external resource where it doesn't. Follow the arc in order, or jump to wherever you're stuck.
Core plus the recommended layer. The optional stops stay hidden until you have shipped a couple of real systems.
The system everything else runs on top of.
Almost every server you will ever touch is Linux. Get fluent in the filesystem, processes, signals, and the shell — these are the primitives you reach for at 3am when the dashboards are lying.
Paths, inodes, file descriptors, the process tree, and who is allowed to do what. The mental model that makes everything from Docker layers to systemd units make sense.
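To make the path / inode / file descriptor split concrete, here is a minimal sketch. It assumes a filesystem that supports hard links (true for typical Linux temp directories); the function name `inode_demo` and the file names are illustrative only.

```python
import os
import tempfile


def inode_demo() -> dict:
    """Create a file plus a hard link and report what the kernel says about both."""
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "original.txt")  # hypothetical file name
        alias = os.path.join(workdir, "alias.txt")        # hypothetical file name
        with open(original, "w") as handle:
            handle.write("hello\n")

        # A hard link is a second path (directory entry) for the same inode.
        os.link(original, alias)
        stat_original = os.stat(original)
        stat_alias = os.stat(alias)

        # A file descriptor is a per-process handle that also resolves to that inode.
        fd = os.open(original, os.O_RDONLY)
        try:
            stat_fd = os.fstat(fd)
        finally:
            os.close(fd)

        return {
            "same_inode": stat_original.st_ino == stat_alias.st_ino,
            "link_count": stat_original.st_nlink,
            "fd_same_inode": stat_fd.st_ino == stat_original.st_ino,
            "permission_bits": oct(stat_original.st_mode & 0o777),
        }
```

Two paths, one inode, and a descriptor that points at the same object: permissions live on the inode, not on the name.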
Operating systems codex
bash, pipes, grep/sed/awk, exit codes, and trap. Glue you will write a hundred times before you reach for a real language.
External GNU Bash manual
Every read, write, and fork is a syscall. Knowing the boundary between user space and kernel is what separates guessing from diagnosing.
Sim Syscall journey
From BIOS to a login prompt. Understand units, targets, and journald or you will fight your service manager instead of using it.
Sim Linux boot
Packets, ports, and why the request hung.
Distributed systems are networking with extra steps. You do not need to be a CCNA, but you must be able to reason about TCP, DNS, TLS, and the layers of indirection between a client and your pod.
Handshakes, the receive window, timeouts, and retransmits. When latency spikes, this is the layer where the truth lives.
How TCP works
The name-to-address layer that is somehow always the cause. Records, TTLs, caching, and why a stale entry takes you down for exactly the cache lifetime.
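The "down for exactly the cache lifetime" behaviour can be sketched with a toy TTL cache. This is not how any particular resolver is implemented; `TtlCache`, the injected `resolve` callable, and the `api.example.test` name used below are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Toy resolver cache: serve the cached answer until its TTL expires."""

    def __init__(
        self,
        resolve: Callable[[str], Tuple[str, float]],  # hypothetical upstream: name -> (address, ttl_seconds)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (address, expiry time)

    def lookup(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            # Still inside the TTL: the upstream is never consulted,
            # so a changed record stays invisible until expiry.
            return cached[0]
        address, ttl = self._resolve(name)
        self._entries[name] = (address, now + ttl)
        return address
```

If the record changes one second after it was cached with a 60-second TTL, clients keep hitting the old address for the remaining 59 seconds.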
How DNS works
Certificates, the handshake, SNI, and mutual TLS. Encryption in transit is table stakes; expired certs are a recurring outage genre.
How HTTPS works
Subnets, route tables, security groups, and address translation. The cloud network model is just these primitives wearing a console.
How NAT works
L4 vs L7, health checks, connection draining. The traffic cop in front of your fleet, and the first thing to blame when half the requests fail.
How load balancing works
Git as the source of truth for everything you ship.
In DevOps, Git is not just for code — it is the audit log for your infrastructure, your pipelines, and your deploys. GitOps lives or dies on understanding what a commit actually is.
Blobs, trees, commits, refs. Once you see Git as a content-addressed store, rebases and merges stop being scary incantations.
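"Content-addressed" means an object's name is a hash of its content. A minimal sketch of how a blob ID is derived, assuming the default SHA-1 object format (repositories created with the SHA-256 format hash differently); the function name `git_blob_id` is illustrative.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob with this content (SHA-1 format)."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()
```

Identical content always yields the same ID, which is why Git stores one copy of a file no matter how many commits reference it. The result should match `git hash-object` for the same bytes.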
How Git works
Trunk-based vs GitFlow, short-lived branches, and review as a quality gate. The team contract that keeps main shippable.
External GitHub flow
Declare desired state in a repo and let a controller reconcile reality to match. The deploy mechanism that is also your rollback button.
External Argo CD docs
From a commit to production, with no hands on it.
Continuous integration catches breakage early; continuous delivery makes shipping boring. The goal is a pipeline so trustworthy that deploying on a Friday afternoon is a non-event.
Stages, jobs, artifacts, caching, and fail-fast. A green build should mean exactly one thing: this is safe to ship.
External GitHub Actions docs
Unit, integration, lint, coverage thresholds. Gates that block bad code cheaply, before it costs you a rollback.
External GitHub Actions: building & testing
Blue-green, canary, rolling, feature flags. How you change running production without making users your test suite.
Sim K8s rollout
Versioned images and packages, immutable and signed. Knowing exactly what bits are in production is the foundation of every later security claim.
External Docker registry docs
Shipping the whole environment, not just the code.
Containers killed "works on my machine" by packaging the app and its world together. Under the hood they are just Linux namespaces and cgroups — once that clicks, the magic becomes mechanics.
Namespaces, cgroups, and a layered filesystem — not a tiny VM. The isolation is a kernel feature, not a hypervisor.
How containers work
Each Dockerfile line is a cached layer. Order them wrong and every build re-downloads the world; order them right and builds are seconds.
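Why ordering matters can be shown with a deliberately simplified model: each layer's cache key depends on its own instruction and on every layer before it. This is a sketch, not Docker's actual cache algorithm (which also hashes copied file contents); the `@v1` suffixes below stand in for "the inputs of this step changed" and are purely illustrative.

```python
import hashlib
from typing import List


def layer_keys(instructions: List[str]) -> List[str]:
    """Chain hashes so each key covers the instruction and all of its ancestors."""
    keys: List[str] = []
    parent = ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode("utf-8")).hexdigest()
        keys.append(parent)
    return keys


def reused_layers(previous_build: List[str], next_build: List[str]) -> int:
    """Count leading layers whose keys match; everything after the first miss rebuilds."""
    reused = 0
    for old_key, new_key in zip(layer_keys(previous_build), layer_keys(next_build)):
        if old_key != new_key:
            break
        reused += 1
    return reused
```

Copying the whole app before installing dependencies means any source edit invalidates the install step; copying only the dependency manifest first keeps the expensive layer cached.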
Sim Container layers
Build in a fat image, ship in a tiny one. Smaller images mean faster pulls, smaller attack surface, fewer CVEs to triage.
External Multi-stage builds
Bridge networks, port publishing, and persistent volumes. Where state lives when the container itself is disposable.
External Docker networking
You declare the state; it keeps reality matching.
Kubernetes is a control loop that keeps your declared state and actual state in sync — and reschedules around failure so you do not have to. Steep curve, but it is the lingua franca of modern infra.
The control loop that keeps your declared state and actual state in sync — and pages you when they diverge. Everything in k8s is a variation on this theme.
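The core of that loop fits in a few lines. The sketch below is a toy, not Kubernetes controller code: `reconcile`, the action strings, and the pod names are all hypothetical.

```python
from typing import List


def reconcile(desired_replicas: int, running_pods: List[str]) -> List[str]:
    """Compare declared state with observed state and return the actions that close the gap."""
    actions: List[str] = []
    missing = desired_replicas - len(running_pods)
    if missing > 0:
        # Too few pods: ask for one creation per missing replica.
        actions.extend(["create"] * missing)
    elif missing < 0:
        # Too many pods: delete the surplus (sorted only to make the choice deterministic).
        for name in sorted(running_pods)[desired_replicas:]:
            actions.append(f"delete {name}")
    return actions
```

A real controller runs this comparison repeatedly, so a pod that dies simply shows up as a gap on the next pass.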
Sim Pod creation
Declare how many replicas you want and let the controller chase it. Rollouts and rollbacks become a single field change.
Sim K8s rollout
How a stable virtual IP finds an ephemeral pod, and how traffic gets in from outside. The networking layer most newcomers get burned by.
K8s networking
How pods land on nodes, and what happens when you over-commit. Get requests wrong and you either waste money or get evicted.
Sim Container scheduler
Templated, versioned releases instead of a folder of YAML. Operators take it further, encoding the human runbook into a controller.
External Helm docs
Your whole cloud, in a diff you can review.
Click-ops does not scale and cannot be reviewed. IaC turns infrastructure into versioned, planned, peer-reviewed code — so the thing you deploy is the thing you read.
Describe the end state, let the tool compute the diff, then apply it. The plan is your dry-run safety net before anything mutates.
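"Compute the diff" is the whole trick. Here is a minimal sketch of a plan step over plain dictionaries; it is not Terraform's implementation, and the `plan` function and resource names are hypothetical.

```python
from typing import Any, Dict, List


def plan(desired: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, List[str]]:
    """Diff declared resources against observed ones without changing anything."""
    return {
        # Declared but not present yet.
        "create": sorted(name for name in desired if name not in actual),
        # Present, but with settings that differ from the declaration.
        "update": sorted(
            name for name in desired if name in actual and desired[name] != actual[name]
        ),
        # Present but no longer declared.
        "delete": sorted(name for name in actual if name not in desired),
    }
```

Because the function only reads its inputs, it can be reviewed like any other diff before an apply step mutates anything.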
External Terraform docs
Terraform tracks reality in a state file. Corrupted or unlocked, it lets two engineers' applies trample each other. Remote state with locking is non-negotiable for teams.
External Terraform state
Reusable, parameterised building blocks for your infra. The difference between a maintained platform and a pile of copy-pasted resources.
External Terraform modules
Reality drifts from code when someone clicks. Detecting drift and importing existing resources keeps the source of truth honest across dev/staging/prod.
External Import existing resources
Getting config and secrets where they belong, safely.
Config and secrets are where deploys quietly go wrong. Separate config from code, keep secrets out of Git, and make rotation a routine rather than an emergency.
Config belongs in the environment, not baked into the image. One artifact, many environments — promoted, not rebuilt.
External The Twelve-Factor App: config
Vaults, encrypted stores, and dynamic credentials. A secret in a repo is a secret leaked; assume it the moment it lands.
External HashiCorp Vault docs
Inject config and credentials into pods without rebuilding images. Just remember k8s Secrets are base64, not encrypted, by default.
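The base64 point is easy to verify: encoding is reversible by anyone, with no key involved. A short sketch, where the helper names and the sample value `hunter2` are illustrative only:

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears in a Secret manifest's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding; note that no key or password is needed."""
    return base64.b64decode(encoded).decode("utf-8")
```

Anyone who can read the manifest or the API object can run the second function, which is why access control and encryption at rest have to be configured separately.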
External ConfigMaps
Short-lived, narrowly-scoped credentials beat long-lived god-keys every time. Make rotation automatic so it actually happens.
External AWS IAM best practices
Renting compute, storage, and reliability by the API call.
The big three rent you compute, storage, networking, and a hundred managed services. The skill is not memorising one provider — it is the shared mental model that ports across all of them.
VMs, object storage, block storage, VPCs. The handful of services every higher-level offering is ultimately built on.
Cloud codex
Identity and access is the cloud control plane, and the most common breach vector. Know exactly which half of security is yours.
External AWS IAM docs
Managed databases, queues, and caches trade money and lock-in for operational toil you no longer carry. Choose where you actually want the pager.
Cloud codex
Push static assets and caching to the edge, close to users. Latency you remove at the edge is latency you never have to engineer away.
How a CDN works
Metrics, logs, and traces, so you stop guessing.
You cannot operate what you cannot see. The three pillars — metrics, logs, traces — turn "users say it is slow" into "p99 on the checkout service regressed after the 14:02 deploy."
Counters, gauges, histograms, and the RED/USE methods. The cheap, high-cardinality-averse signal you alert on first.
External Prometheus docs
Machine-parseable logs with correlation IDs beat a wall of free text. Centralise them or you will be SSH-ing into nodes during the outage.
External OpenTelemetry logs
Follow one request across a dozen services. The only way to find where the latency actually went in a microservice mesh.
External OpenTelemetry tracing
Dashboards for humans, alerts for pagers, and never the two confused. Alert on symptoms users feel, not every internal twitch.
External Prometheus alerting
Holding up when the traffic actually shows up.
Scaling is not just adding boxes — it is autoscaling on the right signal, shedding load gracefully, and protecting downstreams from each other. The art is degrading instead of collapsing.
Add replicas or grow the box — and let a controller do it on a real signal. Autoscaling on the wrong metric is just an expensive way to thrash.
How autoscaling works
Round-robin, least-connections, consistent hashing. The algorithm decides whether one hot node ruins everyone else's day.
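Of the three, consistent hashing is the least obvious, so here is a minimal sketch of a hash ring with virtual nodes. The class name `HashRing`, the `vnodes` default, and the node names are assumptions for illustration, not any load balancer's actual implementation.

```python
import bisect
import hashlib
from typing import List, Tuple


def _point(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring: a key belongs to the first node point clockwise from it."""

    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        # Each node gets several points so load spreads more evenly.
        self._ring: List[Tuple[int, str]] = sorted(
            (_point(f"{node}#{index}"), node) for node in nodes for index in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Find the first point past the key, wrapping around at the end of the ring.
        index = bisect.bisect_right(self._points, _point(key)) % len(self._ring)
        return self._ring[index][1]
```

The property worth noticing: removing a node only remaps the keys that node owned, so the rest of the fleet keeps its warm caches and sticky sessions.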
Sim Load balancer
Browser, CDN, app, database. The fastest query is the one you never make, but cache invalidation will keep you humble.
How caching works
Protect the system from a stampede by rejecting some requests on purpose. Shedding load is how a service stays up instead of melting down.
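One common way to reject requests on purpose is a token bucket. The sketch below is one possible shape, not a specific library's API; `TokenBucket` and its parameters are hypothetical, and the clock is injectable so the behaviour can be checked without sleeping.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        # Out of tokens: shed this request instead of queueing it.
        return False
```

A caller would typically answer a `False` with an HTTP 429 or similar, which is cheap to serve even under a stampede.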
Sim Rate limiter
SLOs, error budgets, and what to do at 3am.
SRE is reliability treated as an engineering discipline, with a budget. Define what "working" means numerically, spend the error budget deliberately, and run incidents like the routine they should be.
Pick the signal users feel, set a target, and the gap becomes your budget for risk. Spend it on velocity until it runs out, then slow down.
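The arithmetic behind an error budget is short enough to write down. A sketch with illustrative function names; the 99.9% target and 30-day window used in the note below are example numbers, not recommendations.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative means overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For a 99.9% target over 30 days, the first function gives roughly 43 minutes; with one million requests and 250 failures, about three quarters of the budget is left to spend on velocity.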
External SRE Workbook: implementing SLOs
Clear roles, a calm comms channel, and a bias toward mitigation over diagnosis. The incident is not the time to be a hero; it is the time to be boring.
External SRE Book: managing incidents
Systems fail; people respond to incentives. A blameless writeup turns one outage into durable organisational learning instead of a witch hunt.
External SRE Book: postmortem culture
Timeouts, retries with backoff, circuit breakers, bulkheads. The defensive moves that stop one slow dependency from cascading into a full outage.
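Of these moves, the circuit breaker is the one most often described and least often seen in code. A minimal sketch under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, and the threshold values are hypothetical, and a production version would also need thread safety and per-dependency configuration.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Open: fail fast instead of piling more load on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                # Trip (or re-trip after a failed trial) and restart the cool-down.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast is the point: callers get an immediate error they can handle, rather than each holding a thread until a timeout fires.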
Sim Circuit breaker
Break things on purpose, in controlled blast radii, before they break themselves at 3am. You only really know your failover works once you have tripped it.
Sim Chaos playground
Building security into the pipeline, not bolting it on.
Security is not a gate at the end — it is a property you build in and continuously verify. Scan the supply chain, lock down the runtime, and assume breach so detection actually matters.
SAST, dependency, and image scanning in the pipeline catch issues while they are cheap. The earlier the finding, the smaller the blast radius.
Security codex
Minimal base images, signed artifacts, and an SBOM you can audit. Know exactly what is running, and prove it.
External Docker security
RBAC, network policies, pod security standards, and admission control. A cluster default-open is a cluster default-breached.
External Kubernetes security
mTLS, segmentation, and never trusting the network. Identity at every hop replaces the soft-chewy-center perimeter model.
Sim Service mesh
Turning ops into a product other engineers self-serve.
Platform engineering treats your internal tooling as a product whose users are other engineers. The win is a golden path so smooth that doing the right thing is also the easy thing.
Paved roads, sane defaults, and self-service that removes tickets. The goal is shipping without a human in the loop for the common case.
Handbook
One place to find every service, its owner, docs, and health. Backstage-style portals fight the entropy of a growing org.
External Backstage docs
Templated modules and guardrails that let teams provision safely without filing a ticket to the platform team. Autonomy with a safety net.
External Crossplane docs