14 stages · 70 topics · 34 core

Roadmap

Become a data engineer.

All stages, in order. The full arc from SQL fundamentals through storage engines, batch and streaming pipelines, the warehouse, and out the far end into feature stores and model serving. This is the broad track: you end up able to land data anywhere it needs to go and reason about every layer it passes through. Each topic links to a Semicolony deep dive or simulator where one exists, and to a curated external resource where it doesn't. Follow the arc in order, or jump to wherever you're stuck.


Core (the spine) Recommended (strong upside) Optional (pick if relevant)


Jump to a stage

01 Data foundations & SQL
02 Relational databases & modeling
03 NoSQL & data stores
04 Storage engines & file formats
05 Batch processing
06 Streaming
07 Data pipelines & orchestration
08 Data warehousing & lakehouse
09 Data quality, governance & observability
10 Distributed data & partitioning
11 ML fundamentals
12 Feature engineering & feature stores
13 Model training & experimentation
14 Model serving & MLOps

Stage

Data foundations & SQL

The language all data work starts with, and the model under it.

Everything downstream is built on relational thinking, even the systems that claim to have escaped it. Before Spark or feature stores mean anything, you need SQL in your fingers and a real picture of how the engine turns a declarative query into a plan: scans, joins, aggregations, and the order they actually run in.

Core

SQL fundamentals

SELECT, WHERE, GROUP BY, HAVING, the lot. The order you write a query is not the order it executes — FROM and JOIN run first, SELECT almost last. Internalize that and half of SQL stops being surprising.

External PostgreSQL: SQL language

Sim SQL query execution External Mode SQL tutorial External use the index, Luke
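One way to feel the logical evaluation order from SQL fundamentals is to watch WHERE and HAVING act at different points of the same query. The sketch below uses Python's built-in sqlite3 as a stand-in engine; the orders table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ada", 10), ("ada", 200), ("bob", 50), ("bob", 60), ("cy", 5)],
)

# Logical order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY.
# WHERE drops individual rows before grouping; HAVING drops whole groups after.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount >= 10          -- row filter: cy's single row never reaches a group
    GROUP BY customer
    HAVING SUM(amount) > 150    -- group filter: bob's total of 110 is dropped here
    ORDER BY customer
    """
).fetchall()
print(rows)
```

Only ada survives both filters. Moving the aggregate condition into WHERE would be an error, because the aggregate does not exist yet when WHERE runs.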

Core

Joins, in depth

Inner, left, the dreaded cross join you triggered by accident. The interesting part is physical: nested loop is fine for small inputs, hash join wins on big unsorted ones, merge join wants sorted data. The planner picks; you should know why.

Sim SQL joins simulator

External PostgreSQL: join methods Databases curriculum External Mode: SQL joins
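The physical difference between join methods is easier to see in a few lines than in a plan. This is a toy sketch of a nested loop join and a hash join over in-memory lists, not how any particular planner implements them; the users and orders data are hypothetical.

```python
from collections import defaultdict

users = [(1, "ada"), (2, "bob"), (3, "cy")]            # (user_id, name)
orders = [(101, 1), (102, 1), (103, 3), (104, 9)]      # (order_id, user_id)

def nested_loop_join(left, right):
    """Compare every pair: cost grows with len(left) * len(right)."""
    return [(name, oid) for uid, name in left for oid, o_uid in right if uid == o_uid]

def hash_join(build, probe):
    """Build a hash table on one side, probe it with the other: roughly linear cost."""
    table = defaultdict(list)
    for uid, name in build:
        table[uid].append(name)
    return [(name, oid) for oid, o_uid in probe for name in table.get(o_uid, [])]

print(nested_loop_join(users, orders))
print(hash_join(users, orders))
```

Both return the same inner-join rows. The hash version pays memory for the build side, which is why planners prefer to build on the smaller input.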

Window functions & analytics

ROW_NUMBER, LAG, running totals, partitioned ranks. The difference between a CTE-and-subquery mess and three clean lines. Most analytics SQL that looks clever is just window functions used confidently.

External PostgreSQL: window functions

External Mode: window functions External SQL window functions explained
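A small illustration of the three window functions named above, run through Python's sqlite3. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions); the sales table is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("eu", 1, 10), ("eu", 2, 30), ("eu", 3, 20), ("us", 1, 5), ("us", 2, 15)],
)

rows = conn.execute(
    """
    SELECT region, day, amount,
           ROW_NUMBER() OVER w AS rn,            -- position within the region
           LAG(amount)  OVER w AS prev_amount,   -- previous day's amount, NULL on day 1
           SUM(amount)  OVER w AS running_total  -- cumulative sum up to this row
    FROM sales
    WINDOW w AS (PARTITION BY region ORDER BY day)
    ORDER BY region, day
    """
).fetchall()
for row in rows:
    print(row)
```

The same result without window functions needs a self-join or correlated subquery per column.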

Reading query plans

EXPLAIN ANALYZE is the single highest-leverage skill in data work. A seq scan over 40M rows where you expected an index hit, the estimated-vs-actual rows blowing up tenfold — the plan tells you before the dashboard does.

Sim Query execution simulator

External PostgreSQL: using EXPLAIN Performance curriculum
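The habit of reading a plan before and after a change can be practiced on any engine. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a lightweight stand-in for PostgreSQL's EXPLAIN ANALYZE (it reports the chosen access path only, with no timings or row estimates); the table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i) for i in range(1000)],
)

def plan(sql):
    # The fourth column of each plan row holds the human-readable detail text.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE customer_id = 42"
before = plan(query)                                   # full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)                                    # index search
print(before)
print(after)
```

The exact wording of the plan text varies by SQLite version, but the shift from a scan to a search using the index is the thing to look for.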

Stage

Relational databases & modeling

Schemas, indexes, transactions — the engine behind the SQL.

A query is only as good as the schema it runs against and the indexes backing it. This stage is about the relational engine itself: how a B-tree index changes a 2-second scan into a 2ms lookup, what ACID actually guarantees, and how to model entities so the database does the work instead of fighting you.

Core

Database indexing

B-trees for range queries, hash for equality, partial and covering indexes for the cases that matter. The trap is over-indexing: every index is a tax on every write. Index the read path you actually have, not the one you imagine.

How database indexing works

Sim B-tree simulator Databases curriculum External use the index, Luke External PostgreSQL: indexes

Core

Transactions & ACID

Atomicity and durability are easy to want and hard to deliver under failure. The real subtlety is isolation: most databases default to Read Committed, not Serializable, and the anomalies that leak through are the bugs nobody can reproduce.

How transactions work

Sim ACID simulator Sim Isolation levels External PostgreSQL: transaction isolation
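Atomicity is easiest to trust after watching a half-finished transfer disappear. A minimal sketch with sqlite3, assuming a made-up accounts table with a CHECK constraint that forbids negative balances:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 50), ("bob", 0)])
conn.commit()

def transfer(src, dst, amount):
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )

try:
    transfer("alice", "bob", 100)    # the debit violates the CHECK constraint
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The credit to bob ran first and succeeded, yet it is gone after the rollback: both updates land or neither does.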

Core

Normalization & schema design

Third normal form keeps your OLTP writes honest — one fact, one place. Then you denormalize on purpose for reads. Knowing when to break the rules is the whole skill; doing it by accident is the whole problem.

Databases curriculum

External PostgreSQL: data definition External Designing Data-Intensive Applications

The write-ahead log

Durability without fsync-on-every-row: write the intent to a sequential log first, apply later. It is the same trick under Postgres crash recovery, Kafka, and every LSM store — learn it once and it keeps reappearing.

How the WAL works

Sim WAL recovery simulator External PostgreSQL: WAL
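The log-first idea fits in a toy key-value store. This is a sketch under simplifying assumptions (one JSON record per line, no checkpoints, no torn-write handling), not how PostgreSQL or any real engine lays out its WAL; TinyKV is a hypothetical name.

```python
import json
import os
import tempfile

class TinyKV:
    """Toy store: append each write to a log file before applying it in memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        # Crash recovery: rebuild in-memory state by replaying the log in order.
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    record = json.loads(line)
                    self.data[record["k"]] = record["v"]

    def put(self, key, value):
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"k": key, "v": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())     # sequential log write is durable before we apply
        self.data[key] = value       # real engines batch many records per fsync

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyKV(log_path)
store.put("a", 1)
store.put("a", 2)
recovered = TinyKV(log_path)         # simulates a restart after a crash
print(recovered.data)
```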

Isolation levels in practice

Dirty reads, non-repeatable reads, phantoms, write skew. Each level trades anomalies for throughput. The interview question is "what breaks at Read Committed"; the production question is "why did two transfers both succeed."

Sim Isolation levels simulator

External PostgreSQL: transaction isolation External Designing Data-Intensive Applications

Stage

NoSQL & data stores

Key-value, wide-column, document — and when to reach for each.

NoSQL is not "no schema," it is "schema you enforce in the application instead of the engine." Each family makes a specific trade: Redis trades durability for latency, Cassandra trades joins for linear write scaling, document stores trade rigidity for shape. Pick by access pattern, not by hype.

Core

Key-value stores & Redis

An in-memory hash map you can talk to over the network, plus data structures. Cache, rate limiter, leaderboard, session store. The catch is memory is finite and durable-by-default it is not — decide your eviction and persistence on purpose.

How Redis works

Sim Redis operations simulator Redis persistence Sim Distributed cache simulator External Redis documentation

Core

Wide-column & Cassandra

You design the table around the query, partition key first, and you accept no joins. In return you get writes that scale linearly across nodes and survive a datacenter going dark. AP by default — tunable, but never free.

Cassandra replication

Sim Read/write quorum Sim Consistent hashing External Cassandra documentation

Quorums & tunable consistency

R + W > N is the whole game: pick how many replicas must ack a read and a write, and you slide along the consistency/availability line per query. Strong reads cost latency; eventual reads cost a stale answer now and then.

Sim Read/write quorum simulator

Sim CAP theorem Cassandra replication Paper Dynamo paper
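The R + W > N rule can be checked by brute force: enumerate every possible read set and write set and see whether they always share a replica. A small sketch; the function name is invented for the example.

```python
from itertools import combinations

def quorums_overlap(n, r, w):
    """True if every read set of size r intersects every write set of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )

print(quorums_overlap(3, 2, 2))   # every read touches a replica holding the latest write
print(quorums_overlap(3, 1, 1))   # a read can land on a replica the write never reached
```

Overlap is what guarantees a read sees at least one copy of the latest acknowledged write; without it you are in eventual-consistency territory.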

Document & search stores

MongoDB and Elasticsearch let the shape live with the data instead of in a migration. Great when the model is fluid; a footgun when you reinvent joins in application code because you skipped modeling.

Databases curriculum

External MongoDB data modeling External Designing Data-Intensive Applications

Stage

Storage engines & file formats

LSM vs B-tree, row vs column, and how bytes hit disk.

Under every database is a storage engine making one big bet: optimize for writes or for reads. And under every data lake is a file format making another: row-oriented for transactions, column-oriented for scans. These two choices explain more about performance than any amount of query tuning.

Core

LSM trees vs B-trees

B-trees update in place: read-optimized, write-amplified. LSM trees buffer writes in an in-memory memtable, flush it to immutable sorted files (SSTables), and compact later: write-optimized, with read amplification you fight using bloom filters. Most "fast write" databases are LSM under the hood.

Sim LSM tree simulator

Sim Storage engine simulator How the WAL works External Designing Data-Intensive Applications
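The LSM write and read paths are short enough to sketch. This toy keeps everything in memory and skips the WAL, tombstones, and leveled compaction that real engines need; TinyLSM and its memtable_limit parameter are hypothetical.

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []           # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into an immutable sorted run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):           # newest run wins
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self):
        merged = {}
        for table in self.sstables:                     # newer runs overwrite older
            merged.update(table)
        self.sstables = [sorted(merged.items())]

db = TinyLSM()
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    db.put(key, value)
print(db.get("a"), len(db.sstables))
db.compact()
print(db.get("a"), len(db.sstables))
```

A read for a missing key has to check every run, which is the read amplification that compaction and bloom filters exist to reduce.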

Core

Columnar formats: Parquet & ORC

Store by column and a SELECT of three columns from a 200-column table reads three columns of disk, not all 200. Add run-length and dictionary encoding and analytics queries get cheaper by an order of magnitude. The default for any lake.

External Apache Parquet documentation

Sim Storage engine simulator External Spark: data sources Performance curriculum
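Dictionary and run-length encoding are simple once the values of one column sit together. The sketch below shows the idea on a tiny column; it is an illustration of the two encodings, not Parquet's actual on-disk layout, and the country values are invented.

```python
from itertools import groupby

def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = sorted(set(values))
    code = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [code[value] for value in values]

def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def decode(dictionary, runs):
    return [dictionary[code] for code, length in runs for _ in range(length)]

country = ["DE", "DE", "DE", "US", "US", "DE", "FR", "FR"]
dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)
print(dictionary, codes, runs)
```

Sorting or clustering the data by a low-cardinality column makes the runs longer, which is where most of the savings come from.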

Bloom filters

A probabilistic "is this key definitely absent?" check that costs a few bits per element. No false negatives, tunable false positives. It is what keeps an LSM read from touching every SSTable on disk for a key that was never written.

Sim Bloom filter simulator

Sim LSM tree simulator Databases curriculum
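A bloom filter is a bit array plus a few hash functions. A minimal sketch, assuming SHA-256 salted with an index as the hash family (real implementations use faster non-cryptographic hashes) and arbitrary default sizes:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bloom = BloomFilter()
for i in range(100):
    bloom.add(f"key-{i}")
print(bloom.might_contain("key-7"))
```

An added key always reports True; an absent key usually reports False, with a false-positive rate set by the ratio of bits to elements and the number of hashes.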

Compression & encoding

Snappy for speed, Zstd for ratio, dictionary encoding for low-cardinality columns. In columnar storage compression is not an afterthought — similar values sit next to each other, so the encoding choice is half your storage bill.

External Apache Parquet: encodings

External Spark: data sources Performance curriculum

Stage

Batch processing

Spark, the DAG, and moving compute to the data.

When the dataset stops fitting on one machine, you stop iterating over it and start describing a transformation graph that a cluster executes for you. Spark is the default. The whole game is understanding the partition — how data is split across executors — because every slow Spark job is a partitioning problem in disguise.

Core

Spark fundamentals & the DAG

You write transformations; nothing runs until an action forces it. Spark builds a lazy DAG, splits it at shuffle boundaries into stages, and schedules tasks per partition. Knowing where the stage boundaries fall is knowing where the cost is.

External Spark: programming guide

External Spark SQL guide Distributed systems curriculum External Learning Spark (free, Databricks)

Core

Partitioning & the shuffle

The shuffle is where Spark moves data across the network to colocate keys — and it is where jobs die. A wide transformation on a skewed key sends 90% of rows to one executor. Repartition, salt the key, or broadcast the small side.

Sim Database sharding simulator

External Spark: performance tuning Performance curriculum
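Key salting is easier to reason about with a partition count in hand. This is a toy simulation in plain Python, not Spark code: the byte-sum partitioner is a deliberately simple stand-in for a real hash partitioner, and the key names and counts are made up.

```python
from collections import Counter

NUM_PARTITIONS = 4
NUM_SALTS = 8

def partition_of(key):
    # Toy deterministic partitioner (byte sum); real engines use a proper hash.
    return sum(key.encode()) % NUM_PARTITIONS

# 90% of rows share one hot key, as in the skew described above.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]
before = Counter(partition_of(k) for k in keys)

# Salt only the known hot key, cycling through NUM_SALTS suffixes.
salted = [f"{k}#{i % NUM_SALTS}" if k == "hot" else k for i, k in enumerate(keys)]
after = Counter(partition_of(k) for k in salted)

print("max rows on one partition before:", max(before.values()))
print("max rows on one partition after: ", max(after.values()))
```

Salting has a cost on the other side of a join: each matching row there must be replicated once per salt value so the salted keys still find their partners.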

Core

Joins at scale

Broadcast the small table and skip the shuffle entirely; otherwise it is a sort-merge across the cluster. The single biggest Spark win is recognizing when one side fits in memory and forcing the broadcast before the planner guesses wrong.

External Spark: performance tuning

Sim SQL joins simulator External Spark SQL guide

MapReduce & the lineage

Spark won, but MapReduce is the model underneath: map locally, shuffle by key, reduce. The paper still pays off: fault tolerance by re-executing failed tasks from their inputs, instead of replicating state, is the idea every batch engine inherited, and Spark generalized it into recomputing lost partitions from lineage.

Paper MapReduce paper (Google)

External Spark: programming guide External Designing Data-Intensive Applications
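The three phases of the MapReduce model, sketched as a single-process word count. Everything runs in one Python process here, so it shows the data flow only, not the distribution or fault tolerance.

```python
from collections import defaultdict

def map_phase(doc):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in doc.lower().split()]

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

docs = ["the log is the truth", "the log wins"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```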

Stage

Streaming

Kafka, Flink, and data that never stops arriving.

Batch asks "what happened yesterday." Streaming asks "what is happening now," and that small change makes everything harder: events arrive late, out of order, or twice, and your window has to close before all the data shows up. Kafka is the durable log everyone agrees on; Flink and friends do the math on top.

Core

Kafka as the central log

Not a queue — a durable, partitioned, replayable log. Producers append, consumers track their own offset, the data sticks around. Once you see it as the source of truth other systems subscribe to, half of modern data architecture clicks.

How Kafka works

Kafka storage internals Sim LSM tree simulator External Kafka documentation
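The difference between a queue and a log shows up in who owns the read position. A toy in-memory sketch of the log model, with no partitions, replication, or retention; TinyLog and the consumer names are hypothetical and this is not the Kafka client API.

```python
class TinyLog:
    """Append-only log: readers own their position, and reads never delete data."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1          # offset of the new record

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]

log = TinyLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

offsets = {"billing": 0, "analytics": 0}      # each consumer tracks its own offset
batch = log.read(offsets["billing"], max_records=2)
offsets["billing"] += len(batch)              # billing advances; analytics is unaffected

replay = log.read(0)                          # any consumer can rewind and replay
print(batch, offsets, replay)
```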

Core

Partitions, offsets & consumer groups

Ordering is guaranteed only within a partition, parallelism is capped by partition count, and a consumer group splits partitions across members. Rebalances stop the world briefly — design your partition count and keys for both.

Kafka storage internals

How Kafka works Message queues External Kafka: consumer docs

Core

Delivery semantics

Exactly-once is mostly a marketing term; end to end, what you actually get is at-least-once delivery plus idempotent writes. Decide whether a duplicate is survivable or catastrophic, then build for at-least-once and dedupe on a key. That is the honest version.

Message queues

External Kafka: delivery semantics How Kafka works
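At-least-once plus dedupe on a key looks like this in miniature. A sketch with in-memory state; the event shape and function name are assumptions for illustration. In a real system the seen-ID record and the balance update would need to commit in the same transaction, or the dedupe itself can be lost in a crash.

```python
def apply_payments(events, balances, seen_ids):
    """Apply each event at most once, keyed on its unique event id."""
    for event in events:
        if event["id"] in seen_ids:           # duplicate delivery: skip it
            continue
        account = event["account"]
        balances[account] = balances.get(account, 0) + event["amount"]
        seen_ids.add(event["id"])

events = [
    {"id": "e1", "account": "a", "amount": 10},
    {"id": "e2", "account": "a", "amount": 5},
]
redelivered = events + [events[0]]            # the broker retries e1 after a timeout

balances, seen_ids = {}, set()
apply_payments(redelivered, balances, seen_ids)
print(balances)
```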

Stream processing & windows

Flink does stateful processing over unbounded streams: tumbling, sliding, and session windows with real fault-tolerant state. The hard part is always when to close a window when you do not know if more data is coming.

External Apache Flink documentation

External Flink: concepts External Kafka Streams docs
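The window-closing problem can be sketched with a tumbling window and a simple watermark. This is a plain-Python illustration under simplifying assumptions (the watermark is the maximum event time seen minus an allowed lateness, and late events are dropped rather than routed to a side output); it is not the Flink API.

```python
from collections import defaultdict

WINDOW = 60  # seconds of event time per tumbling window

def tumbling_counts(events, allowed_lateness=0):
    """Count events per window; drop events whose window closed behind the watermark."""
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")
    for ts, name in events:                        # arrival order, not event-time order
        watermark = max(watermark, ts - allowed_lateness)
        window_start = ts - ts % WINDOW
        if window_start + WINDOW <= watermark:     # window already closed
            dropped.append((ts, name))
            continue
        counts[window_start] += 1
    return dict(counts), dropped

events = [(5, "a"), (50, "b"), (65, "c"), (58, "late"), (130, "d"), (40, "too_late")]
print(tumbling_counts(events, allowed_lateness=0))
print(tumbling_counts(events, allowed_lateness=10))
```

With no allowed lateness the event at 58 is dropped because 65 already closed the first window; ten seconds of lateness keeps it. The trade is latency: the longer you wait, the later results are final.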

Stage

Data pipelines & orchestration

Airflow, dbt, and making jobs run in the right order, reliably.

A pipeline is a set of tasks with dependencies, run on a schedule, that must survive partial failure and rerun cleanly. Orchestration is how you express that graph and recover when step 4 of 9 dies at 3am. dbt owns the transform layer; Airflow and friends own the schedule and the retries.

Core

Orchestration with Airflow

A DAG of tasks, a scheduler, and a notion of retries and backfills. The mental shift is that the orchestrator should not move data — it should trigger the systems that do, and track whether they succeeded. Keep tasks idempotent or backfills bite.

External Airflow documentation

External Airflow: core concepts External Airflow: best practices

Core

Transformation with dbt

SQL plus version control, tests, and a dependency graph it builds for you. ELT, not ETL — load raw, transform in the warehouse, let the warehouse do the heavy lifting. The metrics layer and lineage you get for free are why analytics teams standardized on it.

External dbt documentation

External dbt: best practices External dbt: building models

Core

Idempotency & backfills

Every task you write should be safe to run twice and safe to run for last Tuesday. Write to partitions you fully overwrite, not appends you cannot undo. The pipelines that survive are the ones where rerunning is boring, not scary.

External Airflow: best practices

External dbt: incremental models System design curriculum
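The difference between an append load and a partition overwrite shows up the second time the task runs. A toy sketch where a dict of day partitions stands in for a table; the function names and rows are hypothetical.

```python
def load_append(table, day, rows):
    # Not idempotent: a rerun for the same day doubles the data.
    table.setdefault(day, []).extend(rows)

def load_overwrite(table, day, rows):
    # Idempotent: a rerun replaces the whole partition with the same result.
    table[day] = list(rows)

rows = [{"order": 1}, {"order": 2}]
appended, overwritten = {}, {}
for _ in range(2):                     # simulate a retry or a backfill of the same day
    load_append(appended, "2024-01-02", rows)
    load_overwrite(overwritten, "2024-01-02", rows)

print(len(appended["2024-01-02"]), len(overwritten["2024-01-02"]))
```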

Scheduling & dependency graphs

Cron is fine until task B needs task A to have finished and you discover cron has no idea about dependencies. The whole reason orchestrators exist is to turn "run at 2am and hope" into "run when upstream is actually ready."

External Airflow: core concepts

External Airflow documentation System design curriculum
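What an orchestrator knows that cron does not is the dependency graph. A minimal sketch using the standard library's graphlib (Python 3.9+) to derive a valid run order; the task names are invented and this is not Airflow's DAG API.

```python
from graphlib import TopologicalSorter

# task -> set of upstream tasks it depends on (hypothetical pipeline)
dag = {
    "extract": set(),
    "load_raw": {"extract"},
    "transform": {"load_raw"},
    "test": {"transform"},
    "publish": {"test"},
    "notify": {"test"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)
```

Every task appears after all of its upstream tasks. Independent tasks such as publish and notify have no fixed relative order, which is exactly the freedom a scheduler uses to run them in parallel.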

Stage

Data warehousing & lakehouse

Snowflake, BigQuery, and the table format that merged lake and warehouse.

The warehouse is where data goes to be queried by humans and dashboards: columnar, with storage separated from compute, and modeled for analytics, not transactions. The lakehouse is the newer move: keep cheap object storage but bolt warehouse guarantees on top with an open table format. This stage covers both worlds.

Core

Cloud warehouses

Snowflake and BigQuery decoupled storage from compute so you scale them independently and pay for query, not for idle. The model is simple to use and easy to bankrupt yourself on — a missing partition filter is a full-table scan with a bill.

External Snowflake documentation

External BigQuery documentation External BigQuery: best practices

Core

Dimensional modeling

Star schema: one fact table of events, dimension tables of context around it. Kimball over Inmon for most teams. Slowly-changing dimensions are where it gets real — do you overwrite the old value or keep history? The answer changes the whole model.

Databases curriculum

External dbt: dimensional modeling External Kimball: dimensional modeling techniques

Lakehouse table formats

Iceberg, Delta, Hudi put ACID transactions, schema evolution, and time travel on top of Parquet files in object storage. The point is you stop choosing between cheap lake and reliable warehouse — you get atomic commits on S3.

External Apache Iceberg documentation

External Delta Lake documentation External Apache Parquet documentation

Partitioning & clustering the warehouse

Partition by the column you filter on — usually date — so the engine prunes instead of scans. Cluster by the column you group on. Get these two right and a query goes from terabytes scanned to gigabytes; get them wrong and no amount of compute saves you.

Sim Database sharding simulator

External BigQuery: partitioning External Snowflake: micro-partitions

Stage

Data quality, governance & observability

Tests, contracts, and lineage — catching bad data early.

A pipeline that runs green while producing garbage is worse than one that fails loudly. This stage is the discipline that separates a data platform you trust from a pile of jobs you babysit: tests on the data itself, contracts at the boundaries, lineage to trace the blast radius, and observability to catch silent drift.

Core

Data testing & assertions

Not unit tests on code — assertions on the data: row counts in range, no nulls in the key, referential integrity intact. dbt tests and Great Expectations turn "I think the data is fine" into a check that fails the build when it is not.

External dbt: tests

External Great Expectations docs External dbt: best practices
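The assertions named above (row count, no nulls in the key, uniqueness, referential integrity) are small functions at heart. A hand-rolled sketch over lists of dicts; it mirrors the idea behind dbt tests and Great Expectations but uses neither library, and all names are hypothetical.

```python
def null_violations(rows, column):
    return [r for r in rows if r.get(column) is None]

def duplicate_keys(rows, column):
    seen, dupes = set(), set()
    for r in rows:
        key = r[column]
        (dupes if key in seen else seen).add(key)
    return dupes

def orphan_rows(rows, column, valid_keys):
    return [r for r in rows if r[column] not in valid_keys]

def run_checks(orders, customer_ids, min_rows=1):
    """Return the names of failed checks; an empty list means the data passed."""
    failures = []
    if len(orders) < min_rows:
        failures.append("row_count")
    if null_violations(orders, "order_id"):
        failures.append("null_key")
    if duplicate_keys(orders, "order_id"):
        failures.append("duplicate_key")
    if orphan_rows(orders, "customer_id", customer_ids):
        failures.append("orphan_customer")
    return failures

good = [{"order_id": 1, "customer_id": 10}, {"order_id": 2, "customer_id": 11}]
bad = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 1, "customer_id": 10},
    {"order_id": None, "customer_id": 99},
]
print(run_checks(good, {10, 11}), run_checks(bad, {10, 11}))
```

Wired into CI or the orchestrator, a non-empty failure list is what stops the build.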

Data contracts

The schema is an API between the producer and every downstream consumer. A contract makes that explicit: change the upstream shape and CI fails before it breaks fifty dashboards. The fix is social as much as technical — someone has to own the boundary.

System design curriculum

External dbt: model contracts External Designing Data-Intensive Applications

Lineage & cataloging

When a number is wrong, the first question is "what feeds this," and without lineage the answer is a Slack archaeology dig. A catalog maps table-to-table dependencies so you can trace upstream to the source and downstream to the blast radius.

External dbt: lineage

External OpenLineage docs External Airflow documentation

Observability & freshness

The dashboard is two days stale and nobody noticed — that is the failure mode observability catches. Track freshness, volume, and distribution over time, and alert on the anomaly, not just on the job exit code. A green DAG is not a healthy table.

External dbt: source freshness

External Great Expectations docs System design curriculum

Stage

Distributed data & partitioning

Sharding, replication, consensus — when data outgrows one box.

Every system in this roadmap eventually splits data across machines, and the same handful of ideas decide whether it works: how you partition keys so load spreads evenly, how you replicate so a dead node is not a dead service, and how nodes agree on the truth. This is the distributed-systems core that data engineering keeps reusing.

Core

Sharding & partition keys

Split the data, route by key, and pray the key distributes evenly. A bad shard key gives you one hot partition doing all the work while the rest idle. The choice is permanent in practice — resharding a live system is its own engineering project.

Sim Database sharding simulator

Distributed systems curriculum External Designing Data-Intensive Applications

Core

Consistent hashing

Naive modulo sharding remaps almost every key when you add a node. Consistent hashing on a ring moves only the keys near the new node. It is what makes Cassandra, DynamoDB, and most distributed caches able to scale without a full reshuffle.

Sim Consistent hashing simulator

Cassandra replication Distributed systems curriculum
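A hash ring with virtual nodes takes about twenty lines. A sketch assuming MD5 as the position hash and 50 virtual nodes per physical node (both arbitrary choices for illustration); node and key names are made up.

```python
import bisect
import hashlib

def _position(value):
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes=50):
        self.vnodes = vnodes
        self.ring = []                         # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_position(f"{node}#{i}"), node))

    def owner(self, key):
        # First virtual node clockwise from the key's position, wrapping around.
        i = bisect.bisect(self.ring, (_position(key), ""))
        return self.ring[i % len(self.ring)][1]

keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}
ring.add("node-d")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d")
```

Every key that changes owner moves to the new node; nothing shuffles between the existing nodes. With naive modulo sharding, going from three nodes to four would remap most keys.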

Replication strategies

Leader-follower is simple and lags; multi-leader and leaderless trade simplicity for availability and bring conflict resolution as homework. Pick by whether you can tolerate stale reads — and remember replication lag is a feature you measure, not a bug you fix.

Cassandra replication

Sim Read/write quorum External Designing Data-Intensive Applications

Distributed ID generation

Auto-increment dies the moment you shard. Snowflake IDs pack a timestamp, a machine ID, and a sequence into 64 bits so every node mints unique, roughly-sortable IDs without coordinating. The classic interview problem with a clean answer.

Distributed IDs

Distributed IDs / Snowflake Distributed systems curriculum
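The bit packing behind a Snowflake-style ID, sketched with the commonly cited 41/10/12 split of timestamp, machine ID, and sequence. The epoch constant is an arbitrary assumption for the example, and the clock is injectable so the behavior can be tested deterministically.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000          # arbitrary custom epoch (assumption)
MACHINE_BITS, SEQUENCE_BITS = 10, 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

class SnowflakeGenerator:
    def __init__(self, machine_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id out of range")
        self.machine_id = machine_id
        self.clock = clock
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:             # 4096 IDs used: wait for the next ms
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (MACHINE_BITS + SEQUENCE_BITS))
                | (self.machine_id << SEQUENCE_BITS)
                | self.sequence
            )

def decode(snowflake_id):
    sequence = snowflake_id & MAX_SEQUENCE
    machine_id = (snowflake_id >> SEQUENCE_BITS) & ((1 << MACHINE_BITS) - 1)
    timestamp = snowflake_id >> (MACHINE_BITS + SEQUENCE_BITS)
    return timestamp, machine_id, sequence

generator = SnowflakeGenerator(machine_id=7)
print(decode(generator.next_id()))
```

Because the timestamp occupies the high bits, IDs sort roughly by creation time across machines, and no two machines with distinct IDs can collide.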

Stage

ML fundamentals

Enough math and training loop to know what you are shipping.

You do not need to derive backprop to ship ML, but you need to know what a model is doing well enough to debug it in production. This stage is the working mental model: supervised vs unsupervised, what training actually optimizes, why a model that scores 99% offline can still be useless, and how text becomes vectors a model can read.

Core

The learning loop

Data in, loss measured, parameters nudged downhill, repeat. Gradient descent is the engine under nearly everything. The intuition that matters: the model is fitting a function to examples, and it will happily fit the noise if you let it.

External scikit-learn: ML basics

External scikit-learn user guide External Google: ML crash course
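The learning loop in its smallest form: fit a line to five points with plain gradient descent. A sketch in pure Python with hand-derived gradients of mean squared error; the data, learning rate, and step count are arbitrary choices for illustration.

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]           # the "true" function the model should recover

def mse(w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w, b, lr = 0.0, 0.0, 0.05
initial_loss = mse(w, b)

for _ in range(2000):
    # Gradients of mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w                   # nudge parameters downhill
    b -= lr * grad_b

print(round(w, 3), round(b, 3), mse(w, b) < initial_loss)
```

The data here is noise-free, so the loop recovers the line. Add noise and enough parameters and the same loop will fit the noise too.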

Core

Supervised vs unsupervised

Labeled data and a target to predict, or unlabeled data and structure to find. Classification, regression, clustering, dimensionality reduction. Most of the value at most companies is boring supervised learning on tabular data, not the flashy stuff.

External scikit-learn: supervised learning

External scikit-learn: unsupervised learning External Google: ML crash course

Core

Overfitting, bias & variance

A model that nails the training set and flops on new data has memorized, not learned. The bias-variance trade is the whole tension: too simple and it underfits, too flexible and it overfits. Regularization and a held-out test set keep you honest.

External scikit-learn: cross-validation

External scikit-learn: model evaluation External Google: ML crash course

Embeddings & vectors

Turn words, users, or items into points in a high-dimensional space where "close" means "similar." It is the representation under search, recommendations, and every RAG pipeline. Cosine similarity on good embeddings does a shocking amount of work.

Sim Vector embedding simulator

External scikit-learn: feature extraction Sim Vector embeddings
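Cosine similarity is one line of arithmetic. A sketch with hand-written three-dimensional vectors; real embeddings come from a trained model and have hundreds of dimensions, so the numbers below are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings; a real system would look these up from a model.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

def nearest(query):
    others = (word for word in embeddings if word != query)
    return max(others, key=lambda w: cosine_similarity(embeddings[query], embeddings[w]))

print(nearest("cat"))
```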

Evaluation metrics

Accuracy is a trap on imbalanced data — predict "no fraud" every time and score 99%. Precision, recall, F1, AUC, and a confusion matrix tell you what the model actually does. Pick the metric that maps to the cost of being wrong.

External scikit-learn: model evaluation

External scikit-learn: classification metrics External Google: ML crash course
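The accuracy trap in numbers: 10 fraud cases in 1,000 transactions, scored by hand. A pure-Python sketch with made-up predictions; scikit-learn provides the same metrics, but the arithmetic is worth seeing once.

```python
def classification_report(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1] * 10 + [0] * 990                       # 1% fraud
always_no = [0] * 1000                              # never flags anything
detector = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970 # catches 8 of 10, 20 false alarms

print(classification_report(y_true, always_no))
print(classification_report(y_true, detector))
```

The do-nothing model wins on accuracy and catches zero fraud. The detector scores lower on accuracy and is the only one of the two doing the job.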

Stage

Feature engineering & feature stores

Where a lot of the accuracy comes from, and serving it consistently.

In real systems, better features beat fancier models almost every time — and the hard part is not computing them, it is serving the exact same feature to training (batch, historical) and inference (online, low-latency) without skew. The feature store exists to solve that one brutal consistency problem.

Core

Feature engineering

Encoding categoricals, scaling numerics, building interactions and aggregates that expose signal the model cannot find raw. Unglamorous and decisive — the team that engineers features carefully usually beats the team chasing a bigger architecture.

External scikit-learn: preprocessing

External scikit-learn: feature extraction External Google: ML crash course

Core

Training/serving skew

The model trains on yesterday's batch-computed feature and gets served a slightly different online-computed one (same name, different value), and accuracy quietly tanks. This skew is one of the most common silent ML production bugs. Compute the feature once, read it twice.

External Feast: feature store concepts

External Google: ML crash course System design curriculum

Core

Online vs offline stores

Offline store is the warehouse — big, historical, feeds training. Online store is a low-latency key-value layer — Redis-shaped, feeds inference at p99. The feature store keeps them in lockstep so the same feature means the same thing in both.

How Redis works

External Feast documentation Sim Distributed cache simulator

Point-in-time correctness

When you build training data, every feature must reflect what was known at that moment — not the future. Join a label from Tuesday to a feature computed Thursday and you leak the answer into training. Point-in-time joins are fiddly and non-negotiable.

External Feast: point-in-time joins

Sim SQL joins simulator External Feast: feature store concepts
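A point-in-time join is an as-of lookup: for each label, take the latest feature value computed at or before the label's timestamp. A sketch with bisect over a hypothetical per-user feature history; it illustrates the rule and is not Feast's API.

```python
import bisect

# Feature history per user: sorted list of (computed_at, value). Hypothetical data.
feature_history = {"u1": [(1, 0.2), (5, 0.7), (9, 0.9)]}

def feature_as_of(user, ts):
    """Latest feature value computed at or before ts, or None if none existed yet."""
    history = feature_history.get(user, [])
    times = [computed_at for computed_at, _ in history]
    i = bisect.bisect_right(times, ts)
    return history[i - 1][1] if i else None

# (user, label_time, label)
labels = [("u1", 0, 1), ("u1", 4, 0), ("u1", 5, 1), ("u1", 7, 0)]
training_rows = [(user, ts, feature_as_of(user, ts), y) for user, ts, y in labels]
for row in training_rows:
    print(row)
```

The label at time 7 gets the value computed at 5, never the one from 9. A plain join on user alone would hand every label the newest value and leak the future into training.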

Stage

Model training & experimentation

Frameworks, tracking, and making a result reproducible.

Training a model once in a notebook is easy. Training it again next month and getting the same number — same data, same seed, same hyperparameters, tracked — is the part that makes ML an engineering discipline instead of a science-fair project. This stage is the frameworks plus the experiment hygiene around them.

Core

Training frameworks

scikit-learn for classic tabular, PyTorch or TensorFlow for deep nets. Start with the simplest thing that fits the problem — a gradient-boosted tree beats a neural net on most tabular data and trains in seconds, not GPU-hours.

External PyTorch documentation

External scikit-learn user guide External TensorFlow guide

Core

Experiment tracking

Log every run's params, metrics, and artifacts so "which version of this model is in prod and what trained it" has an answer. MLflow turns a folder of mystery notebooks into a queryable history. Untracked experiments are experiments you cannot trust.

External MLflow: tracking

External MLflow documentation External Weights & Biases docs

Hyperparameter tuning

Grid search is exhaustive and wasteful, random search is shockingly competitive, Bayesian methods are smarter on expensive runs. The real win is a clean cross-validation loop so the tuned number actually generalizes instead of overfitting the validation set.

External scikit-learn: hyperparameter tuning

External scikit-learn: cross-validation External MLflow documentation

Reproducibility

Pin the data version, the seed, the library versions, the environment. The result you cannot reproduce is a rumor, not a finding — and six months later "we got 94% once" helps nobody. Version data like you version code.

External MLflow: projects

External DVC documentation System design curriculum

Stage

Model serving & MLOps

Getting a model live, keeping it healthy, and knowing when it rots.

A trained model is a binary doing nothing until it is behind an endpoint answering requests under a latency budget — and then the hard part starts. The world drifts, the model goes stale, and you need to notice before the business does. MLOps is the production discipline: serve it, monitor it, and retrain it on a loop.

Core

Model serving

Wrap the model in an API behind a load balancer and meet a p99 latency budget. Batch inference for offline scoring, online for real-time. The same scaling, caching, and tail-latency problems as any backend service — the model is just an unusually heavy function.

System design curriculum

External MLflow: model serving External TensorFlow Serving

Core

Model registry & deployment

A registry versions models like a package repo: staging, production, rollback. Deploy with shadow traffic or a canary so a bad model fails to a few users, not all of them. Treat the model artifact as a release, with the same discipline as code.

External MLflow: model registry

External MLflow: models System design curriculum

Core

Monitoring & drift detection

The model does not crash when it goes wrong — it quietly gets less right as the input distribution drifts away from training. Monitor prediction distributions and feature stats, not just latency. The scariest ML failure is the one with a green dashboard.

External Google: ML production monitoring

External MLflow documentation System design curriculum
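One common way to put a number on input drift is the Population Stability Index, which compares the binned distribution of a feature at training time with the live one. A sketch in pure Python; the bin edges, the sample data, and the 0.25 alert threshold (a widely used rule of thumb, not a standard) are assumptions.

```python
import math

def psi(expected, actual, edges):
    """Population Stability Index between two samples over shared bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1    # index of the bin holding v
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

edges = [0.25, 0.5, 0.75]
training = [i / 100 for i in range(100)]               # spread evenly over [0, 1)
live_same = list(training)
live_shifted = [0.5 + i / 200 for i in range(100)]     # everything moved to [0.5, 1)

print(psi(training, live_same, edges))
print(psi(training, live_shifted, edges) > 0.25)       # would trip the alert
```

Run per feature on a schedule, this catches the case where latency and error rate look fine while the inputs have quietly moved.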

Retraining pipelines

A model in production is a perishable asset. Wire the training job into the same orchestrator as your data pipelines so retraining is scheduled, tested, and gated — not a hero running a notebook the week accuracy fell off a cliff.

External Airflow documentation

External MLflow: pipelines System design curriculum
