14 stages · 70 topics · 34 core
Roadmap

Become a data engineer.

All stages, in order. The full arc from SQL fundamentals through storage engines, batch and streaming pipelines, the warehouse, and out the far end into feature stores and model serving. This is the broad track: you end up able to land data anywhere it needs to go and reason about every layer it passes through. Each topic links to a Semicolony deep dive or simulator where one exists, and to a curated external resource where it doesn't. Follow the arc in order, or jump to wherever you're stuck.

Foundations · Storage · Pipelines · Serving · ML (stages 01 to 14, from start to data engineer)
Core (the spine) · Recommended (strong upside) · Optional (pick if relevant)


Core plus the recommended layer. The optional stops stay hidden until you have shipped a couple of real systems.


Jump to a stage

01
Stage

Data foundations & SQL

The language all data work starts with, and the model under it.

Everything downstream is built on relational thinking, even the systems that claim to have escaped it. Before Spark or feature stores mean anything, you need SQL in your fingers and a real picture of how the engine turns a declarative query into a plan: scans, joins, aggregations, and the order they actually run in.
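A minimal sketch of that last point, using Python's built-in sqlite3 module. The `orders` table and its rows are made up for illustration; the numbered comments show the logical order the clauses are evaluated in, which differs from the order you write them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 30.0), ("ana", 5.0), ("bo", 80.0), ("bo", 2.0), ("cy", 4.0)],
)

query = """
    SELECT customer, SUM(amount) AS total   -- 5. project the output columns
    FROM orders                             -- 1. scan the table
    WHERE amount >= 5                       -- 2. filter individual rows
    GROUP BY customer                       -- 3. aggregate per customer
    HAVING SUM(amount) > 20                 -- 4. filter whole groups
    ORDER BY total DESC                     -- 6. sort the result
"""

rows = conn.execute(query).fetchall()

# Ask the engine how it intends to execute the declarative query.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step[-1])  # human-readable plan step; wording varies by SQLite version
```

WHERE drops the small orders before any grouping happens, so `cy` never forms a group at all, while HAVING runs afterwards on the aggregated totals.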

02
Stage

Relational databases & modeling

Schemas, indexes, transactions — the engine behind the SQL.

A query is only as good as the schema it runs against and the indexes backing it. This stage is about the relational engine itself: how a B-tree index turns a 2-second scan into a 2 ms lookup, what ACID actually guarantees, and how to model entities so the database does the work instead of fighting you.
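You can watch an index change the plan without measuring anything. The sketch below uses SQLite through Python's sqlite3 module; the `users` table and the `idx_users_email` index name are hypothetical, and the exact plan wording depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(10_000)],
)

def plan_for(sql: str) -> str:
    """Return the engine's plan for a query as one readable string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

lookup = "SELECT name FROM users WHERE email = 'user4242@example.com'"

before = plan_for(lookup)   # no index on email: the engine scans every row
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan_for(lookup)    # B-tree index on email: the engine searches it instead

print("before:", before)
print("after: ", after)
```

The query text never changes. Only the physical structure behind it does, which is why schema and index design sit underneath everything you write in SQL.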

03
Stage

NoSQL & data stores

Key-value, wide-column, document — and when to reach for each.

NoSQL is not "no schema," it is "schema you enforce in the application instead of the engine." Each family makes a specific trade: Redis trades durability for latency by keeping data in memory, Cassandra trades joins for linear write scaling, and document stores trade rigid schemas for flexible record shapes. Pick by access pattern, not by hype.
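Here is roughly what "schema you enforce in the application" looks like. This is a toy sketch: the dict stands in for a document or key-value store, and `REQUIRED_FIELDS`, `validate`, and `put` are hypothetical names, not any real client API.

```python
# Stand-in for a schemaless store: it will accept a document of any shape.
store = {}

# The schema lives in application code instead of in the engine.
REQUIRED_FIELDS = {"user_id": int, "email": str}

def validate(doc: dict) -> dict:
    """Reject documents that are missing required fields or have the wrong types."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(doc.get(field), expected_type):
            raise ValueError(f"{field!r} must be present and of type {expected_type.__name__}")
    return doc

def put(key: str, doc: dict) -> None:
    """Write path: every writer must go through validation, or the store drifts."""
    store[key] = validate(doc)

# Extra fields are fine; that flexibility is the point of the document model.
put("u1", {"user_id": 1, "email": "a@example.com", "plan": "pro"})
```

The catch is that the engine will not stop a writer that skips `put` and writes directly, so the guarantee is only as strong as your discipline about the write path.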

04
Stage

Storage engines & file formats

LSM vs B-tree, row vs column, and how bytes hit disk.

Under every database is a storage engine making one big bet: optimize for writes or for reads. And under every data lake is a file format making another: row-oriented for transactions, column-oriented for scans. These two choices explain more about performance than any amount of query tuning.
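The row-versus-column choice is easy to see with plain Python lists. This is only an illustration of the two layouts in memory, not of any real file format, and the records are made up.

```python
# Row-oriented: each record is stored together. Good for "fetch or update one order".
rows = [
    (1, "ana", 30.0),
    (2, "bo", 80.0),
    (3, "cy", 4.0),
]

def to_columns(records, names):
    """Pivot row tuples into one list per column."""
    return {name: [record[i] for record in records] for i, name in enumerate(names)}

# Column-oriented: each column is stored together. Good for "aggregate one field".
columns = to_columns(rows, ["id", "customer", "amount"])

# Point lookup: the row layout hands back the whole record in one access.
order_2 = rows[1]

# Analytical scan: the row layout has to walk every full record...
total_from_rows = sum(record[2] for record in rows)

# ...while the column layout reads only the one column the query needs.
total_from_columns = sum(columns["amount"])
```

On disk the same asymmetry shows up as bytes read: a columnar file can skip the columns a query never mentions, and values of one type stored next to each other tend to compress better.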

05
Stage

Batch processing

Spark, the DAG, and moving compute to the data.

When the dataset stops fitting on one machine, you stop iterating over it and start describing a transformation graph that a cluster executes for you. Spark is the default. The whole game is understanding the partition (how data is split across executors), because most slow Spark jobs turn out to be partitioning problems in disguise.
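To see why partitioning dominates, here is a sketch of hash partitioning in plain Python. It does not use Spark; `partition_of` and the key sets are hypothetical, and CRC32 is just a convenient deterministic hash for the demo.

```python
import zlib
from collections import Counter

def partition_of(key: str, num_partitions: int) -> int:
    """Assign a key to a partition by hashing it (deterministic across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_sizes(keys, num_partitions: int):
    """Count how many records land in each partition."""
    counts = Counter(partition_of(key, num_partitions) for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# 10,000 records spread over many distinct keys: load is roughly even.
even_keys = [f"user-{i}" for i in range(10_000)]

# 10,000 records where one hot key accounts for 90% of the data.
skewed_keys = ["user-0"] * 9_000 + [f"user-{i}" for i in range(1_000)]

even_sizes = partition_sizes(even_keys, 8)
skewed_sizes = partition_sizes(skewed_keys, 8)

print("even:  ", even_sizes)
print("skewed:", skewed_sizes)
```

Every record with the same key lands in the same partition, so a hot key means one worker does most of the work while the rest sit idle, and the job finishes only when that slowest partition does.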

06
Stage

Streaming

Kafka, Flink, and data that never stops arriving.

Batch asks "what happened yesterday." Streaming asks "what is happening now," and that small change makes everything harder: events arrive late, out of order, or twice, and your window has to close before all the data shows up. Kafka is the durable log everyone agrees on; Flink and friends do the math on top.
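The late, out-of-order, and duplicate cases can be sketched in a few lines. This is a toy event-time aggregator in plain Python, not Flink's or Kafka's API; the window size, the allowed lateness, and the function name are all assumptions made for the example.

```python
from collections import defaultdict

WINDOW = 60            # tumbling window size, in seconds of event time
ALLOWED_LATENESS = 30  # how far the watermark trails the newest event seen

def window_counts(events):
    """Count events per window. `events` is (event_id, event_time) in arrival order."""
    seen_ids = set()
    open_windows = defaultdict(int)   # window start -> count so far
    closed = {}                       # window start -> final count
    dropped = []                      # events that arrived after their window closed
    watermark = float("-inf")

    for event_id, event_time in events:
        if event_id in seen_ids:      # delivered twice: count it once
            continue
        seen_ids.add(event_id)

        start = event_time - event_time % WINDOW
        if start + WINDOW <= watermark:   # too late: the window is already final
            dropped.append(event_id)
            continue
        open_windows[start] += 1

        # Advance the watermark, then close every window that ends at or before it.
        watermark = max(watermark, event_time - ALLOWED_LATENESS)
        for s in [s for s in open_windows if s + WINDOW <= watermark]:
            closed[s] = open_windows.pop(s)

    return closed, dict(open_windows), dropped

events = [("a", 5), ("b", 50), ("a", 5), ("c", 70), ("d", 40), ("e", 100), ("f", 10)]
closed, still_open, dropped = window_counts(events)
```

Event `d` is out of order but inside the lateness budget, so it still counts. Event `f` shows up after its window has been finalized and is dropped. Where to set that budget is the trade between result latency and completeness.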

07
Stage

Data pipelines & orchestration

Airflow, dbt, and making jobs run in the right order, reliably.

A pipeline is a set of tasks with dependencies, run on a schedule, that must survive partial failure and rerun cleanly. Orchestration is how you express that graph and recover when step 4 of 9 dies at 3am. dbt owns the transform layer; Airflow and friends own the schedule and the retries.
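Stripped of the scheduler, the core of orchestration is a dependency graph, an execution order, and a retry policy. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); `run_pipeline` and the task names are hypothetical and are not Airflow's or dbt's API.

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps, max_retries=2):
    """Run tasks in dependency order, retrying each failed task a few times.

    tasks: task name -> zero-argument callable
    deps:  task name -> set of upstream task names that must finish first
    Returns a log of (task name, attempt number that succeeded).
    """
    order = list(TopologicalSorter(deps).static_order())
    log = []
    for name in order:
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.append((name, attempt))
                break
            except Exception:
                if attempt == max_retries + 1:
                    raise  # out of retries: fail loudly and stop downstream tasks
    return log

# A task that fails once and then succeeds, standing in for a transient error.
calls = {"transform": 0}

def flaky_transform():
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise RuntimeError("transient failure")

tasks = {"extract": lambda: None, "transform": flaky_transform, "load": lambda: None}
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

log = run_pipeline(tasks, deps)
```

Retries are only safe when each task is idempotent, meaning a second run of `transform` leaves the same result as the first. That property is what "rerun cleanly" comes down to.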

08
Stage

Data warehousing & lakehouse

Snowflake, BigQuery, and the table format that merged lake and warehouse.

The warehouse is where data goes to be queried by humans and dashboards: columnar, with storage separated from compute, and modeled for analytics rather than transactions. The lakehouse is the newer move: keep cheap object storage but bolt warehouse guarantees on top with an open table format. This stage covers both worlds.

09
Stage

Data quality, governance & observability

Tests, contracts, and lineage — catching bad data early.

A pipeline that runs green while producing garbage is worse than one that fails loudly. This stage is the discipline that separates a data platform you trust from a pile of jobs you babysit: tests on the data itself, contracts at the boundaries, lineage to trace the blast radius, and observability to catch silent drift.
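"Tests on the data itself" can start very small. The sketch below checks a batch of rows for nulls, duplicates, and out-of-range values; the column names and the `check_batch` helper are hypothetical, and a real platform would more likely express these as dbt tests or a validation library's rules.

```python
def check_batch(rows):
    """Return a list of human-readable failures for a batch of order rows."""
    failures = []

    ids = [row.get("order_id") for row in rows]
    if any(i is None for i in ids):
        failures.append("order_id contains nulls")

    non_null_ids = [i for i in ids if i is not None]
    if len(non_null_ids) != len(set(non_null_ids)):
        failures.append("order_id is not unique")

    if any(row.get("amount") is None or row["amount"] < 0 for row in rows):
        failures.append("amount must be non-negative")

    return failures

good = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 0.0}]
bad = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": -5.0},     # duplicate id and negative amount
    {"order_id": None, "amount": 3.0},   # missing id
]

good_failures = check_batch(good)
bad_failures = check_batch(bad)
```

The useful part is what you do with a non-empty result: stop the pipeline before the bad batch is published, so the job fails loudly instead of running green over garbage.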

10
Stage

Distributed data & partitioning

Sharding, replication, consensus — when data outgrows one box.

Every system in this roadmap eventually splits data across machines, and the same handful of ideas decide whether it works: how you partition keys so load spreads evenly, how you replicate so a dead node is not a dead service, and how nodes agree on the truth. This is the distributed-systems core that data engineering keeps reusing.
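One of those recurring ideas, consistent hashing, fits in a short sketch. The `HashRing` class, the node names, and the choice of 100 virtual nodes per machine are assumptions for the example; it compares how many keys change owner when a fifth node joins, against naive `hash % N` placement.

```python
import bisect
import hashlib

def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node owns many points on the ring so load spreads evenly.
        self._ring = sorted(
            (stable_hash(f"{node}#{v}"), node) for node in nodes for v in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first ring point clockwise from its hash.
        i = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[i][1]

keys = [f"key-{i}" for i in range(5_000)]
old_ring = HashRing([f"node-{n}" for n in range(4)])
new_ring = HashRing([f"node-{n}" for n in range(5)])   # node-4 joins

moved_ring = [k for k in keys if old_ring.node_for(k) != new_ring.node_for(k)]
moved_modulo = [k for k in keys if stable_hash(k) % 4 != stable_hash(k) % 5]

print(f"consistent hashing moved {len(moved_ring)} keys, modulo moved {len(moved_modulo)}")
```

With the ring, the only keys that move are the ones the new node takes over. With modulo placement, most keys change owner, which in a real cluster means most of the data is reshuffled for one added machine.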

11
Stage

ML fundamentals

Enough math and training loop to know what you are shipping.

You do not need to derive backprop to ship ML, but you need to know what a model is doing well enough to debug it in production. This stage is the working mental model: supervised vs unsupervised, what training actually optimizes, why a model that scores 99% offline can still be useless, and how text becomes vectors a model can read.
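The "99% offline and still useless" case is worth working through once by hand. The numbers below are invented: 1,000 examples of which 10 are positive (say, fraud), scored by a hypothetical model that always predicts the negative class.

```python
# 10 positives, 990 negatives: a heavily imbalanced dataset.
labels = [1] * 10 + [0] * 990

# A "model" that has learned nothing and always predicts the majority class.
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
actual_positives = sum(labels)
recall = true_positives / actual_positives   # share of real positives the model caught

print(f"accuracy={accuracy:.2%} recall={recall:.2%}")
```

Accuracy rewards the model for the 990 easy negatives, while recall shows it caught none of the cases you built it to find. Choosing the metric that matches the business question is part of knowing what you are shipping.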

12
Stage

Feature engineering & feature stores

Where a lot of the accuracy comes from, and how to serve it consistently.

In real systems, better features beat fancier models almost every time — and the hard part is not computing them, it is serving the exact same feature to training (batch, historical) and inference (online, low-latency) without skew. The feature store exists to solve that one brutal consistency problem.
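The consistency problem is easier to see with timestamps. Below is a toy in-memory sketch; `ToyFeatureStore` and its methods are hypothetical and not the API of any real feature store. Training reads the value as it was at the label's timestamp, and serving reads the latest value.

```python
import bisect

class ToyFeatureStore:
    """Keeps the full history of one feature per entity, ordered by timestamp."""

    def __init__(self):
        self._history = {}  # entity -> (sorted timestamps, matching values)

    def write(self, entity, timestamp, value):
        times, values = self._history.setdefault(entity, ([], []))
        i = bisect.bisect_right(times, timestamp)  # keep history sorted by time
        times.insert(i, timestamp)
        values.insert(i, value)

    def as_of(self, entity, timestamp):
        """Offline path: the value that was known at `timestamp` (point-in-time read)."""
        times, values = self._history.get(entity, ([], []))
        i = bisect.bisect_right(times, timestamp)
        return values[i - 1] if i else None

    def latest(self, entity):
        """Online path: the most recent value, for low-latency inference."""
        _, values = self._history.get(entity, ([], []))
        return values[-1] if values else None

fs = ToyFeatureStore()
fs.write("user-1", 10, 1)   # purchases_last_7d was 1 at t=10
fs.write("user-1", 20, 5)   # and 5 at t=20

training_value = fs.as_of("user-1", 15)   # label was observed at t=15
serving_value = fs.latest("user-1")
```

If training had joined the latest value (5) onto a label from t=15, the model would have learned from information that did not exist yet. That leak is one common source of training-serving skew.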

13
Stage

Model training & experimentation

Frameworks, tracking, and making a result reproducible.

Training a model once in a notebook is easy. Training it again next month and getting the same number — same data, same seed, same hyperparameters, tracked — is the part that makes ML an engineering discipline instead of a science-fair project. This stage is the frameworks plus the experiment hygiene around them.
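The hygiene can be sketched without any ML framework. In this stand-in, `run_experiment` is a hypothetical function whose "training" is just a seeded computation; the point is the shape: a local seeded generator, a config that fully describes the run, and an id derived from that config.

```python
import hashlib
import json
import random

def run_experiment(config: dict) -> dict:
    """Run one 'training' job that is fully determined by its config."""
    rng = random.Random(config["seed"])   # local generator, no hidden global state

    # Stand-in for loading data and training: any seeded computation works here.
    data = [rng.gauss(0.0, 1.0) for _ in range(config["n_samples"])]
    metric = sum(x * x for x in data) / len(data)

    # Derive the run id from the config so identical configs map to the same id.
    run_id = hashlib.sha256(json.dumps(config, sort_keys=True).encode("utf-8")).hexdigest()[:12]
    return {"run_id": run_id, "config": config, "metric": metric}

config = {"seed": 7, "n_samples": 1_000, "learning_rate": 0.01}

first = run_experiment(config)
second = run_experiment(dict(config))            # "next month", same config
other = run_experiment({**config, "seed": 8})    # one tracked change
```

A real setup also has to pin the data version, the code version, and the library versions, since a seed alone does not make a run reproducible. An experiment tracker is essentially a durable table of records like the one returned here.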

14
Stage

Model serving & MLOps

Getting a model live, keeping it healthy, and knowing when it rots.

A trained model is a binary doing nothing until it is behind an endpoint answering requests under a latency budget — and then the hard part starts. The world drifts, the model goes stale, and you need to notice before the business does. MLOps is the production discipline: serve it, monitor it, and retrain it on a loop.
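Noticing drift usually starts with comparing the distribution of a feature at training time against what the endpoint sees now. One common choice is the population stability index (PSI), sketched below in plain Python. The bin count, the synthetic data, and the 0.2 alert threshold are assumptions; 0.2 is a widely used rule of thumb, not a guarantee.

```python
import math

def psi(expected, actual, bins=10):
    """Population stability index between a baseline sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # avoid zero width if the baseline is constant

    def fractions(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1   # clamp out-of-range values to edge bins
        # Floor each fraction so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(1_000)]   # feature values seen at training time
live_same = list(baseline)                   # production looks like training
live_shifted = [v + 3 for v in baseline]     # the world moved

DRIFT_THRESHOLD = 0.2   # assumed alerting threshold (rule of thumb)

no_drift = psi(baseline, live_same)
drift = psi(baseline, live_shifted)
should_retrain = drift > DRIFT_THRESHOLD
```

Input drift is a proxy. It tells you the model is being asked about data it was not trained on, not that its predictions are wrong, so it works best alongside delayed ground-truth metrics once labels arrive.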