07 / 09

GCP / 07

BigQuery

BigQuery is what happens when a database gives up on indexes, gives up on servers you can name, and bets everything on one idea: if you store columns instead of rows and can throw a few thousand workers at a scan for a few seconds, you never need to seek — you just read less. This page goes under the hood: the Dremel lineage, Colossus and Capacitor, what a slot actually is, how a query becomes a tree of stages joined by shuffles, why partitioning and clustering replaced indexing, what the per-terabyte bill really measures, and a bq lab at the end that proves the pruning with real byte counts.

Where BigQuery comes from

In 2010, Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets. Dremel had been running inside Google since 2006 as the tool engineers used to run ad-hoc SQL over trillion-row log tables and get answers in seconds, not the hours a MapReduce job took. The paper described two ideas that have shaped analytics ever since. The first was a columnar storage format for nested records — the repetition-and-definition-level encoding that later became the basis of Parquet. The second was an execution model that fanned a single query out across thousands of machines at once, each reading a small piece of the data, with results aggregated back up a tree.

BigQuery is the productised descendant of that system. Google announced it in 2010, opened it up in 2011, and made it generally available in 2012, and it has stayed unusual among cloud warehouses in one specific way: it is serverless. There is no cluster to create, size, start, or stop. You send SQL to an endpoint; the service decides how many workers the query gets, runs it, and bills you for either the bytes it read or the compute time it consumed. Compare that with the classic warehouse model — provision a cluster, load data into it, watch it sit idle at night — and the difference is structural, not cosmetic. Nothing in BigQuery is "yours" except the data and the bill.

That model has consequences in both directions. You never pay for idle, you never capacity-plan for a spiky Monday-morning dashboard load, and a brand-new project can query a petabyte-scale public dataset thirty seconds after it is created. In exchange, you give up the knob: you cannot pick instance types, you cannot pin a query to hardware, and the units you reason about are bytes scanned and slots consumed rather than machines. Most of this page is about learning to think in those units.

The architecture: storage, compute, and the network between them

BigQuery separates storage from compute completely. Not "compute nodes with attached disks plus an object-store tier" — completely. Table data lives in Colossus, Google's distributed file system and the successor to GFS, which is the same storage layer that sits underneath Cloud Storage. Queries run on workers that own no data at all. The two halves are connected by Jupiter, Google's datacenter network fabric, which the 2015 paper described as delivering around a petabit per second of bisection bandwidth inside a datacenter. That number is the load-bearing fact of the whole design: when any worker can read any byte of storage at something close to local-disk speed, data locality stops being a constraint, and you can schedule compute wherever it happens to be free.

On the storage side, tables are encoded in Capacitor, the columnar format that replaced the original ColumnIO format from the Dremel paper in 2016. Each column is stored separately, dictionary- and run-length-encoded, with statistics kept per block. Capacitor goes further than most formats: during encoding it evaluates different row orderings and column-specific encodings to find a layout that compresses well for the actual data, because Google's economics say it is worth spending extra CPU once at write time to make every future read cheaper. Files are immutable. A write produces new files; a background storage optimizer continuously compacts, re-sorts, and re-encodes them. Deletes and updates are handled with the same machinery, which is also what makes time travel — querying a table as it was up to seven days ago — nearly free to offer.

On the compute side sit slots, which get their own section below, carved out of Borg, Google's cluster manager. The workers are stateless in the sense that matters: nothing about your table lives with them, so the scheduler can hand your query two hundred workers this second and a different two hundred the next, and nothing breaks.

Full disaggregation. Slots own no data; Colossus runs no queries; Jupiter makes remote reads fast enough that the split costs almost nothing.

The separation pays off in ways you feel directly. Storage is priced like object storage, a couple of cents per GB, and grows without anyone resizing anything. Many teams can query the same table at once without copies, because no compute node "owns" it. And a table shared across an organisation is just a grant, not a data transfer.

What a slot actually is

A slot is BigQuery's unit of compute capacity: roughly one worker thread with a slice of CPU and memory attached. It is not a VM you can inspect, and not a container you can size. It is closer to a scheduling token — the right to have one unit of work executing at a given moment. When the documentation says a query "used 4,000 slot-seconds", it means the equivalent of 4,000 of those workers running for one second, or 40 running for 100 seconds; the scheduler decides the shape.

Scheduling works like this. The query planner compiles your SQL into stages (next section), and each stage breaks into many independent tasks — typically one per chunk of input data. Tasks queue up, and a fair scheduler assigns them to slots from your project's pool. An on-demand project draws from a shared pool with a soft ceiling of about 2,000 concurrent slots, with transient bursts above that when the service has spare capacity. The fairness is two-level: concurrent queries within a project get even shares of the pool, and stages within a query get slots according to how much runnable work they have. Nothing about this is static. As a downstream stage starts producing work, slots drain from the finished upstream stage and move to it, query by query, second by second.

Two properties of this design are worth holding onto. First, slot shortage degrades, never fails: if a query has 10,000 runnable tasks and 500 available slots, the tasks queue and the query simply takes longer. There is no "cluster too small" error. Second, the planner is adaptive during execution. BigQuery can change a stage's parallelism mid-flight, and if it detects skew — one shuffle partition far larger than the rest — it can split that partition and re-spread the work. You see the evidence in the query plan afterwards: stages report how their time split across waiting, reading, computing, and writing, which is the first place to look when a query is slower than it should be.

Slots are the capacity currency. On-demand pricing hides them (you pay per byte and get the shared pool); capacity pricing makes them explicit (you reserve a number of slots and queries compete for them). Either way, the slot-seconds a query consumed is the honest measure of how heavy it was, and the INFORMATION_SCHEMA.JOBS views report it per query.

How a query runs: the execution tree

The 2010 paper described a static serving tree: leaf servers scanned data, mixers aggregated partial results up level by level, and the root returned the answer. That shape is excellent for one-pass aggregations — count, sum, group-by with modest cardinality — and awkward for everything else, joins in particular. Modern BigQuery (described in the 2020 follow-up paper, Dremel: A Decade of Interactive SQL Analysis at Web Scale) runs a more general model: a DAG of stages connected by a shuffle.

A typical aggregation query becomes three stages. The first stage's tasks each read one chunk of Capacitor files — only the columns the query references — apply the filters, and compute partial aggregates close to the data. Their output is repartitioned by grouping key through the shuffle, so that every row with the same key lands in the same place. The second stage merges those partials per key. A final stage gathers the result. Joins follow the same pattern with both inputs shuffled on the join key (or, when one side is small, broadcast to every task instead, skipping the shuffle entirely).

Bottom to top: scan tasks read pruned columns and pre-aggregate, the shuffle regroups rows by key, merge tasks finish the aggregation, a final stage returns the result.

The shuffle deserves its own sentence, because it is the piece that makes the stateless workers possible. Rather than workers streaming intermediate data directly to each other, BigQuery writes shuffle output to a separate, disaggregated in-memory tier (built on the distributed-memory shuffle infrastructure Google described in 2014, spilling to disk for very large jobs). The shuffle acts as a checkpoint between stages: if a worker dies, only its in-flight task is retried; if the scheduler wants to rescale a stage, it can, because the inputs are sitting safely in the shuffle tier, not in some other worker's memory.

This also tells you where query performance goes to die. Scans parallelise almost perfectly; shuffles do not. A join of two large tables on a high-cardinality key, or a GROUP BY over hundreds of millions of distinct keys, is shuffle-bound, and the fix is almost always to move less data into the shuffle: filter earlier, pre-aggregate before joining, or restructure so the big table is shuffled once instead of twice.

Why there are no indexes

BigQuery has no CREATE INDEX. No B-trees, no secondary indexes, no index maintenance, no query planner choosing between index paths. (Search indexes for text lookups arrived in 2022, but they are a niche feature, not the access path for analytics.) This is a deliberate bet, and it is worth understanding the arithmetic behind it rather than treating it as a quirk.

An index earns its keep when you need a few rows out of many and the alternative is reading everything. A warehouse workload is the opposite shape: most queries touch a large fraction of rows in a few columns — sum revenue by day, count events by type, join facts to dimensions. For that shape, columnar storage plus brute-force parallelism wins. The scan reads only the referenced columns, those columns are run-length and dictionary compressed, and two thousand slots each pulling hundreds of MB per second through Jupiter chew through a terabyte-scale column set in seconds. Meanwhile every index you might have built has to be updated on every load, slows ingestion, takes storage, and helps only the queries it anticipated. Dremel's authors looked at that trade and removed the feature.

But "scans are cheap" is a statement about latency, not about money. On-demand pricing charges per byte read, so a full scan that finishes in four seconds can still be an expensive four seconds if the table is large and you run it every five minutes. So BigQuery gives you two tools that do the useful part of an index's job — skipping data — without the write-time costs: partitioning and clustering. They do not find rows; they let the engine refuse to read the parts of the table that cannot match.

Partitioning and clustering: pruning instead of indexing

Partitioning splits a table into physically separate segments on one column: a date or timestamp column (at daily, hourly, monthly, or yearly granularity), ingestion time, or an integer range. Each partition is its own set of files with its own metadata. When a query filters on the partitioning column, the planner eliminates non-matching partitions before any slot reads a byte — the pruning happens at planning time, from metadata alone, and the dry-run estimate already reflects it. A table is capped at 10,000 partitions, which is why the standard choice is daily: about 27 years of headroom. If this layout sounds familiar, it should — it is the same coarse time-bucketing that time-series stores use, for the same reason: almost every query carries a time range, so time is the cheapest dimension to prune on.

Clustering works inside partitions. You name up to four columns, and BigQuery keeps the rows of each partition sorted by them, organised into blocks that carry min/max metadata per clustered column. A filter on a prefix of the clustering columns lets the scan skip every block whose range cannot match. Because blocks are not a fixed set of segments, clustering handles what partitioning cannot: high-cardinality columns. You cannot partition on customer_id with ten million customers, but you can cluster on it, and a WHERE customer_id = … then reads a sliver of each day instead of the whole day. New writes land unsorted at first; a background process re-clusters continuously, and unlike most databases' index maintenance, it is automatic and free.

Two pruning passes. The partition filter cuts the table to one day at plan time; cluster block ranges cut that day to the blocks that can contain station 111.

The two compose, and the composition has a practical asymmetry you should know before the lab: a dry run accounts for partition pruning but not cluster pruning. Partition elimination is decided from metadata before execution, so the estimate includes it; block skipping happens during the scan, so the estimate is an upper bound and the billed bytes of the actual run can come in lower. If a table must never be scanned whole, set require_partition_filter on it and BigQuery rejects any query that omits the partition column.

Choosing the layout is mostly answering one question: what does nearly every query filter on? That column is your partition column, and for event-shaped data it is nearly always the date. The next two or three most common filters become clustering columns, most selective first. The mistake to avoid is over-partitioning — integer-range partitions on a high-cardinality key produce thousands of tiny partitions, each with file overhead and minimum billing granularity, where clustering alone would have been both faster and cheaper. A reasonable rule: if a partition would hold under a gigabyte, you probably wanted clustering instead.

What you pay for: bytes or slots

On-demand is the default: $6.25 per TB of data read, with the first TB each month free and a 10 MB minimum per table touched. "Read" means the logical (uncompressed) bytes of the columns your query references, after partition pruning. The model is wonderfully legible — every query has a price you can see before you run it with a dry run — and it has one famous foot-gun: SELECT *. Because storage is columnar, you pay per column, and * names every column. Adding LIMIT 10 changes nothing, since all the columns must still be scanned to produce any ten rows. On a wide events table, SELECT * LIMIT 10 can bill terabytes for ten rows of output. The free table preview in the console exists precisely so nobody pays for curiosity.

Capacity pricing (the "editions": Standard, Enterprise, Enterprise Plus, which replaced the older flat-rate plans in 2023) inverts the model: you reserve slots and pay per slot-hour, with optional autoscaling between a baseline and a max, and one- or three-year commitments for a discount. Queries then cost nothing per byte; they compete for your reserved slots. The break-even arithmetic is straightforward in shape: steady, predictable workloads that scan tens of terabytes a day are cheaper on reserved slots, while spiky or exploratory workloads are cheaper on-demand because you pay nothing between queries. Many organisations run both — a reservation for the production pipelines, with ad-hoc analysis left on-demand.

Storage is billed separately and modestly: active storage at roughly object-storage prices, dropping by about half for any table or partition untouched for 90 days, with a choice between logical (uncompressed) and physical (compressed) billing models per dataset. The practical effect is that keeping years of history in BigQuery is cheap; what costs money is scanning it carelessly, which is exactly what partitioning protects you from.

Getting data in: loads and the Storage Write API

Batch loads are free and do not consume your query capacity — a load job from Cloud Storage in Parquet, ORC, Avro, CSV, or JSON runs on a shared pool and simply appends new Capacitor files to the table. For anything that cannot wait for a batch, there is the Storage Write API: a gRPC interface that accepts protobuf rows over long-lived streams. It replaced the legacy insertAll streaming endpoint, which was HTTP, JSON, more expensive per GB, and only best-effort about duplicates. The Write API offers real semantics: the default stream gives at-least-once delivery at high throughput, while application-created streams with explicit offsets give exactly-once appends, and pending-mode streams let you buffer a batch and commit it atomically — all of it visible to queries within seconds.

Under the hood, streamed rows land first in a row-oriented, write-optimised buffer and are query-able immediately; a background process converts them into Capacitor files shortly after. You never see the seam — a query just reads both representations — but it explains why freshly streamed data is slightly more expensive for the engine to scan, and why clustering takes a little while to "catch up" on data that just arrived.

Making repeated reads cheaper: materialized views and BI Engine

Pruning helps queries that filter; materialized views help queries that repeat. A materialized view precomputes an aggregation over a base table and — this is the part that matters — is maintained incrementally: when new rows land in the base table, BigQuery combines the stored view with just the delta rather than recomputing. Better still, the optimizer does automatic query rewrite: a query against the base table that the view can answer is silently redirected to it, so dashboards get the savings without anyone editing SQL. The price of incremental maintenance is a restricted SQL surface — aggregations, inner joins, no arbitrary window functions — so they fit the "same rollup, queried hundreds of times a day" pattern rather than general transformation.

BI Engine attacks the same problem from the memory side. It is an in-memory columnar cache, reserved by the gigabyte per project, that sits in front of BigQuery and serves eligible queries with vectorised execution at interactive latency. It exists because dashboard traffic has a brutal profile — hundreds of small, repetitive queries over a recent slice of data — and running each as a full distributed query wastes both time and slots. Point a reservation at the hot few GB and the dashboards stop being a tax on the warehouse.

Querying data you don't own: federation and BigLake

BigQuery can also run SQL over data that never went through a load job. External tables point at files in Cloud Storage (Parquet, ORC, Avro, CSV, JSON), at Bigtable, or even at Google Sheets; federated queries push SQL down into Cloud SQL and Spanner and pull results back. The convenience is real and so is the cost: external data has no Capacitor encoding, no block statistics, and limited pruning (Hive-style partition layouts on GCS paths help), so the same query runs slower and reads more than it would against a native table. Federation is the right tool for occasional joins against operational data, not for the hot path.

BigLake is the newer, more serious version of the idea: tables over object storage that behave like managed ones — row- and column-level security enforced by the table rather than by who can read the bucket, metadata caching for better pruning, and support for open formats including Apache Iceberg. The same BigLake table is readable from BigQuery and from Spark or other engines without copies. The direction of travel is plain: the boundary between "warehouse" and "data lake" is being dissolved deliberately, with the storage staying open-format and the engines becoming interchangeable on top.

BigQuery and Snowflake at a glance

The two are the obvious comparison, and the architectures rhyme: both separate storage from compute, both are columnar, neither has user-managed indexes. The fork in the road is the compute model.

	BigQuery	Snowflake
Compute	Serverless slot pool; the scheduler decides parallelism per query	Virtual warehouses you size (XS–6XL), start, stop, and pay for while running
Pricing	Per TB scanned (on-demand) or reserved slot-hours (editions)	Per-second credits while a warehouse is on, regardless of bytes
Skipping data	Explicit partitioning + clustering, declared by you	Automatic micro-partitions (~tens of MB) with min/max pruning; optional clustering keys
Streaming	Storage Write API, exactly-once streams, seconds to visibility	Snowpipe / Snowpipe Streaming
Isolation	Queries share a pool; fairness by scheduler	Hard isolation per warehouse — one team's scan cannot slow another's
Where it runs	GCP only	AWS, Azure, and GCP

The honest one-line version: Snowflake gives you the capacity knob with per-second billing and clean isolation, and asks you to manage it; BigQuery removes the knob entirely and asks you to think in bytes instead. Which trade is better depends on workload shape and on how much your organisation enjoys managing warehouses, and the long version of that argument — Databricks included — lives in the comparison page.

Lab: prove the pruning with bq

Everything above can be verified from a terminal in about ten minutes with the bq CLI (part of the gcloud SDK). The lab queries a public dataset, builds a partitioned and clustered copy of another, and shows the bytes-scanned difference with and without a partition filter — the numbers that turn the pruning diagram into something you have seen with your own eyes. Total data processed is a few GB, comfortably inside the monthly 1 TB free tier.

Price a query before running it. The --dry_run flag asks the planner for the byte estimate without executing anything, and it is free.
bq query --use_legacy_sql=false --dry_run \ 'SELECT name, SUM(number) AS total FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name ORDER BY total DESC LIMIT 10' # "... running this query will process 65662975 bytes of data." # ~65 MB: two columns of a 6M-row table. Note LIMIT changed nothing.
Run it for real. Same command without the flag. Watch how fast a distributed scan-shuffle-merge over the whole table comes back.
bq query --use_legacy_sql=false \ 'SELECT name, SUM(number) AS total FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name ORDER BY total DESC LIMIT 10'
Create a dataset, then a partitioned + clustered table. The London cycle-hire public dataset lives in the EU multi-region, so the destination dataset must too. The CTAS statement below is the whole layout declaration: daily partitions on the ride's start time, clustered by start station.
bq --location=EU mk -d bq_lab bq query --use_legacy_sql=false \ 'CREATE TABLE bq_lab.hires PARTITION BY DATE(start_date) CLUSTER BY start_station_id AS SELECT rental_id, duration, bike_id, start_date, start_station_id, end_station_id FROM `bigquery-public-data.london_bicycles.cycle_hire`'
Dry-run the same question with and without a partition filter. Both queries compute the same average for one station; only the second tells the planner which days it needs.
# A. no partition filter — estimate covers every partition bq query --use_legacy_sql=false --dry_run \ 'SELECT AVG(duration) FROM bq_lab.hires WHERE start_station_id = 111' # processes the duration + start_station_id columns of the WHOLE table # (a few hundred MB) # B. partition filter — planner prunes to ~30 of ~2000 daily partitions bq query --use_legacy_sql=false --dry_run \ 'SELECT AVG(duration) FROM bq_lab.hires WHERE start_date BETWEEN TIMESTAMP("2016-06-01") AND TIMESTAMP("2016-06-30") AND start_station_id = 111' # estimate drops by roughly the ratio of days kept to days total — # tens of MB instead of hundreds
Run query B and read the billed bytes. Give the job a name so you can inspect it, then pull the statistics. This is where clustering shows up: the dry run could not account for block skipping, so the actual bytes can land below the estimate.
bq query --use_legacy_sql=false --job_id=hires_pruned_1 \ 'SELECT AVG(duration) FROM bq_lab.hires WHERE start_date BETWEEN TIMESTAMP("2016-06-01") AND TIMESTAMP("2016-06-30") AND start_station_id = 111' bq show --format=prettyjson -j hires_pruned_1 | grep -i bytes # totalBytesProcessed / totalBytesBilled — compare against the dry-run # estimate from step 4B and against 4A's full-table number
Tear down. One command removes the dataset and the table inside it. There is no cluster to stop, which is rather the point.
bq rm -r -f -d bq_lab

What you just measured. Step 4A is the no-index world without pruning: every query pays for the full column set. Step 4B is partition elimination, visible at plan time. Step 5 is cluster-block skipping, visible only after execution. Together they are the entire replacement for indexing — and unlike an index, they cost nothing on write.

BigQuery

Where BigQuery comes from

The architecture: storage, compute, and the network between them

What a slot actually is

How a query runs: the execution tree

Why there are no indexes

Partitioning and clustering: pruning instead of indexing

What you pay for: bytes or slots

Getting data in: loads and the Storage Write API

Making repeated reads cheaper: materialized views and BI Engine

Querying data you don't own: federation and BigLake

BigQuery and Snowflake at a glance

Lab: prove the pruning with bq

Further reading

08 — Spanner