
Cosmos DB

Cosmos DB is Azure's globally distributed document database, and almost everything confusing about it dissolves once you hold two facts at once: every operation is priced in request units, and every item lives in a logical partition you chose when you created the container. RUs are the currency, the partition key is the constitution, and the five consistency levels are the feature nobody else ships. This page covers all three properly, plus indexing, the change feed, TTL, backups, and a CLI lab that prints the RU cost of every query you run.


What Cosmos DB actually is

Underneath the marketing, Cosmos DB is one thing: a horizontally partitioned, multi-region replicated document store. Items are JSON, grouped into containers, and every container is sharded by a partition key you pick. The storage engine keeps data in an internal format Microsoft calls atom-record-sequence, which is general enough to project several data models on top of the same engine. That projection is the famous multi-model promise, and it deserves an honest reading.

The API you should think of as native is the API for NoSQL (long called the SQL or Core API). It speaks a SQL-flavoured query language over JSON, gets new features first, has the best SDKs, and is what the rest of this page assumes. Around it sit the compatibility APIs: MongoDB (wire-protocol compatibility with a specific server version, not a fork of Mongo itself), Cassandra (CQL over the same backend), Gremlin (property graphs), and Table (a landing pad for old Azure Table Storage workloads). There is also "Azure Cosmos DB for PostgreSQL", which is a different animal entirely: it is Citus, a distributed Postgres, sharing little more than the brand name.

The compatibility APIs are real and supported, but they are second-class in ways that matter. They trail the native API on features, they trail upstream Mongo and Cassandra on protocol version, and the RU model leaks straight through them: a Mongo aggregation pipeline still bills request units, throttles with the same rate-limit behaviour, and obeys the same logical partition limits, except now the cost of each operation is harder to predict because it goes through a translation layer first. The honest guidance: pick the NoSQL API for new work, and use the compatibility APIs for what they were built for, which is migrating an existing application without rewriting its data layer. Teams that pick the Mongo API for a greenfield project usually end up debugging two systems at once.

If you know AWS, the closest relative is DynamoDB, and the comparison is useful in both directions: both are partitioned key-addressed stores where throughput is a first-class billable resource and the partition key decides your fate. Cosmos gives you more dials (consistency levels, richer queries, multi-region writes as a checkbox); DynamoDB gives you fewer ways to hurt yourself. The DynamoDB internals page covers the other side of that family tree.

Request units: the currency of everything

Every operation in Cosmos DB consumes request units. An RU is an abstract bundle of CPU, IOPS and memory, normalised so that one specific operation costs exactly one: a point read of a 1 KB item by its id and partition key costs 1 RU. Everything else is priced relative to that anchor. Writing a 1 KB item costs roughly 5 RU, because a write has to update every index path and replicate to a quorum. Bigger items cost proportionally more, in both directions. A query costs whatever work it causes: a filter answered by the index on a single partition might be 3 to 10 RU, a query that scans, sorts, or fans out across partitions can run to hundreds. Deletes and replaces are priced like writes. Reads at strong or bounded-staleness consistency cost about double the RUs of the weaker levels, because they need a quorum of replicas to agree before answering.

The part people get wrong is the unit of time. Provisioned throughput is RU per second, enforced per second. If you provision 400 RU/s, you have a budget of 400 every second, refreshed every second, and the meter does not bank what you did not spend. A single 80 RU query eats a fifth of that second's budget. This is why a container can be nearly idle on average and still throttle: the budget is a rate, not a monthly pool.

[Figure: one second of a 400 RU/s container. Blocks of example charges fill the 400 RU budget, labelled: 1 RU for a point read of a 1 KB item; 6 RU for an insert of a 1 KB item with default indexing; 50 RU for a single-partition query with an indexed filter; 180 RU for a cross-partition query with ORDER BY. The next request needs 100 RU with only 53 left, so it gets HTTP 429 with x-ms-retry-after-ms: 740. The budget resets the next second: a rate, not a pool, and unspent RUs do not carry over.]
The RU budget is enforced per second. The dashed block is a request that would exceed the remaining budget; Cosmos rejects it with 429 and tells the client when to retry.
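The "rate, not a pool" point can be made concrete with a toy model. The sketch below is only an illustration under simplifying assumptions: the function names are made up, and the real service meters per physical partition and does not behave exactly like this loop. It does show why an idle second buys nothing for the busy second that follows.

```python
def simulate_second(budget_ru, request_costs):
    """Replay requests against one second's RU budget (toy model).

    Returns (served, throttled) lists of request costs.
    """
    remaining = budget_ru
    served, throttled = [], []
    for cost in request_costs:
        if cost <= remaining:
            remaining -= cost
            served.append(cost)
        else:
            throttled.append(cost)  # would surface to the client as HTTP 429
    return served, throttled


def simulate(budget_ru, seconds):
    """Each second starts with a fresh budget; unspent RUs are not carried over."""
    return [simulate_second(budget_ru, costs) for costs in seconds]


# Second 1 is completely idle; second 2 sees a burst of five 100 RU queries.
results = simulate(400, [[], [100, 100, 100, 100, 100]])
for i, (served, throttled) in enumerate(results, start=1):
    print(f"second {i}: served={sum(served)} RU, throttled={len(throttled)} request(s)")
```

Averaged over the two seconds this workload uses 250 RU/s of a 400 RU/s budget, yet the fifth query in the burst is still rejected.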

When you exceed the budget, Cosmos does not queue your request. It rejects it with HTTP 429 ("request rate too large") and an x-ms-retry-after-ms header saying when to try again. The official SDKs retry these automatically, nine times by default, so a lightly throttled workload mostly shows up as latency rather than errors. That default is a trap in both directions: occasional 429s are normal and fine (Microsoft's own guidance is that a small percentage is healthy utilisation), but a sustained stream of them hiding behind silent retries means your p99 is quietly built out of backoff sleeps. Watch the 429 rate as a metric, not just the error rate your application sees.

You buy RUs in one of three modes, and the choice is mostly a statement about your traffic shape:

| Mode | How it works | When it fits |
| --- | --- | --- |
| Provisioned (manual) | A fixed RU/s figure, minimum 400, billed every hour whether used or not. Set on a container or shared across a database. | Steady, predictable traffic. Cheapest per RU. |
| Autoscale | You set a maximum; Cosmos scales instantly between 10% of that and the max. About 1.5x the unit price of manual. | Spiky but sustained workloads. Note the floor: you always pay at least 10% of max. |
| Serverless | No provisioning at all; you pay per million RUs consumed. Per-container throughput and storage caps apply. | Dev, test, demos, and genuinely intermittent traffic. The lab below uses it. |

Two more billing realities round out the model. First, on provisioned containers the RUs you buy are split evenly across the container's physical partitions, which is the subject of the next section and the cause of most "we have spare throughput but still throttle" incidents. Second, every region you replicate to bills its own copy of the throughput, so adding a second region roughly doubles the RU line on the invoice. Neither fact is hidden, but both routinely surprise teams at the first month's bill.
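To see how the mode and the region count multiply together, here is a rough sketch in relative units rather than currency. The function names are hypothetical, the 1.5x factor is the approximate figure from the table, and the assumption that autoscale bills each hour at the highest RU/s it scaled to (never below the 10% floor) should be checked against current pricing documentation.

```python
AUTOSCALE_PRICE_FACTOR = 1.5  # "about 1.5x" manual; approximate, not a quoted price


def manual_cost_units(ru_per_s, regions=1):
    """Manual throughput is billed in full, once per region."""
    return ru_per_s * regions


def autoscale_cost_units(max_ru_per_s, peak_ru_per_s, regions=1):
    """Autoscale bills the hour's peak, with a floor of 10% of the maximum."""
    floor = max_ru_per_s / 10
    billed = max(floor, min(peak_ru_per_s, max_ru_per_s))
    return billed * AUTOSCALE_PRICE_FACTOR * regions


print(manual_cost_units(10_000, regions=1))            # 10000
print(manual_cost_units(10_000, regions=2))            # 20000
print(autoscale_cost_units(10_000, peak_ru_per_s=0))   # 1500.0
print(autoscale_cost_units(10_000, peak_ru_per_s=10_000, regions=2))  # 30000.0
```

Under these assumptions, autoscale pinned at its maximum costs more than manual at the same figure, and an idle autoscale container still pays for its floor.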

Partitioning: the one decision that matters

Every item in a container carries a partition key, a property path you choose at container creation and can never change without rewriting the data. All items sharing one value of that key form a logical partition, and a logical partition is the unit of three things: locality (single-partition queries are cheap), transactionality (stored procedures and transactional batches work within one logical partition only), and limits. The limit that bites is size: a logical partition can hold at most 20 GB. That is a hard ceiling. If you partition a multi-tenant app by /tenantId and one tenant grows past 20 GB, writes for that tenant start failing and there is no setting to raise; the only fix is a new container with a different key and a data migration, under pressure, in production.

Logical partitions are hashed onto physical partitions, the actual replica sets that store data and serve requests. Each physical partition holds up to about 50 GB and serves up to 10,000 RU/s. Cosmos splits physical partitions automatically as data or provisioned throughput grows, and here is the mechanism behind most throttling mysteries: your provisioned RU/s is divided equally among physical partitions. A container with 30,000 RU/s across six physical partitions gives each one a 5,000 RU/s slice. If one logical partition gets a disproportionate share of traffic, the physical partition it lives on throttles at its slice while its five siblings sit idle, and the container-level metric cheerfully reports about 17% utilisation (one saturated slice out of six) while your users see 429s.

[Figure: logical partitions (one per key value, ≤ 20 GB each) such as user-ada, user-bob, tenant-megacorp, user-cho, user-dee and user-eli are mapped by hash(partition key) onto three physical partitions, each ≤ 50 GB with a 3,333 RU/s slice. Physical 1 is 12% used, physical 2 is at 100% and throttling (429), physical 3 is 9% used. Hot partition: one key value draws the traffic and its physical partition throttles at its slice. The container shows 40% utilisation overall and still returns 429s.]
Provisioned 10,000 RU/s divided across three physical partitions gives each a 3,333 RU/s slice. A skewed key throttles one slice while the others idle.
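The arithmetic behind a hot partition is short enough to write down. This is a minimal sketch with a hypothetical helper name, using the 30,000 RU/s, six-partition example from the text and assuming the other five partitions are fully idle, in which case the container-wide figure works out to about a sixth.

```python
def partition_report(provisioned_ru, per_partition_used_ru):
    """Return (slice size, per-partition utilisation, container utilisation)."""
    n = len(per_partition_used_ru)
    slice_ru = provisioned_ru / n  # throughput is split evenly across partitions
    per_partition = [used / slice_ru for used in per_partition_used_ru]
    container = sum(per_partition_used_ru) / provisioned_ru
    return slice_ru, per_partition, container


# One partition saturated at its slice, five idle.
slice_ru, per_partition, container = partition_report(
    30_000, [5_000, 0, 0, 0, 0, 0]
)
print(f"slice per physical partition: {slice_ru:.0f} RU/s")
print(f"hottest partition: {max(per_partition):.0%}")
print(f"container average: {container:.0%}")
```

The gap between the last two lines is the whole incident: alert on the hottest partition's utilisation, not on the container average.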

Choosing the key is therefore an exercise in three properties at once. You want high cardinality (many distinct values), an even spread of both storage and request volume across those values, and a key that appears in the filter of your hottest queries so reads stay single-partition. /userId in a consumer app usually satisfies all three. /tenantId in a B2B app satisfies the third and flunks the first two as soon as one tenant outgrows the rest. /date is the classic disaster: all of today's writes land on a single logical partition, which is exactly one physical partition's slice of your throughput.

When no natural property works, build a synthetic key. Concatenate two properties (tenantId-month) to bound partition size by time, or append a random or computed suffix (device42-s7, with suffixes 0 through 15) to spread one hot entity across sixteen logical partitions. The price of suffixing is that reads for that entity now fan out across all sixteen, so you trade write headroom for read cost. That trade is fine when writes are the bottleneck, which is usually why you are suffixing in the first place. Cross-partition queries in general are not forbidden, just billed honestly: the query fans out to every physical partition, and the RU charge grows with the partition count, which means a query that cost 12 RU in dev on one partition can cost ten times that in prod on twenty.
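Both synthetic-key styles are a few lines of string handling. The sketch below is an assumption-laden illustration: the helper names, the choice of SHA-256 to derive a stable suffix, and the key formats are all hypothetical, not anything Cosmos prescribes.

```python
import hashlib

SUFFIX_COUNT = 16  # suffixes 0 through 15, as in the example above


def suffixed_key(entity_id, discriminator, suffix_count=SUFFIX_COUNT):
    """Spread one hot entity over several logical partitions.

    The suffix is computed from a per-write discriminator (for example an
    event id), so the same write always maps to the same partition key.
    """
    digest = hashlib.sha256(str(discriminator).encode("utf-8")).digest()
    suffix = int.from_bytes(digest[:4], "big") % suffix_count
    return f"{entity_id}-s{suffix}"


def all_keys(entity_id, suffix_count=SUFFIX_COUNT):
    """Every partition key a reader must query to see the whole entity."""
    return [f"{entity_id}-s{i}" for i in range(suffix_count)]


def time_bucketed_key(tenant_id, year, month):
    """Bound logical partition size by time: one partition per tenant per month."""
    return f"{tenant_id}-{year:04d}-{month:02d}"


print(suffixed_key("device42", "event-0001"))
print(len(all_keys("device42")))
print(time_bucketed_key("acme", 2024, 3))
```

The read-side cost is visible in all_keys: a full read of device42 now means sixteen partition-scoped queries (or one fan-out) instead of one.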

The two numbers to memorise. 20 GB per logical partition, hard limit, no appeal. About 10,000 RU/s per physical partition, with provisioned throughput split evenly across physical partitions. Most Cosmos incidents trace to one of these two.

The five consistency levels

Most distributed databases give you two settings: strongly consistent reads at a price, or eventual consistency by default. Cosmos DB's flagship feature is a spectrum of five well-defined levels, set as a default on the account and optionally weakened per request. They are not marketing names; each maps onto a guarantee from the distributed-systems literature, and Microsoft published TLA+ specifications of all five, reviewed by Leslie Lamport's group. If you have read the consistency models page, this section will feel like meeting old friends in a product UI.

[Figure: five levels, one slider. From left to right: strong (linearizable); bounded staleness (lag ≤ K versions / T secs); session, the default (read-your-writes per token); consistent prefix (in order, maybe stale); eventual (no ordering promise). Moving right: lower read latency, lower RU cost, higher availability. Moving left: stronger guarantees, quorum reads (~2x RU), region-outage sensitivity.]
The spectrum. Strong and bounded staleness read from a quorum and cost about twice the RUs; session is the default and the sweet spot for most applications.

Strong is linearizability: a read anywhere returns the most recent committed write, full stop. Reads wait on a quorum, latency tracks the distance between replicas, and you give up multi-region writes entirely, because a write that must be globally acknowledged before it is readable cannot also be accepted independently in five regions. Use it when the cost of a stale read is genuinely unacceptable, and notice how rarely that is true once you ask the question precisely.

Bounded staleness relaxes strong by a measured amount: reads may lag writes by at most K versions or T seconds, whichever trips first, and the lag you accept never reorders anything. It is the honest answer to "we want strong-ish reads in other regions": you get a quantified, monitorable staleness bound instead of a vague hope. Like strong, its reads are quorum reads at roughly double RU cost.

Session is the default, and the default is clever. Within one session, identified by a token the SDK threads through requests automatically, you get read-your-writes, monotonic reads, monotonic writes, and write-follows-reads. These are exactly the session guarantees from Terry et al.'s Bayou work in the nineties, and they match what almost every application actually needs: the user who just updated their profile sees the update; whether a stranger in another region sees it 80 milliseconds earlier or later is irrelevant. The cleverness is in where the cost lands. Strong consistency makes the server coordinate globally on every operation; session consistency makes each client carry a small token, and the server only has to honour that token's ordering. Per-client guarantees at near-eventual prices. The trap: the guarantee follows the token, so two of your app servers handling the same human in different requests are different sessions unless you propagate the token between them yourself.
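Propagating the token yourself might look roughly like the sketch below. It assumes the azure-cosmos Python SDK, where the token comes back in the x-ms-session-token response header and can be passed to a later call through the session_token keyword; the two helper names are hypothetical, and how the token travels between servers (a cookie, a response header of your own) is left to the application.

```python
def write_and_capture_token(container, item):
    """Write an item and return the session token the service handed back.

    `container` is assumed to be an azure.cosmos ContainerProxy.
    """
    container.upsert_item(item)
    headers = container.client_connection.last_response_headers
    return headers.get("x-ms-session-token")


def read_with_token(container, item_id, partition_key, session_token):
    """Read on a different client or server, honouring the caller's session."""
    return container.read_item(
        item=item_id,
        partition_key=partition_key,
        session_token=session_token,
    )
```

Server A would call write_and_capture_token and return the token to the browser alongside its response; server B would pass it to read_with_token on the next request, which is what extends read-your-writes across the two processes.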

Consistent prefix promises you never observe writes out of order: if the truth went A then B then C, you may read A, or A-B, but never A-C without B. You can be arbitrarily stale but never incoherent. Eventual drops even that: replicas converge eventually, reads may briefly go backwards in time, and in exchange you get the lowest latency, the highest availability, and the cheapest reads. It is the right answer for like counts, view counters, and anything else where a momentary misordering is invisible.

Two practical notes. The account default is the strongest you will ever get; per-request overrides can only weaken, never strengthen, so set the account to the strongest level any workload needs and relax per request. And the difference between the levels is invisible in a single-region dev environment under light load, which is why teams discover their actual consistency requirements in production, after adding the second region. Decide on paper first.

Global distribution and multi-region writes

Replicating a Cosmos account to another region is genuinely a one-line operation: add the region, wait for the hydration, done. The default topology is one write region with any number of read regions, and service-managed failover can promote a read region if the write region goes down. SDKs route reads to the nearest region automatically, and session tokens keep the session guarantees intact across that routing.

Turn on multi-region writes and every region accepts writes locally with local latency, which is the headline feature and also the moment you inherit a classic distributed-systems problem: two regions can update the same item concurrently, and someone has to decide who wins. Cosmos gives you three options. The default is last-writer-wins on the item's _ts timestamp (or any numeric path you nominate), which is simple and silently discards the losing write. The second is a custom merge procedure, a stored procedure that receives both versions and decides. The third is the conflict feed: conflicts are parked in a readable feed and your application resolves them offline. Last-writer-wins is fine for data where the newest version is the truth (presence, telemetry, profile fields edited by one human). It is quietly destructive for anything additive, like counters or carts, where "discard one of the two updates" means losing data. If you cannot articulate your merge function, you are not ready for multi-region writes on that container.
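Why last-writer-wins is destructive for additive data is easiest to see with two concurrent increments. The following is a toy model in plain Python, not Cosmos's merge-procedure API: the document shape (one increment tally per region, a simple grow-only counter) and the function names are assumptions chosen so that a merge needs only the two conflicting versions.

```python
def total(item):
    return sum(item["increments"].values())


def last_writer_wins(a, b):
    """Keep the version with the higher _ts; the other is silently discarded."""
    return a if a["_ts"] >= b["_ts"] else b


def merge_counters(a, b):
    """Merge by taking each region's highest tally, so no increment is lost."""
    regions = set(a["increments"]) | set(b["increments"])
    merged = dict(last_writer_wins(a, b))
    merged["increments"] = {
        r: max(a["increments"].get(r, 0), b["increments"].get(r, 0))
        for r in sorted(regions)
    }
    return merged


# Both regions start from {"west": 5, "east": 5} (total 10) and write concurrently.
west = {"id": "likes-1", "increments": {"west": 6, "east": 5}, "_ts": 101}  # +1
east = {"id": "likes-1", "increments": {"west": 5, "east": 7}, "_ts": 102}  # +2

print(total(last_writer_wins(west, east)))  # 12: west's increment is gone
print(total(merge_counters(west, east)))    # 13: both increments survive
```

If you can write the equivalent of merge_counters for your data, you have articulated your merge function; if you cannot, last-writer-wins will make the decision for you.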

And the billing note repeated once because it hurts: each region carries the full provisioned throughput, so a 3-region account pays for its RUs three times, and multi-region write accounts pay a further premium on writes. Global distribution is the easiest expensive checkbox in Azure.

Indexing: everything, by default

Cosmos indexes every path of every item by default. There is no CREATE INDEX, no un-indexed column to forget about; a freshly inserted document is immediately queryable on any property. That default is wonderful for getting started and quietly expensive at scale, because every indexed path is extra work on every write, and write RU charges grow with the number of paths maintained. A wide document with a 40-field nested payload pays to index all 40 fields, whether or not any query ever filters on them.

The fix is the container's indexing policy, a JSON document of included and excluded paths. The pattern that serves most workloads: include the root wildcard, then exclude the heavy subtrees you never filter on, like /payload/* or /rawEvent/*. Trimming a chunky document's policy routinely cuts write costs by a third or more. Two related tools live in the same policy: composite indexes, which you must declare explicitly for queries that ORDER BY two properties or combine multiple range filters efficiently, and spatial indexes for geo queries. Changing the policy triggers an online reindex that competes with your workload for RUs, so do it during quiet hours on big containers.
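A policy following that pattern might look like the sketch below, written as a Python dict so it can be dumped to the JSON the container expects. The excluded subtrees are the examples from the text; the composite index on /userId and /total is a hypothetical pairing for illustration, so substitute the properties your own ORDER BY queries use.

```python
import json

indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    # Index everything by default...
    "includedPaths": [{"path": "/*"}],
    # ...except heavy subtrees that no query ever filters on.
    "excludedPaths": [
        {"path": "/payload/*"},
        {"path": "/rawEvent/*"},
    ],
    # Declared explicitly: supports ORDER BY c.userId ASC, c.total DESC.
    "compositeIndexes": [
        [
            {"path": "/userId", "order": "ascending"},
            {"path": "/total", "order": "descending"},
        ]
    ],
}

print(json.dumps(indexing_policy, indent=2))
```

After applying a policy like this, compare the write charge of a representative document before and after; the RU difference is the saving.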

The feedback loop is the RU charge itself. A query that returns three items for 4 RU used the index; the same query at 600 RU is scanning. The query metrics in the SDK and portal break down where the time went, and the habit of glancing at the charge after writing any new query is the cheapest performance review you will ever do.

The change feed

Every container keeps an ordered log of its changes, readable as the change feed: inserts and updates appear in order within each logical partition, and you can start reading from now, from the beginning, or from a checkpoint. The change feed processor library (and the Azure Functions Cosmos trigger built on it) handles the tedious parts, distributing partitions across consumer instances and checkpointing progress in a lease container, so a consumer that crashes resumes where it left off.

This single feature carries a remarkable share of real Cosmos architectures: materialised views maintained by a consumer that re-shapes documents into a second container with a different partition key (the standard answer to "we need to query by two different keys"), cache invalidation, search-index population, and event-driven pipelines into Service Bus or Event Hubs. The classic gotcha is deletes: the default change feed mode shows the latest version of each item and does not show deletions at all, so the long-standing pattern is a soft-delete flag plus TTL, letting consumers observe the flag before the item expires. A newer all-versions-and-deletes mode lifts this restriction with a retention window, but the soft-delete pattern is still what you will meet in most existing systems.
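The soft-delete pattern is simple on the consumer side. Here is a minimal sketch of a handler that maintains an in-memory view from change-feed documents; the deleted flag name and the sample documents are hypothetical, and a real consumer would receive its batches from the change feed processor and write to a second container rather than a dict.

```python
def apply_change(view, doc):
    """Apply one change-feed document to a view keyed by item id."""
    if doc.get("deleted"):  # hypothetical soft-delete flag set by the writer
        view.pop(doc["id"], None)
    else:
        view[doc["id"]] = doc  # the feed carries the latest version of the item
    return view


# Changes as a consumer might see them, in order.
changes = [
    {"id": "order-1", "userId": "user-1", "status": "placed"},
    {"id": "order-2", "userId": "user-2", "status": "placed"},
    {"id": "order-1", "userId": "user-1", "status": "shipped"},
    # Soft delete: flag the item and give it a ttl so it expires later.
    {"id": "order-2", "userId": "user-2", "status": "placed",
     "deleted": True, "ttl": 3600},
]

view = {}
for doc in changes:
    apply_change(view, doc)

print(sorted(view))  # ['order-1']
```

The ttl on the flagged item has to be long enough for every consumer to read the flag before the item expires, since the expiry itself never appears in the default feed.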

TTL and backups

Time-to-live is a container setting with per-item overrides. Set defaultTtl to a number of seconds and items expire that long after their last write; set it to -1 and TTL is enabled but nothing expires unless an item carries its own ttl property. Expiry costs you nothing in provisioned terms, because deletion runs opportunistically on leftover RUs, with one consequence worth knowing: on a saturated container, expired items can linger physically. They stop appearing in queries the moment they expire either way, so correctness holds even when the cleanup lags. TTL plus the change feed is the idiomatic Cosmos answer to session stores, device-state caches, and anything else with a natural shelf life.
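The interaction between the container default and the per-item override trips people up, so here is the rule as a small function. It is a sketch of the documented semantics as I understand them, with a hypothetical function name: None stands for a container with no defaultTtl set, _ts is the item's last-write timestamp in epoch seconds, and an item-level ttl of -1 is taken to mean "never expire".

```python
def expires_at(item, default_ttl):
    """Return the epoch second at which an item expires, or None for never."""
    if default_ttl is None:
        return None  # TTL is off for the container; any item ttl is ignored
    ttl = item.get("ttl", default_ttl)  # a per-item ttl overrides the default
    if ttl == -1:
        return None  # enabled, but this item does not expire
    return item["_ts"] + ttl  # counted from the last write


print(expires_at({"_ts": 1000}, 3600))              # 4600
print(expires_at({"_ts": 1000, "ttl": 60}, -1))     # 1060
print(expires_at({"_ts": 1000}, -1))                # None
print(expires_at({"_ts": 1000, "ttl": 60}, None))   # None
```

The last case is the one worth remembering: an item's own ttl does nothing until the container has a defaultTtl, even if that default is -1.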

Backups come in two modes. Periodic, the default, snapshots the account on an interval (4 hours unless you change it) and keeps a short window of copies. The part that surprises people is the restore: it means filing a support request and waiting, and the data comes back in a new account. Continuous backup gives self-service point-in-time restore to any moment in the last 7 or 30 days depending on tier, for an extra storage-based fee. If the data matters, the decision is not close: continuous mode turns "we deleted the container" from a support-ticket incident into a ten-minute restore you can run yourself at 3 a.m. Check the current feature matrix before committing, since a few account configurations still constrain which mode you can use.

Cosmos vs Azure SQL vs Postgres

Cosmos is the right tool when your access patterns are known and key-shaped, your traffic needs elastic throughput, or your users are global and want local read latency: user profiles, session and device state, catalogs, IoT telemetry with a bounded key space, personalisation data. It is the wrong default for a CRUD application whose data is relational, whose queries are still being discovered, and whose team thinks in joins. Forcing an invoicing schema into documents produces denormalised copies to maintain, cross-partition queries where foreign keys used to be, and an RU bill that tracks your modelling mistakes with great precision.

Within Azure the practical split: Azure SQL when you want the SQL Server ecosystem, T-SQL, and entity-spanning transactions; PostgreSQL Flexible Server as the general-purpose default when nothing forces a managed NoSQL store, with the richest open ecosystem and the easiest exit; Cosmos when the workload genuinely has the shape described above, because when it does, nothing relational will scale as smoothly or as predictably. The framework for making this call honestly, including the questions to ask before the schema exists, lives in choosing a database. And if the account hierarchy in the lab below (resource groups, locations, the az defaults) is unfamiliar, the Azure foundations page covers that plumbing.

Lab: a serverless account, real RU numbers, then teardown

Theory above, meter below. This lab creates a serverless account (no provisioned throughput to pay for, you are billed only for the RUs the lab consumes, which is fractions of a cent), a container partitioned by /userId, then inserts and queries items while printing the RU charge of every operation. You need the az CLI, a logged-in subscription, and Python 3 for the data-plane steps, because az covers the control plane only; there is no az command that inserts an item into a NoSQL-API container, which is itself worth knowing.

  1. Create the resource group and the serverless account. Account creation is the slow step; five to ten minutes is normal.
    az group create -n cosmos-lab -l westeurope
    az cosmosdb create -n cosmoslab$RANDOM -g cosmos-lab \
      --capabilities EnableServerless \
      --default-consistency-level Session
    # capture the generated name for the rest of the lab
    ACCT=$(az cosmosdb list -g cosmos-lab --query '[0].name' -o tsv)
  2. Create a database and a container with a partition key. Note that the key path is locked in here, forever.
    az cosmosdb sql database create -a $ACCT -g cosmos-lab -n labdb
    az cosmosdb sql container create -a $ACCT -g cosmos-lab \
      -d labdb -n orders \
      --partition-key-path "/userId"
  3. Grab the endpoint and key for the data plane.
    export COSMOS_ENDPOINT=$(az cosmosdb show -n $ACCT -g cosmos-lab \
      --query documentEndpoint -o tsv)
    export COSMOS_KEY=$(az cosmosdb keys list -n $ACCT -g cosmos-lab \
      --query primaryMasterKey -o tsv)
    pip install azure-cosmos
  4. Insert items and print the RU charge of each write. Save as lab.py; the request charge rides back on every response as the x-ms-request-charge header.
    import os
    from azure.cosmos import CosmosClient

    client = CosmosClient(os.environ['COSMOS_ENDPOINT'], os.environ['COSMOS_KEY'])
    c = client.get_database_client('labdb').get_container_client('orders')

    def charge():
        return c.client_connection.last_response_headers['x-ms-request-charge']

    for i in range(1, 21):
        item = dict(id='order-' + str(i), userId='user-' + str(i % 4),
                    total=i * 7, status='placed')
        c.create_item(item)
        print('write ' + item['id'] + ':', charge(), 'RU')
    Run it with python3 lab.py. Each small write lands around 5 to 7 RU, matching the rule of thumb from earlier.
  5. Compare a point read, a single-partition query, and a fan-out. Append this to lab.py and run again. Re-running the writes raises a conflict on the first duplicate id and stops the script, so comment the loop out first, or change create_item to upsert_item. The three numbers it prints are the whole RU lesson in miniature.
    # charge() reports the most recent response only; with 20 small items
    # each query here should fit in a single page.
    c.read_item(item='order-3', partition_key='user-3')
    print('point read:', charge(), 'RU')

    list(c.query_items(
        "SELECT * FROM c WHERE c.userId = 'user-1'",
        partition_key='user-1'))
    print('single-partition query:', charge(), 'RU')

    list(c.query_items(
        'SELECT * FROM c WHERE c.total > 50',
        enable_cross_partition_query=True))
    print('cross-partition query:', charge(), 'RU')
    Expect roughly 1 RU for the point read, a few RU for the partition-scoped query, and a noticeably larger charge for the fan-out, and remember this container has a handful of physical partitions at most; the fan-out's cost grows with partition count while the point read's never does.
  6. Tear it all down. One command removes the account and everything in it.
    az group delete -n cosmos-lab --yes --no-wait

Total cost of the lab: well under a cent of consumed RUs. Total value: you have now watched the meter that every Cosmos design decision is ultimately negotiating with.
