Service Bus & Event Hubs

Azure ships three messaging services and the names do not help you pick. The split is simpler than it looks. Service Bus is an enterprise message broker for commands and work that must not be lost. Event Hubs is a partitioned event log, Kafka-shaped, for telemetry and streams measured in events per second. Event Grid is reactive routing for notifications: something happened, push it to whoever cares. This page covers how each one actually works, when to use which, and how they combine — then a CLI lab where you dead-letter a real message and read a stream from a consumer group.

Three services, one sentence each

Almost every messaging decision on Azure comes down to the shape of what you are sending. A command is an instruction to do work: charge this card, ship this order, resize this image. It must be processed exactly once-ish, it usually has a single owner, and losing one is a bug with a ticket attached. An event stream is a firehose of facts: clicks, sensor readings, log lines. Individual events matter less than the aggregate, consumers want to replay history, and throughput is the headline number. A notification is a poke: a blob was created, a VM was deallocated, a subscription is about to expire. It carries little data, fans out to many interested parties, and the sender does not care what they do with it.

Service Bus is built for commands, Event Hubs for streams, Event Grid for notifications. The services share a portal blade and a vaguely similar vocabulary, but underneath they are different machines. Service Bus is a broker: it owns each message, tracks who has it, and deletes it once a consumer completes it. Event Hubs is a log: it appends events to partitions and lets readers keep their own place, the same model Kafka uses. Event Grid is a router: it holds nothing for long, it just pushes copies of events at subscribers until they answer with a success code or the retries run out.

Question	Service Bus	Event Hubs	Event Grid
What is it	Message broker	Partitioned event log	Event router
Payload shape	Commands, work items	Telemetry, streams	Notifications
Delivery	Pull, with settlement	Pull, offset-based	Push, HTTP with retry
Replay	No — messages are consumed	Yes — within retention	No
Ordering	FIFO with sessions	Per partition	None guaranteed
Typical scale	Thousands/sec	Millions/sec	Bursty fan-out
AWS cousin	SQS + SNS	Kinesis / MSK	EventBridge

If you have read the AWS messaging page, the mapping is close but not exact, and the differences are worth a section of their own near the end. First, the decision in picture form, then each service in depth.

The decision in one pass. If the payload is work, broker it. If it is a stream, log it. If it is a poke, route it.

Service Bus: queues and topics

Service Bus is the descendant of every message queue you have met: a namespace holds queues and topics, producers send messages into them, and consumers pull messages out. A queue is point-to-point. Many producers can send, many consumers can receive, but each message is delivered to exactly one consumer and removed once that consumer settles it. This is the work-distribution shape: ten workers pulling from one orders queue naturally share the load, and adding an eleventh worker needs no configuration at all.

A topic adds publish/subscribe on top of the same machinery. Producers send to the topic; the topic copies each message into one or more subscriptions, and each subscription behaves like its own queue with its own consumers, its own settlement, and its own dead-letter sub-queue. The copy step is where the routing logic lives. Every subscription carries a filter, written in a small SQL-like grammar, evaluated against the message's application properties. A subscription with the rule region = 'eu' AND priority > 2 receives only the messages that match; a TRUE filter receives everything. Filters can also have actions that rewrite properties as the message is copied, which is handy for stamping routing metadata without touching the producer.

This is a real architectural lever. One billing topic with subscriptions for invoicing, fraud checks, and analytics replaces three queues and a fan-out service you would otherwise have to write. Each subscriber team owns their subscription, their filter, and their backlog, and a slow analytics consumer cannot delay the fraud check, because each subscription is an independent copy of the stream.
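The copy-and-filter step is easy to model in a few lines. What follows is a toy in-memory sketch, not the Service Bus SDK: the Topic and Subscription classes are hypothetical, and plain Python predicates stand in for the SQL filter grammar.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Properties = Dict[str, Any]


@dataclass
class Subscription:
    """Hypothetical stand-in for a topic subscription: a filter plus its own backlog."""
    name: str
    matches: Callable[[Properties], bool]
    backlog: List[Properties] = field(default_factory=list)


class Topic:
    """Toy topic: copies each message into every subscription whose filter matches."""

    def __init__(self) -> None:
        self.subscriptions: List[Subscription] = []

    def subscribe(self, name: str, matches: Callable[[Properties], bool]) -> Subscription:
        sub = Subscription(name, matches)
        self.subscriptions.append(sub)
        return sub

    def send(self, properties: Properties) -> None:
        for sub in self.subscriptions:
            if sub.matches(properties):
                # Each subscription gets its own copy, so backlogs are independent.
                sub.backlog.append(dict(properties))


billing = Topic()
# Python stand-in for the SQL rule: region = 'eu' AND priority > 2
eu_urgent = billing.subscribe(
    "eu-urgent", lambda p: p.get("region") == "eu" and p.get("priority", 0) > 2)
# Stand-in for a TRUE filter: receives everything.
analytics = billing.subscribe("analytics", lambda p: True)

billing.send({"region": "eu", "priority": 3})
billing.send({"region": "us", "priority": 5})
billing.send({"region": "eu", "priority": 1})
```

Only the first message lands in eu-urgent, while analytics collects all three. Draining one backlog never touches the other, which is the isolation the paragraph above describes.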

Peek-lock: how settlement actually works

The heart of Service Bus, and the thing that separates a broker from a log, is the settlement model. There are two receive modes. Receive-and-delete hands you the message and deletes it immediately: fast, and fine for telemetry you can afford to drop, but if your process dies between receiving and acting, the message is gone. Peek-lock, the default and the mode you should assume, is a two-step protocol. The broker hands you the message and simultaneously locks it: it stays in the queue, invisible to other consumers, with a lock that expires after a configurable duration (up to five minutes, renewable). You do your work, then you settle.

Settlement has three honest outcomes. Complete tells the broker the work is done; the message is deleted forever. Abandon says you could not do it; the lock is released, the message becomes visible again, and its delivery count goes up by one. Dead-letter says this message is bad and retrying will not help; the broker moves it into the queue's dead-letter sub-queue, a real queue at queue/$DeadLetterQueue that you can receive from like any other. There is a fourth, implicit outcome: do nothing. If your process crashes mid-work, the lock simply expires and the message reappears, which is the crash-safety story in one sentence.

The settlement lifecycle. Every path out of "locked" is explicit except crash, which is just lock expiry plus a retry.

The delivery count is the poison-message circuit breaker. Every queue has a MaxDeliveryCount (default ten). Once a message has been handed out that many times without being completed, whether it was abandoned or its lock expired, the broker dead-letters it automatically rather than handing it out again. A message that crashes its consumer on every attempt gets ten tries and then lands in the DLQ with a reason attached, instead of grinding your workers in an infinite loop. Monitoring the DLQ depth is the single most useful Service Bus alert you can set.
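The settlement lifecycle fits in a small state machine. This is a toy in-memory model, assuming the counting rule described in the text (abandon or lock expiry bumps the count); ToyQueue and its method names are hypothetical, and lock timers are left out.

```python
import itertools
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Tuple


@dataclass
class Message:
    body: str
    delivery_count: int = 0
    dead_letter_reason: Optional[str] = None


class ToyQueue:
    """In-memory model of peek-lock settlement; real lock timers are omitted."""

    def __init__(self, max_delivery_count: int = 10) -> None:
        self.max_delivery_count = max_delivery_count
        self.active: Deque[Message] = deque()
        self.locked: Dict[int, Message] = {}
        self.dlq: List[Message] = []
        self._tokens = itertools.count(1)

    def send(self, body: str) -> None:
        self.active.append(Message(body))

    def receive(self) -> Optional[Tuple[int, Message]]:
        """Peek-lock receive: returns (lock_token, message), or None if nothing is deliverable."""
        while self.active:
            msg = self.active.popleft()
            if msg.delivery_count >= self.max_delivery_count:
                # Poison message: dead-letter it instead of handing it out again.
                msg.dead_letter_reason = "MaxDeliveryCountExceeded"
                self.dlq.append(msg)
                continue
            token = next(self._tokens)
            self.locked[token] = msg  # invisible to other receivers while locked
            return token, msg
        return None

    def complete(self, token: int) -> None:
        del self.locked[token]  # work done: the message is gone for good

    def abandon(self, token: int) -> None:
        msg = self.locked.pop(token)
        msg.delivery_count += 1
        self.active.appendleft(msg)  # visible again

    def expire_lock(self, token: int) -> None:
        self.abandon(token)  # a crashed consumer looks like an abandon to the broker

    def dead_letter(self, token: int, reason: str) -> None:
        msg = self.locked.pop(token)
        msg.dead_letter_reason = reason
        self.dlq.append(msg)


q = ToyQueue(max_delivery_count=2)
q.send("ship order 4711")
for _ in range(2):
    token, msg = q.receive()
    q.abandon(token)
result = q.receive()  # nothing deliverable: the message moved to the DLQ
```

After two abandons the third receive returns None and the message sits in q.dlq with its reason set, which is the same path the CLI lab walks with a limit of one.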

At-least-once, not exactly-once. Peek-lock guarantees a message is never lost while unsettled, but a crash after you finish the work and before you call complete means the message comes back. Your handler will see the occasional duplicate, so make it idempotent: key side effects on the message id, or use the duplicate detection described below for the send side.
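A minimal sketch of an idempotent handler, keyed on the message id. The function name, the in-memory set, and the charges dictionary are all hypothetical; a real system would keep the processed ids in a durable store, ideally written in the same database transaction as the side effect.

```python
from typing import Dict, Set

processed_ids: Set[str] = set()   # stand-in for a durable "already handled" table
charges: Dict[str, int] = {}      # stand-in for the real side effect


def handle_charge(message_id: str, order_id: str, amount_cents: int) -> bool:
    """Apply the charge once per message id; return True if work was done."""
    if message_id in processed_ids:
        # Duplicate delivery: safe to complete the message without acting again.
        return False
    charges[order_id] = charges.get(order_id, 0) + amount_cents
    processed_ids.add(message_id)
    return True


first = handle_charge("msg-1", "4711", 1999)
second = handle_charge("msg-1", "4711", 1999)  # redelivery after a crash before complete
```

The second call does nothing, so the customer is charged once even though the broker delivered the message twice.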

Sessions, duplicates, schedules, and transactions

A plain queue with competing consumers cannot promise ordering: two workers pull two messages and there is no telling which finishes first. Sessions fix this the same way Kafka's partition keys do, but at the broker. Producers stamp each message with a SessionId (an order id, a device id, a user id) and a session-enabled queue guarantees that all messages with the same session id are delivered, in order, to whichever consumer currently holds the lock on that session. One consumer owns one session at a time, so messages for order 4711 are processed strictly in sequence while thousands of other sessions proceed in parallel on other workers. Sessions also carry a small piece of broker-side state per session, which teams use to checkpoint multi-step workflows without an extra database.
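The session rule (one owner per session, strict order inside it) can be sketched as a toy model. SessionQueue and its methods are hypothetical names for illustration, not SDK calls, and session lock expiry and session state are left out.

```python
from collections import deque
from typing import Deque, Dict, Optional


class SessionQueue:
    """Toy session-enabled queue: messages are grouped by session id, one owner each."""

    def __init__(self) -> None:
        self.sessions: Dict[str, Deque[str]] = {}  # session id -> messages in arrival order
        self.owners: Dict[str, str] = {}           # session id -> consumer holding the lock

    def send(self, session_id: str, body: str) -> None:
        self.sessions.setdefault(session_id, deque()).append(body)

    def accept_next_session(self, consumer: str) -> Optional[str]:
        """Lock the first session that has messages and no current owner."""
        for session_id, messages in self.sessions.items():
            if messages and session_id not in self.owners:
                self.owners[session_id] = consumer
                return session_id
        return None

    def receive(self, consumer: str, session_id: str) -> Optional[str]:
        if self.owners.get(session_id) != consumer:
            raise PermissionError("session is locked by another consumer")
        messages = self.sessions[session_id]
        return messages.popleft() if messages else None


sq = SessionQueue()
sq.send("4711", "order created")
sq.send("4712", "order created")
sq.send("4711", "order paid")

session_a = sq.accept_next_session("worker-a")
session_b = sq.accept_next_session("worker-b")
order_4711 = [sq.receive("worker-a", "4711"), sq.receive("worker-a", "4711")]
```

Worker A sees both messages for order 4711 in send order, while worker B proceeds with 4712 in parallel and cannot reach into A's session.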

Duplicate detection handles the send side of the exactly-once problem. Enable it on a queue and the broker remembers every MessageId it has seen within a configurable window (up to seven days, default ten minutes) and silently drops repeats. A producer that times out and retries the same send, with the same message id, cannot enqueue the work twice. Combined with an idempotent consumer this gets you close enough to exactly-once for almost all business workloads.
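Here is a toy model of that window, assuming the broker simply remembers ids with a timestamp. DedupQueue is a hypothetical class, and time is passed in explicitly so the behaviour is easy to follow.

```python
from typing import Dict, List, Tuple


class DedupQueue:
    """Toy model of send-side duplicate detection over a time window."""

    def __init__(self, window_seconds: float = 600.0) -> None:  # ten minutes, the default in the text
        self.window_seconds = window_seconds
        self.seen: Dict[str, float] = {}          # message id -> time it was first accepted
        self.messages: List[Tuple[str, str]] = []

    def send(self, message_id: str, body: str, now: float) -> bool:
        """Return True if enqueued, False if silently dropped as a duplicate."""
        # Forget ids that have aged out of the detection window.
        self.seen = {mid: t for mid, t in self.seen.items()
                     if now - t < self.window_seconds}
        if message_id in self.seen:
            return False
        self.seen[message_id] = now
        self.messages.append((message_id, body))
        return True


dq = DedupQueue(window_seconds=600)
accepted = dq.send("order-4711", "ship it", now=0)
retry = dq.send("order-4711", "ship it", now=30)         # producer timed out and retried
much_later = dq.send("order-4711", "ship it", now=601)   # outside the window
```

The retry at 30 seconds is dropped; the send at 601 seconds is accepted again, which is why the window should comfortably exceed your producer's longest retry horizon.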

Two smaller tools round out the broker. Scheduled messages are sent now but become visible at a time you choose, which gives you delayed retries and "remind the customer in 24 hours" jobs without a cron service; cancelling the schedule before it fires is also supported. And transactions let you group operations against one namespace, such as completing the incoming message and sending two outgoing ones, so they commit or roll back together. The classic use is a processing step that consumes from one queue and produces to another: with a transaction, a crash cannot leave you having consumed without producing. The send-via pattern routes the outgoing send through the transfer queue so even cross-entity flows stay atomic. None of this spans to your database, though, so the outbox pattern still earns its keep where SQL is involved.

One pricing note worth knowing before an interview or an invoice: the Standard tier is pay-per-operation on shared infrastructure, while Premium buys dedicated messaging units with predictable latency, larger message sizes (up to 100 MB), and availability-zone replication. Production systems with latency promises generally live on Premium.

Event Hubs: the partitioned log

Event Hubs answers a different question: not "how do I hand out work safely" but "how do I ingest a million events a second and let several systems read them independently". The design is the log, the same idea Kafka made famous. An event hub is split into partitions, fixed at creation (1 to 32 on Standard, more on Premium and Dedicated). Each partition is an append-only sequence of events, ordered within itself, retained for a configurable period (one to seven days on Standard, up to 90 with Premium) and then aged out regardless of who has read what. Producers either let the service spread events round-robin across partitions or set a partition key, which hashes to a specific partition and guarantees every event with that key lands in the same ordered sequence — the per-device, per-user ordering trick again.
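The partition-key guarantee boils down to "same key, same hash, same partition". A minimal sketch follows; the service uses its own hash function, so the SHA-256 here is purely an illustrative assumption, and partition_for and send are hypothetical helpers.

```python
import hashlib
from typing import List, Tuple


def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key to a partition index (illustrative hash, not the service's)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# Four append-only partitions, each an ordered list of (key, reading) pairs.
partitions: List[List[Tuple[str, int]]] = [[] for _ in range(4)]


def send(key: str, reading: int) -> int:
    index = partition_for(key, len(partitions))
    partitions[index].append((key, reading))
    return index


for reading in range(3):
    for device in ("sensor-a", "sensor-b"):
        send(device, reading)
```

Every event for sensor-a sits in one partition in send order. Two keys may well share a partition, and that is fine: ordering is promised per partition, not per key across partitions.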

Consumers do not receive messages; they read partitions. Each reader attaches to a partition at some offset and walks forward at its own pace. The broker deletes nothing on read and tracks nothing about readers. That inversion is the whole point: because the broker does per-event bookkeeping for nobody, it can ingest at rates a settlement-based broker never could, and because events stay put, a new consumer can start from the beginning of retention and replay history that existing consumers already processed.

Two consumer groups over the same four partitions. The broker stores events once; every group brings its own bookmarks.

A consumer group is a named view over the hub: one set of offsets per group, per partition. Your alerting pipeline and your analytics pipeline read the same events through different groups, at different speeds, without seeing each other. Within a group, the convention enforced by the SDKs is at most one active reader per partition, and the EventProcessor client handles the tedious part: it spreads partitions across however many worker instances you run, rebalances when instances come and go, and persists checkpoints (the last processed offset per partition) to a blob storage container you provide. Note where the responsibility sits: the service does not store your position, your checkpoint store does. Checkpoint after a batch, not after every event, because each checkpoint is a blob write; and accept that a worker crash replays everything since its last checkpoint, which is the standard at-least-once deal.
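A toy model makes the bookkeeping concrete: one partition, a checkpoint store keyed by group and partition, and a reader that checkpoints once per batch. Everything here (run_reader, the checkpoints dictionary, the crash_after knob) is hypothetical scaffolding, not the EventProcessor client.

```python
from typing import Dict, List, Tuple

# One append-only partition holding ten events.
partition: List[str] = [f"event-{i}" for i in range(10)]

# Stand-in for the blob checkpoint store: (consumer group, partition id) -> next offset.
checkpoints: Dict[Tuple[str, int], int] = {}


def run_reader(group: str, partition_id: int, batch_size: int,
               crash_after: int = -1) -> List[str]:
    """Read from the group's last checkpoint, checkpointing once per batch.

    crash_after simulates the worker dying after processing that many events.
    """
    offset = checkpoints.get((group, partition_id), 0)
    processed: List[str] = []
    while offset < len(partition):
        batch = partition[offset:offset + batch_size]
        for event in batch:
            if len(processed) == crash_after:
                return processed  # crash: progress since the last checkpoint is lost
            processed.append(event)
        offset += len(batch)
        checkpoints[(group, partition_id)] = offset  # one write per batch, not per event
    return processed


first_run = run_reader("alerting", 0, batch_size=4, crash_after=6)
second_run = run_reader("alerting", 0, batch_size=4)   # restarts from the checkpoint
other_group = run_reader("analytics", 0, batch_size=4)  # independent bookmarks
```

The alerting worker crashes after six events with only the first batch checkpointed, so its restart replays event-4 and event-5 before moving on. The analytics group reads all ten from the start, untouched by any of it.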

Two features matter at the architecture level. Capture writes every event to Blob Storage or Data Lake automatically, in Avro, on a time or size cadence you set, so the cold path (batch analytics over months of data) falls out of configuration instead of a consumer you maintain. And the Kafka-compatible endpoint means any Kafka client at protocol 1.0 or later can produce to and consume from an event hub by changing the bootstrap server and turning on SASL — no code changes. Hub maps to topic, consumer group to consumer group. The compatibility has edges (no compacted topics, no Kafka Streams against it, no transactions), but for the common produce/consume case it turns "migrate off self-managed Kafka" from a rewrite into a config change. If the trade is on your whiteboard, the honest comparison is operational: Event Hubs gives you the Kafka model without brokers to patch, ZooKeeper or KRaft to babysit, or rebalances to tune, in exchange for caps on retention, partition counts, and protocol features that real Kafka does not have.

Capacity is bought in throughput units on the Standard tier: each TU is roughly 1 MB/s or 1,000 events/s in, and 2 MB/s out, with auto-inflate available to grow TUs under load (it does not shrink them back). Premium sells processing units with isolation and bigger limits; Dedicated sells whole clusters. The thing to remember is that partition count is your parallelism ceiling and is fixed at creation on most tiers, so pick it for the throughput you will need in two years, not the one you have today.
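A back-of-envelope sizing helper, using only the ingress figures quoted above and treating 1 MB as one million bytes. The function name is hypothetical and the result is a starting estimate, not a quota calculation.

```python
import math


def throughput_units_needed(events_per_sec: float, avg_event_bytes: float) -> int:
    """Estimate Standard-tier TUs for ingress: whichever limit bites first wins."""
    by_count = events_per_sec / 1_000                         # 1,000 events/s per TU
    by_bytes = events_per_sec * avg_event_bytes / 1_000_000   # roughly 1 MB/s per TU
    return max(1, math.ceil(max(by_count, by_bytes)))


small_events = throughput_units_needed(events_per_sec=5_000, avg_event_bytes=200)
large_events = throughput_units_needed(events_per_sec=2_000, avg_event_bytes=2_000)
```

Small events are bound by the event-count limit and large ones by the byte limit, so it is worth checking both before settling on a number.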

Event Grid: reactive routing

Event Grid is the odd one out because it pushes. A topic receives events; an event subscription attaches a destination to a topic, with optional filters on subject and event type; and Event Grid delivers each matching event to each subscription by calling it: a webhook, an Azure Function, a Service Bus queue or topic, an event hub, or a storage queue. There is no polling loop to run and nothing to read. Your handler is invoked when something happens, which is why Event Grid is the connective tissue of serverless designs on Azure.
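Subject and event-type filtering can be sketched as a toy matcher. The EventSubscription class and publish function are hypothetical; the field names loosely mirror the subject-prefix, subject-suffix, and event-type filters, and the sample subjects follow the shape of Blob Storage events.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Event = Dict[str, str]


@dataclass
class EventSubscription:
    """Toy model of an event subscription's envelope filters."""
    name: str
    subject_begins_with: str = ""
    subject_ends_with: str = ""
    included_event_types: List[str] = field(default_factory=list)  # empty means all types
    delivered: List[Event] = field(default_factory=list)

    def matches(self, event: Event) -> bool:
        type_ok = (not self.included_event_types
                   or event["eventType"] in self.included_event_types)
        subject = event["subject"]
        return (type_ok
                and subject.startswith(self.subject_begins_with)
                and subject.endswith(self.subject_ends_with))


def publish(event: Event, subscriptions: List[EventSubscription]) -> None:
    for sub in subscriptions:
        if sub.matches(event):
            sub.delivered.append(event)  # the real service would call the endpoint here


thumbnails = EventSubscription(
    name="thumbnails",
    subject_begins_with="/blobServices/default/containers/uploads/",
    subject_ends_with=".jpg",
    included_event_types=["Microsoft.Storage.BlobCreated"],
)
audit = EventSubscription(name="audit")  # no filters: receives everything

prefix = "/blobServices/default/containers/uploads/blobs/"
for event_type, blob in [("Microsoft.Storage.BlobCreated", "cat.jpg"),
                         ("Microsoft.Storage.BlobCreated", "notes.txt"),
                         ("Microsoft.Storage.BlobDeleted", "cat.jpg")]:
    publish({"eventType": event_type, "subject": prefix + blob}, [thumbnails, audit])
```

Only the created .jpg reaches the thumbnail subscription; the audit subscription sees all three events.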

The delivery contract is at-least-once with serious retry. If the endpoint does not answer with a success code, Event Grid retries on an exponential backoff schedule for up to 24 hours. If delivery never succeeds, the event can be dead-lettered to a blob container you nominate, so even a push system gives you a place to find what was dropped. Ordering is explicitly not promised, and events are small (think kilobytes, with a 1 MB ceiling), which reinforces what the service is for: pokes, not payloads. The standard pattern is to put a pointer in the event (the blob URL, the resource id) and let the handler fetch the data.
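The push contract reduces to a loop: call the endpoint until it answers with a success code, and dead-letter the event if the attempts run out. This sketch is a toy under stated assumptions: deliver is a hypothetical function, the attempt limit is arbitrary, the waits between attempts are omitted, and the blob URL is a made-up example of the pointer pattern.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]


def deliver(event: Event, endpoint: Callable[[Event], int],
            max_attempts: int, dead_letters: List[Event]) -> int:
    """Push an event until the endpoint returns 2xx; return the attempts made.

    The real service waits between attempts on a backoff schedule; that is left out here.
    """
    for attempt in range(1, max_attempts + 1):
        status = endpoint(event)
        if 200 <= status < 300:
            return attempt
    dead_letters.append(event)  # never acknowledged: park it where it can be found
    return max_attempts


# The event carries a pointer to the data, not the data itself (hypothetical URL).
blob_event = {"eventType": "Microsoft.Storage.BlobCreated",
              "url": "https://example.blob.core.windows.net/uploads/report.csv"}

responses = iter([503, 503, 200])


def flaky_endpoint(event: Event) -> int:
    return next(responses)  # fails twice, then succeeds


def broken_endpoint(event: Event) -> int:
    return 503  # never recovers


dead_letters: List[Event] = []
flaky_attempts = deliver(blob_event, flaky_endpoint, max_attempts=5, dead_letters=dead_letters)
broken_attempts = deliver(blob_event, broken_endpoint, max_attempts=5, dead_letters=dead_letters)
```

Because a retry can arrive after a slow success, the handler on the receiving end should be as idempotent as any Service Bus consumer.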

The killer feature is system topics: most Azure services publish their lifecycle events to Event Grid natively. Blob created or deleted, resource group changed, container image pushed, key vault secret near expiry, subscription cost threshold hit — all of these are subscribable events with no agent or polling on your side. "Run a function whenever a blob lands in this container" is one event subscription, and it is the canonical Event Grid example for a reason. Custom topics let your own applications publish the same way, and Event Grid namespaces add an MQTT broker for IoT-style fan-in if you need it.

Choosing, and combining

The single-sentence test from the top settles most cases: work goes to Service Bus, streams go to Event Hubs, pokes go to Event Grid. The classic mistakes are using Event Hubs as a work queue (there is no settlement, no per-message retry, no DLQ, so one bad event stalls a partition or gets skipped by hand-rolled logic) and using Service Bus as a telemetry pipe (you will pay per message for bookkeeping you do not need, and hit throughput walls a log would shrug at). Event Grid gets misused as a data channel; it is a doorbell, not a delivery truck.

Real systems combine them, and one pipeline shows the seams clearly. Devices fire telemetry into Event Hubs, which absorbs the volume. A Functions app reads the stream with the Event Hubs trigger, aggregates and filters, and for the small fraction of readings that demand action, sends a command to a Service Bus queue — because from this point each item is work that must survive crashes, retries, and a poison-message path. Workers consume with peek-lock and the DLQ catches the strays. Meanwhile Event Grid watches the storage account where Event Hubs Capture lands its Avro files, and kicks off a batch job whenever a new file appears. Each service does the one thing it is shaped for, and the joints between them are exactly where the delivery semantics change.

The Rosetta Stone: AWS and GCP equivalents

If you carry mental models over from another cloud, the mapping is useful but lossy, so it pays to know where the seams are.

Azure	AWS	GCP	Where the analogy leaks
Service Bus queue	SQS	Pub/Sub (pull)	SQS FIFO caps throughput per group; Service Bus sessions scale by session count and add session state
Service Bus topic	SNS → SQS fan-out	Pub/Sub topic	Service Bus does broker-side SQL filters and actions; SNS filtering is JSON policies, less expressive
Event Hubs	Kinesis Data Streams / MSK	Pub/Sub (closest)	Pub/Sub hides partitions entirely; Kinesis shards resemble partitions but bill differently
Event Grid	EventBridge	Eventarc	EventBridge adds a schema registry and content-based rules on the body; Event Grid filters on envelope fields

The deeper difference is philosophical. AWS composes small primitives: you wire SNS to SQS yourself and bring your own dedup keys. Azure builds the features into the broker: duplicate detection, sessions, transactions, and scheduled delivery are checkboxes on the entity rather than patterns you assemble. GCP's Pub/Sub collapses queue and stream into one service and hides partitioning, which is simpler until you need strict per-key ordering or replay semantics, where Event Hubs' explicit partitions map more directly to how you reason about Kafka-style systems. The full AWS-side story is on the AWS messaging page if you want the mirror image of this one.

CLI lab: settle a message, read a stream

Two exercises, one resource group, about fifteen minutes including the namespace provisioning waits. The first creates a Service Bus queue with a deliberately tiny delivery limit, then walks a message through peek-lock, abandon, and into the dead-letter queue using nothing but the REST API, so you see the settlement verbs with your own eyes. The second creates an event hub with four partitions, sends a batch, and reads it back through a fresh consumer group. You need the az CLI logged in, plus python3 and curl.

Setup. One resource group and a Service Bus namespace. Namespace names are global, hence the random suffix.

RG=msg-lab
LOC=westeurope
NS=sblab$RANDOM

az group create -n $RG -l $LOC
az servicebus namespace create -g $RG -n $NS --sku Standard

1. A queue with a hair-trigger DLQ. Max delivery count of one means a single failed delivery dead-letters the message — never do this in production, perfect for a lab.

az servicebus queue create -g $RG --namespace-name $NS \
  -n orders --max-delivery-count 1

KEY=$(az servicebus namespace authorization-rule keys list \
  -g $RG --namespace-name $NS -n RootManageSharedAccessKey \
  --query primaryKey -o tsv)

The data plane wants a SAS token, which is an HMAC-SHA256 signature over the URL-encoded resource URI and an expiry timestamp. One Python one-liner builds it from the key you just fetched.

export NS KEY
SAS=$(python3 -c "import urllib.parse,hmac,hashlib,base64,time,os; \
uri=urllib.parse.quote_plus('https://'+os.environ['NS']+'.servicebus.windows.net'); \
exp=str(int(time.time())+3600); \
sig=urllib.parse.quote_plus(base64.b64encode(hmac.new(os.environ['KEY'].encode(), \
(uri+'\n'+exp).encode(),hashlib.sha256).digest())); \
print('SharedAccessSignature sr='+uri+'&sig='+sig+'&se='+exp+'&skn=RootManageSharedAccessKey')")
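For readers who would rather see that one-liner unfolded, here is the same computation as a function. The function name and the now parameter are additions for illustration; the signing steps mirror the shell snippet above.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse
from typing import Optional


def make_sas_token(resource_uri: str, key_name: str, key: str,
                   ttl_seconds: int = 3600, now: Optional[float] = None) -> str:
    """Build a SharedAccessSignature token the same way the one-liner does."""
    issued_at = time.time() if now is None else now
    expiry = str(int(issued_at + ttl_seconds))
    encoded_uri = urllib.parse.quote_plus(resource_uri)
    # The string to sign is the URL-encoded URI, a newline, then the expiry.
    string_to_sign = (encoded_uri + "\n" + expiry).encode("utf-8")
    digest = hmac.new(key.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    signature = urllib.parse.quote_plus(base64.b64encode(digest))
    return (f"SharedAccessSignature sr={encoded_uri}"
            f"&sig={signature}&se={expiry}&skn={key_name}")


# Placeholder namespace and key, with a fixed clock so the output is repeatable.
token = make_sas_token("https://sblab123.servicebus.windows.net",
                       "RootManageSharedAccessKey", "not-a-real-key",
                       now=1_700_000_000)
```

The token is only as narrow as the URI you sign: signing the namespace root, as the lab does, grants access to every entity in it, which is fine for a throwaway lab and too broad for production.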

2. Send, then receive with peek-lock. The POST to messages/head is the peek-lock receive; note the Location header in the response — that URL is your lock, and the HTTP verb you use on it is the settlement.

# send
curl -s -X POST "https://$NS.servicebus.windows.net/orders/messages" \
  -H "Authorization: $SAS" -H "Content-Type: text/plain" \
  -d "ship order 4711"

# peek-lock receive: 201, body returned, Location header = the lock
curl -si -X POST "https://$NS.servicebus.windows.net/orders/messages/head?timeout=20" \
  -H "Authorization: $SAS" | tee /tmp/receipt.txt

LOCK=$(grep -i '^location:' /tmp/receipt.txt | tr -d '\r' | cut -d' ' -f2)

3. Settle it the failing way. A DELETE on the lock URL would be complete. A PUT is abandon: the message goes back on the queue and its delivery count ticks up to one, which already equals the max. The next receive attempt does not deliver it; the broker dead-letters it instead, and you can pull it from $DeadLetterQueue with the dead-letter reason in the headers.

# abandon — unlock, delivery count is now 1 of max 1
curl -s -X PUT "$LOCK" -H "Authorization: $SAS"

# try to receive again: 204 No Content — it has moved to the DLQ
curl -si -X POST "https://$NS.servicebus.windows.net/orders/messages/head?timeout=20" \
  -H "Authorization: $SAS" | head -1

# read the dead-letter queue (receive-and-delete this time)
curl -si -X DELETE \
  "https://$NS.servicebus.windows.net/orders/\$DeadLetterQueue/messages/head?timeout=20" \
  -H "Authorization: $SAS"

Look at the headers on that last response: the dead-letter reason, MaxDeliveryCountExceeded, travels with the message as a DeadLetterReason property next to the BrokerProperties header, and it is exactly what your production alert should be reading. That is the whole settlement model in a handful of curl commands: POST to lock, DELETE to complete, PUT to abandon, and the broker handling poison for you.

4. An event hub with four partitions. Plus a consumer group of our own, because sharing $Default across pipelines is how offsets get trampled.

EHNS=ehlab$RANDOM
az eventhubs namespace create -g $RG -n $EHNS --sku Standard
az eventhubs eventhub create -g $RG --namespace-name $EHNS \
  -n telemetry --partition-count 4
az eventhubs eventhub consumer-group create -g $RG \
  --namespace-name $EHNS --eventhub-name telemetry -n analytics

CONN=$(az eventhubs namespace authorization-rule keys list \
  -g $RG --namespace-name $EHNS -n RootManageSharedAccessKey \
  --query primaryConnectionString -o tsv)
export CONN

5. Produce a batch, keyed so each device sticks to one partition. The data plane is AMQP, so the SDK is the honest tool here (pip install azure-eventhub). Save and run:

# send.py
import os
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    os.environ["CONN"], eventhub_name="telemetry")
for device in ("sensor-a", "sensor-b", "sensor-c"):
    batch = producer.create_batch(partition_key=device)
    for i in range(5):
        batch.add(EventData(device + " reading " + str(i)))
    producer.send_batch(batch)
producer.close()
print("sent 15 events across 3 partition keys")

6. Read it back through the analytics group. Starting position -1 means the beginning of retention — this is the replay everyone talks about. Watch the partition ids: each sensor's readings arrive in order, all from the same partition. Ctrl-C when the events stop.

# read.py
import os
from azure.eventhub import EventHubConsumerClient

consumer = EventHubConsumerClient.from_connection_string(
    os.environ["CONN"], consumer_group="analytics", eventhub_name="telemetry")

def on_event(ctx, event):
    print("partition", ctx.partition_id, "offset", event.offset,
          "→", event.body_as_str())

with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")

Run read.py a second time and the same fifteen events print again: nothing was consumed, because a log does not consume. If you wired a checkpoint store (a blob container passed to the client) and called ctx.update_checkpoint(event) in the handler, the second run would start where the first left off instead. That difference is the entire Service Bus versus Event Hubs argument in two runs of one script.

7. Teardown. Everything lives in the one group.

az group delete -n $RG --yes --no-wait
