Exactly-Once Semantics | CDC: The Missing Manual

End-to-end "exactly-once delivery" across heterogeneous systems (DB → Kafka → warehouse/search) is not achievable in general. What we actually ship is Exactly-Once Processing (EOP): combine At-Least-Once (ALO) transport with idempotent sinks so replays don’t create duplicates.

Scope matters:

- Source connectors (e.g., Debezium): ALO into Kafka. They can’t make the external database “transactional” with Kafka.
- Within Kafka→Kafka topologies: EOS is possible using Kafka transactions (idempotent producers + transactional offsets), but the scope is limited to Kafka topics.
- Kafka → external sinks: Effective EOS comes from idempotent upserts (or staging+MERGE) at the destination, not from transport guarantees alone.

How a CDC pipeline behaves during failures is defined by its data delivery guarantees:

- At-Most-Once: The weakest guarantee. A message is sent once, but if a failure occurs before it's processed, the message is lost forever. Unsuitable for critical data.
- At-Least-Once: The most common default. The system ensures every message is delivered, but a message might be delivered more than once during a retry. This prevents data loss but requires consumers to handle duplicates.
- Exactly-Once Processing: The effective goal. The system delivers every message at least once, and the consumer is designed to process each unique message precisely one time, even if it receives duplicates.

The Problem: Duplicates in At-Least-Once Systems

Imagine a consumer reads an event from a Kafka topic, processes it (writes to a database), and then crashes before it can commit the topic offset. When the consumer restarts, it resumes from the last committed offset and re-reads the same event, so the write is applied a second time. The sequence is: consume the message, process and write to the sink, crash, then restart and re-process.
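The failure sequence can be reproduced without Kafka. The following is a minimal sketch, assuming an in-memory list stands in for the topic, another list for the sink, and a dict for the committed offset; `Crash`, `consume`, and `crash_at` are hypothetical names used only for illustration.

```python
class Crash(Exception):
    """Simulated process failure."""


def consume(log, sink, state, crash_at=None):
    """Naive loop: write to the sink, then commit the offset.

    state["offset"] models the last committed offset and survives a crash.
    crash_at simulates a failure after the sink write but before the commit.
    """
    while state["offset"] < len(log):
        pos = state["offset"]
        sink.append(log[pos])          # plain insert: no duplicate check
        if crash_at == pos:
            raise Crash(pos)           # the offset for this event is never committed
        state["offset"] = pos + 1      # commit the offset


log = ["order-1 created", "order-2 created", "order-3 created"]
sink = []
state = {"offset": 0}

try:
    consume(log, sink, state, crash_at=1)   # crash after writing order-2
except Crash:
    pass

consume(log, sink, state)                   # restart from the committed offset

print(sink)
# ['order-1 created', 'order-2 created', 'order-2 created', 'order-3 created']
```

The second `order-2 created` entry is the duplicate: the write reached the sink, but the offset that would have marked it as done was never committed.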

The Solution: Idempotent Consumers

Idempotency is the property of an operation whereby applying it multiple times produces the same result as applying it once. An idempotent consumer can safely process the same message multiple times with no additional effect after the first application. This requires two things:

1. A Unique, Deterministic Event ID: Every change event must have a unique identifier. This can be a UUID from an outbox table or a composite key of the source table's primary key and the transaction's log position.
2. Deduplication Logic at the Sink: The consumer uses this ID to check if the event has already been processed before applying the change. Common patterns include:
- Using `MERGE` or `UPSERT`: The sink database handles the logic of creating a new record or updating an existing one atomically. This is the preferred method.
- Using a Deduplication Table: The consumer first tries to `INSERT` the event ID into a `processed_events` table. If it succeeds, it proceeds; if it fails (due to a primary key violation), it knows the event is a duplicate and can be safely ignored.
- Staging + MERGE: Land events into a staging table keyed by event_id, then MERGE into the target to make replays safe.

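Commit order: write to the sink first, then commit offsets. If offsets are committed first, a crash before the sink write drops the event permanently, because the consumer resumes past it. With the sink write first, a crash only causes a replay after restart, which an idempotent sink tolerates.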
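The effect of the two orderings can be compared in a small simulation. This is a sketch under simplifying assumptions: the sink is a dict keyed by event ID (so writes behave like upserts), the committed offset lives in a dict, and `run`, `simulate`, and `crash_at` are hypothetical names.

```python
class Crash(Exception):
    """Simulated failure between the sink write and the offset commit."""


def run(log, sink, state, commit_first, crash_at=None):
    """Process events starting at the last committed offset."""
    while state["offset"] < len(log):
        pos = state["offset"]
        eid, payload = log[pos]
        if commit_first:
            state["offset"] = pos + 1      # offset committed before the write
            if crash_at == pos:
                raise Crash(pos)
            sink[eid] = payload
        else:
            sink[eid] = payload            # idempotent write keyed by event ID
            if crash_at == pos:
                raise Crash(pos)
            state["offset"] = pos + 1      # offset committed after the write


def simulate(commit_first):
    log = [("e1", "a"), ("e2", "b"), ("e3", "c")]
    sink, state = {}, {"offset": 0}
    try:
        run(log, sink, state, commit_first, crash_at=1)
    except Crash:
        pass
    run(log, sink, state, commit_first)    # restart after the crash
    return sorted(sink)


lost = simulate(commit_first=True)     # ['e1', 'e3']: e2 was skipped
safe = simulate(commit_first=False)    # ['e1', 'e2', 'e3']: e2 was replayed
```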

Naive Consumer

// Pseudocode: may create duplicates
for (const event of stream) {
  // A plain insert creates a new record on every redelivery.
  insert_into_sink(event.payload);
  // A crash here, before the commit, replays the event after restart.
  commit_offset();
}

Idempotent Consumer

// Pseudocode: safe from duplicates
for (const event of stream) {
  // MERGE/UPSERT keyed on the primary key creates the record
  // or overwrites the existing one, so a replay is harmless.
  upsert_into_sink(event.key, event.payload);
  commit_offset();
}
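A runnable version of the idempotent loop might look like the following. It is a minimal sketch, assuming SQLite (3.24 or later, for `ON CONFLICT ... DO UPDATE`) as a stand-in sink and a Python list as the stream; the `orders` table and the event shape are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")


def upsert_into_sink(conn, key, payload):
    """Insert the row, or overwrite it if the primary key already exists."""
    with conn:
        conn.execute(
            """
            INSERT INTO orders (order_id, status) VALUES (?, ?)
            ON CONFLICT(order_id) DO UPDATE SET status = excluded.status
            """,
            (key, payload["status"]),
        )


stream = [
    {"key": "42", "payload": {"status": "created"}},
    {"key": "42", "payload": {"status": "paid"}},
    {"key": "42", "payload": {"status": "paid"}},  # redelivered after a retry
]

for event in stream:
    upsert_into_sink(conn, event["key"], event["payload"])
    # The offset commit would happen here, after the sink write.

rows = conn.execute("SELECT order_id, status FROM orders").fetchall()
# [('42', 'paid')]: one row, regardless of the redelivery
```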

Warehouse MERGE (Snowflake-style)

MERGE INTO dw.orders AS t
USING (SELECT :event_id AS event_id, :order_id AS order_id, :payload::variant AS p) s
ON t.event_id = s.event_id
WHEN NOT MATCHED THEN
  INSERT (event_id, order_id, payload) VALUES (s.event_id, s.order_id, s.p)
WHEN MATCHED THEN
  UPDATE SET payload = s.p;  -- idempotent overwrite

BigQuery MERGE (idempotent)

MERGE `dw.orders` t
USING (SELECT @event_id AS event_id, @order_id AS order_id, @payload AS payload) s
ON t.event_id = s.event_id
WHEN NOT MATCHED THEN
  INSERT (event_id, order_id, payload) VALUES (s.event_id, s.order_id, s.payload)
WHEN MATCHED THEN
  UPDATE SET payload = s.payload;
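The MERGE statements above handle one event per call. The staging variant applies the same idea to a batch: land the events in a staging table keyed by `event_id`, then merge the whole table into the target. The sketch below emulates this in SQLite, which has no `MERGE` statement, using `INSERT ... SELECT ... ON CONFLICT` instead; the `staging_orders` and `dw_orders` tables and the `land` and `merge` helpers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (event_id TEXT PRIMARY KEY, order_id TEXT, payload TEXT);
    CREATE TABLE dw_orders      (event_id TEXT PRIMARY KEY, order_id TEXT, payload TEXT);
""")


def land(conn, batch):
    """Land a batch in staging; re-landing the same event_id overwrites it."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO staging_orders (event_id, order_id, payload) "
            "VALUES (?, ?, ?)",
            batch,
        )


def merge(conn):
    """Merge staging into the target and clear staging in one transaction."""
    with conn:
        # SQLite needs a WHERE clause on the SELECT to parse the upsert.
        conn.execute("""
            INSERT INTO dw_orders (event_id, order_id, payload)
            SELECT event_id, order_id, payload FROM staging_orders WHERE true
            ON CONFLICT(event_id) DO UPDATE SET payload = excluded.payload
        """)
        conn.execute("DELETE FROM staging_orders")


batch = [("e1", "o-1", '{"status": "created"}'), ("e2", "o-2", '{"status": "created"}')]

land(conn, batch)
merge(conn)
land(conn, batch)   # the same batch is replayed after a retry
merge(conn)

count = conn.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]  # 2
```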

The Transactional Outbox Pattern (Solving the Dual-Write Problem)

A common anti-pattern is the "dual-write," where an application first writes to its database and *then*, in a separate network call, tries to publish a message. If the application crashes between these two steps, the systems become inconsistent. The Transactional Outbox Pattern solves this.

Figure: Transactional outbox pattern sequence diagram. An atomic database transaction writes both the business data and the event record, then a CDC relay publishes the event to the event bus.

An application writes both its business data (an `orders` record) and an event record to an "outbox" table within the same single, atomic database transaction.
A log-based CDC process monitors the outbox table. When it detects the committed transaction in the database log, it reliably reads the event from the log and publishes it to a message bus like Kafka.
This guarantees that an event is published if, and only if, the corresponding business transaction was successfully committed to the database. Data consistency is preserved.

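Important scope note: Outbox gives you atomicity between the business write and its event, so no events are lost and no phantom events are published for rolled-back transactions. It is not exactly-once delivery: the CDC relay into Kafka is typically at-least-once, and the pattern does not make your downstream warehouses exactly-once by itself. You still need idempotent consumers at the sinks.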
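The write side of the pattern can be sketched as follows. This is an illustration only, assuming SQLite as the application database; the table layouts, `place_order`, and the `fail_midway` switch (which simulates a crash between the two writes) are hypothetical. The CDC relay that reads the outbox from the database log is not modeled.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (event_id TEXT PRIMARY KEY, event_type TEXT, payload TEXT);
""")


def place_order(conn, order_id, total_cents, fail_midway=False):
    """Write the business row and its event in one atomic transaction."""
    with conn:  # commits on success, rolls back both writes on any exception
        conn.execute(
            "INSERT INTO orders (order_id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        if fail_midway:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (
                str(uuid.uuid4()),  # event ID that consumers can deduplicate on
                "OrderCreated",
                json.dumps({"order_id": order_id, "total_cents": total_cents}),
            ),
        )


place_order(conn, "o-1", 1999)
try:
    place_order(conn, "o-2", 500, fail_midway=True)
except RuntimeError:
    pass

orders = [r[0] for r in conn.execute("SELECT order_id FROM orders ORDER BY order_id")]
events = [json.loads(r[0])["order_id"] for r in conn.execute("SELECT payload FROM outbox")]
# orders == ['o-1'] and events == ['o-1']: the failed order left neither a row nor an event
```

Compared with a dual write, there is no window in which the order exists without its event or the event exists without its order.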
