CDC Glossary
The vocabulary you need to talk about Change Data Capture without hand-waving. Definitions stay vendor-agnostic where they can and call out the specific name each database/broker uses where they can't.
About this glossary
Each entry has a stable anchor. Deep-link from a module page or a
Slack thread with /letstalkcdc/glossary/#tombstone
and the reader lands on the definition directly. Aliases list
other names you'll see in the wild; "see also" cross-links the
related entries.
Definitions
- # Checkpoint
-
The persisted record of "I have successfully processed everything up to LSN/SCN X" stored by a CDC consumer. Restarts replay from the checkpoint, so a missed flush means duplicates on resume; idempotent sinks handle this. Most connectors checkpoint after a confirmed sink write, not after a read.
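The write-then-checkpoint ordering described above can be sketched as a toy consumer. This is a minimal illustration, not any connector's actual implementation: the `Event` and `Consumer` classes and the `crash_before_flush_at` parameter are hypothetical, and the sink is a plain dict standing in for a keyed upsert target.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    lsn: int    # log position assigned by the source
    key: str
    value: str


@dataclass
class Consumer:
    """Toy CDC consumer that checkpoints only after the sink write."""
    sink: dict = field(default_factory=dict)     # keyed upsert target (idempotent)
    checkpoint: int = 0                          # last LSN known to be applied
    applied: list = field(default_factory=list)  # every sink write, for inspection

    def run(self, log, crash_before_flush_at=None):
        for event in log:
            if event.lsn <= self.checkpoint:
                continue                         # already covered by the checkpoint
            self.sink[event.key] = event.value   # 1. write to the sink
            self.applied.append(event.lsn)
            if event.lsn == crash_before_flush_at:
                return                           # simulated crash: the flush is missed
            self.checkpoint = event.lsn          # 2. then persist the checkpoint


log = [Event(1, "a", "v1"), Event(2, "b", "v1"), Event(3, "a", "v2")]
consumer = Consumer()
consumer.run(log, crash_before_flush_at=2)  # LSN 2 is written, checkpoint stays at 1
consumer.run(log)                           # restart replays from the checkpoint

print(consumer.applied)  # LSN 2 shows up twice: the duplicate on resume
print(consumer.sink)     # the keyed write absorbs the duplicate
```

Because the sink write is keyed, replaying LSN 2 leaves the end state unchanged. An append-only sink would have stored the row twice.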
- # Compaction
-
Kafka's "keep the latest value per key" retention mode. Combined with tombstones it gives you a materialized view of state on the topic that a late-joining consumer can rebuild without replaying every historical write, which is useful for downstream state stores. The gotcha: compaction is a background process, not instant. Until it actually runs, readers still see every intermediate value; until delete.retention.ms elapses, tombstones linger.
- # Dead-letter queue (also: DLQ)
-
A side topic / table where the connector parks events it can't process — typically schema mismatches, deserialization failures, or sink errors. Without a DLQ, the connector either drops the event (data loss) or stalls the whole partition (head-of-line blocking). With one, the bad records are visible and triageable.
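A minimal sketch of the routing decision, assuming JSON payloads and a hypothetical required `id` field. The `route` function and the shape of the dead-letter record are illustrative only; real connectors expose this through their own error-handling configuration.

```python
import json


def route(raw_records):
    """Split raw payloads into processed events and dead-lettered records."""
    processed, dead_letters = [], []
    for offset, raw in enumerate(raw_records):
        try:
            event = json.loads(raw)
            if "id" not in event:          # hypothetical required field
                raise KeyError("id")
            processed.append(event)
        except (json.JSONDecodeError, KeyError) as exc:
            # Park the record with enough context to triage it later,
            # then keep going so the rest of the partition is not blocked.
            dead_letters.append(
                {"offset": offset, "payload": raw, "error": repr(exc)}
            )
    return processed, dead_letters


records = ['{"id": 1, "op": "c"}', '{not json', '{"op": "u"}', '{"id": 2, "op": "d"}']
processed, dead_letters = route(records)

print([e["id"] for e in processed])          # events that went through
print([d["offset"] for d in dead_letters])   # offsets parked for triage
```

The two bad records end up in `dead_letters` with their offset and error, and the good record behind them is still processed.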
- # Effectively-once
-
The pragmatic alternative to exactly-once: the source is at-least-once, but the sink's idempotent writes plus a durable event_id ledger collapse duplicates so the observable end state matches an exactly-once delivery. This is what most production CDC pipelines actually ship.
- # Exactly-once
-
The (mostly aspirational) guarantee that every source event lands in the sink exactly once, with no duplicates and no drops. End-to-end exactly-once across a CDC pipeline requires coordinated transactions in the source, the broker, and the sink — most stacks settle for "effectively-once" (at-least-once delivery + idempotent sinks + a deduplication ledger). The errata page covers the specific traps.
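The "at-least-once delivery + idempotent sink + deduplication ledger" combination can be sketched with SQLite from the Python standard library. The table names, the `apply_event` helper, and the balance-delta payload are assumptions made for illustration; the upsert syntax needs SQLite 3.24 or newer. The key point is that the ledger insert and the sink write commit in the same transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE applied_events (event_id TEXT PRIMARY KEY);
""")


def apply_event(event_id, account_id, delta):
    """Apply a balance delta once per event_id; return True if it was applied."""
    with conn:  # one transaction: ledger row and sink write commit together
        cur = conn.execute(
            "INSERT OR IGNORE INTO applied_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery: event_id is already in the ledger
        conn.execute(
            "INSERT INTO accounts (id, balance) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET balance = balance + excluded.balance",
            (account_id, delta),
        )
        return True


# At-least-once delivery: e1 and e2 each arrive twice.
deliveries = [("e1", 1, 100), ("e2", 1, -30), ("e2", 1, -30), ("e1", 1, 100)]
results = [apply_event(*d) for d in deliveries]
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

print(results)  # which deliveries were actually applied
print(balance)  # end state matches a single delivery of each event
```

A delta is deliberately not idempotent on its own, which is why the ledger is needed here. A pure keyed overwrite would tolerate the replay without one.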
- # Idempotent write
-
An operation that produces the same end state regardless of how many times it's replayed. Keyed MERGE/UPSERT on a stable primary key is the most common pattern. Without idempotency, at-least-once delivery from the source amplifies into duplicate rows in the sink on every connector restart.
- # LSN / SCN (also: LSN, SCN, log sequence number, system change number)
-
A monotonically increasing identifier the database assigns to each log position. Postgres calls it the LSN (Log Sequence Number); Oracle calls it the SCN (System Change Number); MySQL's GTID + binlog coordinates and SQL Server's __$start_lsn serve the same purpose. CDC consumers checkpoint progress as the last applied LSN/SCN so a restart resumes without gaps; any events applied after the last flushed checkpoint are replayed, which is why sinks need to be idempotent.
- # Lag
-
The time between a source commit and the corresponding sink apply. Track p50/p95/p99, not just the mean — CDC lag is bursty (DDL, large transactions, consumer restarts), and the average hides the tail you actually have to capacity-plan against.
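To see why the mean misleads, here is a small sketch using a nearest-rank percentile over made-up lag samples. The `percentile` helper and the sample values are illustrative assumptions; a production pipeline would normally read these from its metrics system.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 100 hypothetical lag samples in milliseconds: mostly calm, plus one burst
# (for example a large transaction or a consumer restart).
lag_ms = [200] * 90 + [500] * 5 + [30_000] * 5

avg = mean(lag_ms)
p50 = percentile(lag_ms, 50)
p95 = percentile(lag_ms, 95)
p99 = percentile(lag_ms, 99)

print(avg, p50, p95, p99)  # 1705 200 500 30000
```

The mean of 1705 ms describes neither the typical event (200 ms) nor the burst (30 s), which is the tail that capacity planning has to cover.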
- # Log retention
-
How long the source database keeps WAL/binlog/redo segments before recycling them. If a CDC consumer falls behind and its checkpoint LSN ages out, the connector can't resume and has to bootstrap with a full snapshot. Tune retention to cover your worst-case consumer outage plus a margin; on Postgres this means wal_keep_size (or replication slots, which are stricter) and max_slot_wal_keep_size.
- # Partition key
-
The field whose hash decides which Kafka partition (or equivalent) a message lands on. CDC ordering is guaranteed per partition key, not cross-key: if you partition by user_id, every event for a given user is ordered, but events across users are not. Choose the key to match the unit of ordering your downstream actually needs.
- # Schema evolution
-
The discipline of versioning the event payload's shape so producers can add fields without breaking consumers. Additive changes (new optional fields) are safe; renames and removals require a migration window with both the old and new field present. Schema Registry (Confluent, Karapace, AWS Glue) enforces compatibility rules at produce time.
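The additive-versus-breaking distinction can be sketched as a tiny checker. This is a deliberately simplified rule set, not how any Schema Registry evaluates its compatibility modes; the schema shape (a dict of field name to `type` and `optional`) and the `breaking_changes` function are assumptions for illustration.

```python
def breaking_changes(old, new):
    """Return reasons why `new` would break consumers that still expect `old`."""
    problems = []
    for name in old:
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name]["type"] != old[name]["type"]:
            problems.append(f"type changed: {name}")
    for name, spec in new.items():
        if name not in old and not spec.get("optional", False):
            problems.append(f"new required field: {name}")
    return problems


v1 = {"id": {"type": "int"}, "email": {"type": "string"}}

# Additive change: a new optional field.
v2_additive = {**v1, "phone": {"type": "string", "optional": True}}

# Rename done in one step: looks like a removal plus a new required field.
v2_rename = {"id": {"type": "int"}, "email_address": {"type": "string"}}

# Migration window: old and new field names are both present.
v2_window = {**v1, "email_address": {"type": "string", "optional": True}}

print(breaking_changes(v1, v2_additive))  # []
print(breaking_changes(v1, v2_rename))
print(breaking_changes(v1, v2_window))    # []
```

The one-step rename is flagged twice, while the migration-window version passes because the old field is still there for existing consumers.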
- # Snapshot (also: initial snapshot, incremental snapshot)
-
The bootstrap process: read the current state of every row, emit it as change events, then switch to streaming the log. Initial snapshots can interleave with live changes, so design consumers to reconcile by version column or op_ts. Incremental snapshots (signal-based, popularized by Debezium) let you re-snapshot a subset without taking down the whole connector, at the cost of potential duplicates the sink has to dedupe.
- # Tombstone
-
A Kafka message with a null value and a populated key, used by log compaction as the "delete this key" marker. In Debezium and similar CDC producers, a source-row delete is carried by the normal change event (the envelope's op field is "d" and the before block holds the pre-delete row); the tombstone is an optional follow-up message that lets a compacted topic eventually drop the key. Sinks read the delete from the change event, not from the tombstone. Tombstones need a non-zero delete.retention.ms long enough for every consumer to see them before compaction reclaims the slot.
- # WAL / Redo log (also: WAL, binlog, redo log, T-log)
-
The database's append-only record of committed writes. Log-based CDC tools tail this log instead of polling the source tables — Postgres calls it the WAL (Write-Ahead Log), MySQL the binlog, Oracle the redo log, SQL Server the transaction log (T-log). All four serve the same role: a durable, ordered stream of every committed change.
Missing a term?
The glossary is data-driven — add entries to
src/_data/glossary.mjs and the page picks them up at
build time. Anchor IDs are stable: don't rename a slug
after it's been linked to externally.