Reference

CDC Glossary

The vocabulary you need to talk about Change Data Capture without hand-waving. Definitions stay vendor-agnostic where they can and call out the specific name each database/broker uses where they can't.

About this glossary

Each entry has a stable anchor. Deep-link from a module page or a Slack thread with /letstalkcdc/glossary/#tombstone and the reader lands on the definition directly. Aliases list other names you'll see in the wild; "see also" cross-links the related entries.

Definitions

# Checkpoint

The persisted record of "I have successfully processed everything up to LSN/SCN X" stored by a CDC consumer. Restarts replay from the checkpoint, so a missed flush means duplicates on resume; idempotent sinks handle this. Most connectors checkpoint after a confirmed sink write, not after a read.
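
A minimal sketch of the "missed flush" case, assuming an in-memory list as the sink and a dict as the checkpoint store (both hypothetical stand-ins for real durable storage). The consumer crashes after the sink write for position 2 but before the checkpoint advances, so the restart replays that event:

```python
def run(events, sink, state, crash_before_flush_at=None):
    """Apply events past state['checkpoint']; advance the checkpoint after each sink write."""
    for position, payload in events:
        if position <= state["checkpoint"]:
            continue                      # already applied before the restart
        sink.append(payload)              # the confirmed sink write
        if position == crash_before_flush_at:
            raise RuntimeError("crashed before checkpoint flush")
        state["checkpoint"] = position    # a real consumer persists this durably


log = [(1, "a"), (2, "b"), (3, "c")]
sink, state = [], {"checkpoint": 0}

try:
    run(log, sink, state, crash_before_flush_at=2)
except RuntimeError:
    pass                                  # simulate the process dying

run(log, sink, state)                     # restart: replay resumes after checkpoint 1
```

After the restart the sink holds "b" twice, which is exactly the duplicate an idempotent sink is expected to absorb.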

# Compaction

Kafka's "keep the latest value per key" retention mode. Combined with tombstones it gives you a materialized view of state on the topic that a late-joining consumer can rebuild without replaying every historical write — useful for downstream state stores. The gotcha: until compaction actually runs (it's a background process, not instant), readers still see every intermediate value; until delete.retention.ms elapses, tombstones linger.
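
To make "latest value per key" concrete, here is a toy model of compaction over a list of (key, value) pairs. It is not Kafka's implementation (real compaction works on log segments in the background); `compact` and `materialize` are hypothetical helper names. A value of None stands in for a tombstone:

```python
def compact(log):
    """Keep only the latest record per key, preserving the survivors' log order."""
    latest_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [rec for offset, rec in enumerate(log) if latest_offset[rec[0]] == offset]


def materialize(log):
    """Rebuild current state by replaying the log; None deletes the key."""
    state = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


log = [("u1", "v1"), ("u2", "v1"), ("u1", "v2"), ("u2", None)]
compacted = compact(log)
```

A late-joining consumer that replays `compacted` ends up with the same state as one that replayed the full `log`, while reading fewer records.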

# Dead-letter queue (also: DLQ)

A side topic / table where the connector parks events it can't process — typically schema mismatches, deserialization failures, or sink errors. Without a DLQ, the connector either drops the event (data loss) or stalls the whole partition (head-of-line blocking). With one, the bad records are visible and triageable.
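
A rough sketch of the routing decision, assuming JSON-encoded messages and a plain list as the dead-letter store (in practice this would be a topic or table). The `process` and `apply` names are illustrative, not part of any connector API:

```python
import json


def process(raw_messages, apply, dead_letters):
    """Apply each message; park failures with enough context to triage them later."""
    for offset, raw in enumerate(raw_messages):
        try:
            apply(json.loads(raw))
        except (json.JSONDecodeError, KeyError) as exc:
            dead_letters.append({"offset": offset, "raw": raw, "error": repr(exc)})


rows = {}


def apply(event):
    rows[event["id"]] = event["name"]     # raises KeyError on a schema mismatch


dlq = []
messages = ['{"id": 1, "name": "a"}', "not json", '{"id": 2}', '{"id": 3, "name": "c"}']
process(messages, apply, dlq)
```

The two bad records land in `dlq` and the good record behind them is still applied, so the partition neither stalls nor silently drops data.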

# Effectively-once

The pragmatic alternative to exactly-once: the source is at-least-once, but the sink's idempotent writes plus a durable event_id ledger collapse duplicates so the observable end state matches an exactly-once delivery. This is what most production CDC pipelines actually ship.
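
One way the ledger idea can look, sketched with SQLite from the Python standard library. The table names, the `apply_once` helper, and the event shape are assumptions for illustration. The ledger insert and the state change share one transaction, so a replayed event_id fails the primary-key check and the whole change is skipped:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
conn.execute("INSERT INTO balances VALUES ('acct-1', 0)")
conn.commit()


def apply_once(event):
    """Record the event_id and apply the change atomically; return False on a replay."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO ledger VALUES (?)", (event["event_id"],))
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (event["delta"], event["account"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False                      # event_id already in the ledger


events = [
    {"event_id": "e1", "account": "acct-1", "delta": 10},
    {"event_id": "e2", "account": "acct-1", "delta": 5},
]
results = [apply_once(e) for e in events + events]   # second pass is the replay
balance = conn.execute("SELECT amount FROM balances").fetchone()[0]
```

The increment here is deliberately not idempotent on its own; the ledger is what keeps the replay from doubling the balance.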

# Exactly-once

The (mostly aspirational) guarantee that every source event lands in the sink exactly once, with no duplicates and no drops. End-to-end exactly-once across a CDC pipeline requires coordinated transactions in the source, the broker, and the sink — most stacks settle for "effectively-once" (at-least-once delivery + idempotent sinks + a deduplication ledger). The errata page covers the specific traps.

# Idempotent write

An operation that produces the same end state regardless of how many times it's replayed. Keyed MERGE / UPSERT on a stable primary key is the most common pattern. Without idempotency, at-least-once delivery from the source amplifies into duplicate rows in the sink on every connector restart.
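
A small sketch of a keyed upsert, using SQLite's `INSERT ... ON CONFLICT DO UPDATE` (available in SQLite 3.24 and later) as a stand-in for whatever MERGE / UPSERT your sink offers. The `users` table and `upsert` helper are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")


def upsert(event):
    """Insert the row, or overwrite it if the primary key already exists."""
    conn.execute(
        "INSERT INTO users (id, email) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
        (event["id"], event["email"]),
    )


batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": "a2@example.com"},
]
for event in batch + batch:               # the second pass simulates a replay
    upsert(event)

rows = conn.execute("SELECT id, email FROM users").fetchall()
```

Replaying the batch in order leaves a single row with the latest email, where a plain INSERT would have failed or duplicated.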

# LSN / SCN (also: LSN, SCN, log sequence number, system change number)

A monotonically increasing identifier the database assigns to each log position. Postgres calls it the LSN (Log Sequence Number); Oracle calls it the SCN (System Change Number); MySQL's GTID + binlog coordinates and the __$start_lsn column in SQL Server's change tables play a similar role. CDC consumers checkpoint progress as the last applied LSN/SCN so a restart resumes from a known log position without gaps; duplicates are still possible if the checkpoint trails the last sink write.
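
Postgres prints an LSN as two hexadecimal numbers separated by a slash (for example 16/B374D848), the high and low 32 bits of a 64-bit position. Comparing those strings directly gives wrong answers, so a consumer that stores LSNs as text usually converts them first. A minimal sketch, with `parse_pg_lsn` as a hypothetical helper name:

```python
def parse_pg_lsn(text):
    """Convert Postgres's textual LSN ('hi/lo' in hex) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


checkpoint = parse_pg_lsn("9/FFFFFFFF")
incoming = parse_pg_lsn("16/B374D848")
is_new = incoming > checkpoint            # numeric comparison, not string comparison
```

As strings, "16/B374D848" sorts before "9/FFFFFFFF", which would make the consumer skip a change it has not applied yet.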

# Lag

The time between a source commit and the corresponding sink apply. Track p50/p95/p99, not just the mean — CDC lag is bursty (DDL, large transactions, consumer restarts), and the average hides the tail you actually have to capacity-plan against.
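
To see why the mean misleads, here is a sketch with made-up lag samples in seconds: most events apply in about a second, but one in ten sits behind a large transaction. It uses `statistics.quantiles` from the standard library (Python 3.8+):

```python
import statistics

# Hypothetical lag samples in seconds: 90 fast applies, 10 stuck behind a burst
lags = [1.0] * 90 + [60.0] * 10

mean = statistics.mean(lags)
cuts = statistics.quantiles(lags, n=100, method="inclusive")   # 99 cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
```

The mean lands in the single digits while p95 and p99 sit at the full 60 seconds, which is the number capacity planning has to cover.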

# Log retention

How long the source database keeps WAL/binlog/redo segments before recycling them. If a CDC consumer falls behind and its checkpoint LSN ages out, the connector can't resume; it has to bootstrap with a full snapshot. Tune retention to cover your worst-case consumer outage plus a margin. On Postgres, wal_keep_size sets a floor on how much past WAL is kept, a replication slot is stricter (it holds WAL until the consumer confirms it, which can fill the disk if the consumer stalls), and max_slot_wal_keep_size caps how much WAL a slot may retain.

# Partition key

The field whose hash decides which Kafka partition (or equivalent) a message lands on. Ordering is guaranteed within a partition, and because every event with the same key hashes to the same partition, it holds per key but not across keys: if you partition by user_id, every event for a given user is ordered, but events across users are not. Choose the key to match the unit of ordering your downstream actually needs.
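
A sketch of key-to-partition routing. Kafka's Java client hashes key bytes with murmur2; `zlib.crc32` is used here only as a readily available stand-in, and `partition_for` is a hypothetical helper. The point is that the mapping is deterministic, so one key always lands on one partition:

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Stable hash of the key, modulo the partition count (crc32 is a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("user-1", "created"),
    ("user-2", "created"),
    ("user-1", "updated"),
    ("user-1", "deleted"),
]

partitions = defaultdict(list)
for key, op in events:
    partitions[partition_for(key, 4)].append((key, op))

# Every user-1 event sits in one partition, in the order it was produced
user1_partition = partitions[partition_for("user-1", 4)]
user1_ops = [op for key, op in user1_partition if key == "user-1"]
```

Nothing here says how user-1's events are ordered relative to user-2's; if they land on different partitions, that relationship is undefined.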

# Schema evolution

The discipline of versioning the event payload's shape so producers can add fields without breaking consumers. Additive changes (new optional fields) are safe; renames and removals require a migration window with both the old and new field present. Schema Registry (Confluent, Karapace, AWS Glue) enforces compatibility rules at produce time.
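
A toy version of the "additive changes only" rule, to show what a compatibility check is looking for. This is not how any particular Schema Registry implements it; the field-spec shape and the `is_additive` helper are assumptions for illustration:

```python
def is_additive(old_fields, new_fields):
    """True if new_fields keeps every old field unchanged and only adds optional ones.

    Both arguments map field name -> {"type": str, "required": bool}.
    """
    for name, spec in old_fields.items():
        if name not in new_fields or new_fields[name]["type"] != spec["type"]:
            return False                  # removal, rename, or type change
    added = set(new_fields) - set(old_fields)
    return all(not new_fields[name]["required"] for name in added)


v1 = {"id": {"type": "int", "required": True}, "email": {"type": "string", "required": True}}
v2_optional = {**v1, "nickname": {"type": "string", "required": False}}
v2_renamed = {"id": v1["id"], "email_address": {"type": "string", "required": True}}
v2_required = {**v1, "tenant": {"type": "string", "required": True}}
```

Here `v2_optional` passes, while the rename and the new required field are rejected; those are the changes that need a migration window.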

# Snapshot (also: initial snapshot, incremental snapshot)

The bootstrap process: read the current state of every row, emit it as change events, then switch to streaming the log. Initial snapshots can interleave with live changes, so design consumers to reconcile by version column or op_ts. Incremental snapshots (signal-based, popularized by Debezium) let you re-snapshot a subset without taking down the whole connector, at the cost of potential duplicates the sink has to dedupe.
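
A minimal sketch of reconciling by version column, assuming each event carries a monotonically increasing `version` for its row (the event shape and `apply_if_newer` helper are hypothetical). A streamed update arrives before the older snapshot read of the same row, and the sink keeps the newer one:

```python
def apply_if_newer(state, event):
    """Apply the event only if its version beats what the sink already holds."""
    current = state.get(event["id"])
    if current is not None and current["version"] >= event["version"]:
        return False                      # stale snapshot read or replayed duplicate
    state[event["id"]] = event
    return True


state = {}
arrivals = [
    {"id": 7, "version": 2, "source": "stream", "email": "new@example.com"},
    {"id": 7, "version": 1, "source": "snapshot", "email": "old@example.com"},
]
outcomes = [apply_if_newer(state, e) for e in arrivals]
```

The same guard also absorbs the duplicates an incremental snapshot can produce, since a repeated version never overwrites anything.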

# Tombstone

A Kafka message with a null value and a populated key, used by log compaction as the "delete this key" marker. In Debezium and similar CDC producers, a source-row delete is carried by the normal change event (the envelope's op field is "d" and the before block holds the pre-delete row); the tombstone is an optional follow-up message that lets a compacted topic eventually drop the key. Sinks read the delete from the change event, not from the tombstone. Tombstones need a non-zero delete.retention.ms long enough for every consumer to see them before compaction reclaims the slot.
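
A sketch of how a sink might tell the two messages apart, assuming a Debezium-style envelope with `op`, `before`, and `after` fields and a dict as the target table. The `apply_record` helper is illustrative, not a Debezium API:

```python
def apply_record(table, key, value):
    """Apply one keyed record to the sink table and report what happened."""
    if value is None:
        return "tombstone-ignored"        # compaction marker; the delete already arrived
    op = value["op"]
    if op == "d":
        table.pop(key, None)              # the delete is carried by the change event
        return "deleted"
    if op in ("c", "u", "r"):             # create, update, snapshot read
        table[key] = value["after"]
        return "upserted"
    raise ValueError(f"unhandled op: {op!r}")


table = {}
results = [
    apply_record(table, 1, {"op": "c", "before": None, "after": {"id": 1, "name": "a"}}),
    apply_record(table, 1, {"op": "d", "before": {"id": 1, "name": "a"}, "after": None}),
    apply_record(table, 1, None),         # the follow-up tombstone
]
```

The row is gone after the second record; the third record changes nothing in the sink and exists only so a compacted topic can eventually drop the key.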

# WAL / Redo log (also: WAL, binlog, redo log, T-log)

The database's append-only, durable record of every write, flushed to disk before the change counts as committed. Log-based CDC tools tail this log instead of polling the source tables: Postgres calls it the WAL (Write-Ahead Log), MySQL the binlog, Oracle the redo log, SQL Server the transaction log (T-log). All four serve the same role for CDC: a durable, ordered source from which every committed change can be read.

Missing a term?

The glossary is data-driven — add entries to src/_data/glossary.mjs and the page picks them up at build time. Anchor IDs are stable: don't rename a slug after it's been linked to externally.