CDC Nuances & Errata
A living list of precise clarifications, corrections, and gotchas: delivery guarantees, snapshots, and backfills all hide sharp edges. Treat it as a pre-flight checklist before promising exactly-once or “zero data loss.”
End‑to‑end exactly‑once vs. effectively‑once
- Source connectors are at-least-once in practice. You can achieve effectively-once by making sinks idempotent (UPSERT on stable keys) and/or deduplicating using a durable event_id ledger.
- Transaction boundaries: Database transactions map to change streams, not to sink transactions unless your pipeline coordinates them. Don’t assume cross-topic atomicity.
- Per‑key ordering only: Ordering is guaranteed per partition key (primary/business key). Cross‑entity ordering is not guaranteed and shouldn’t be relied upon.
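As a minimal sketch of why an idempotent sink turns at-least-once delivery into effectively-once results, the following replays part of a batch (as a connector might after a restart) and checks that the final state is unchanged. The event shape (`key`, `op`, `after`) and the helper names are assumptions made for illustration, not a specific connector's format.

```python
from typing import Any, Optional

Table = dict[str, dict[str, Any]]

def apply_event(table: Table, event: dict[str, Any]) -> None:
    """Apply one change event as an idempotent UPSERT/DELETE on a stable key."""
    if event["op"] == "d":
        table.pop(event["key"], None)               # deleting a missing row is a no-op
    else:
        table[event["key"]] = dict(event["after"])  # full-row overwrite, never an increment

def run(events: list[dict[str, Any]], redeliver_from: Optional[int] = None) -> Table:
    """Apply all events once, then optionally re-apply a suffix to simulate redelivery."""
    table: Table = {}
    for event in events:
        apply_event(table, event)
    if redeliver_from is not None:
        for event in events[redeliver_from:]:
            apply_event(table, event)
    return table

# Hypothetical events; per-key order is preserved, as it would be within one partition.
events = [
    {"key": "order-1", "op": "u", "after": {"status": "new"}},
    {"key": "order-2", "op": "u", "after": {"status": "new"}},
    {"key": "order-1", "op": "u", "after": {"status": "paid"}},
    {"key": "order-2", "op": "d", "after": None},
]

once = run(events)
redelivered = run(events, redeliver_from=1)
```

Both runs converge on the same table. Note that during the replay the sink briefly shows older values again (order-2 reappears before it is deleted a second time), so readers that need monotonic views would still want a version check or a dedup ledger.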
Snapshots, backfills, and replays
- Initial snapshots can interleave with live changes; design consumers to reconcile using version columns or op_ts.
- Incremental snapshots (signal-based) are powerful but can produce duplicates; always merge on keys + highest version and keep idempotent sinks.
- Backfills are just another snapshot; isolate them (topics/headers) to avoid double‑counting.
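The "merge on keys + highest version" rule can be sketched as below. The row shape and the `version` field are assumptions; in practice the version might be a row-version column or op_ts, provided it increases monotonically per key.

```python
from typing import Any, Iterable

def merge_highest_version(rows: Iterable[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Keep, for each key, the row with the highest version seen so far."""
    state: dict[str, dict[str, Any]] = {}
    for row in rows:
        current = state.get(row["key"])
        if current is None or row["version"] > current["version"]:
            state[row["key"]] = row
    return state

# Hypothetical interleaving: a live change (v7) arrives before the snapshot row (v5)
# for the same key, and an incremental snapshot emits key "b" twice.
interleaved = [
    {"key": "a", "version": 7, "name": "Ann B."},  # live change
    {"key": "a", "version": 5, "name": "Ann"},     # stale snapshot row
    {"key": "b", "version": 2, "name": "Bob"},     # snapshot row
    {"key": "b", "version": 2, "name": "Bob"},     # duplicate from a re-read chunk
]

merged = merge_highest_version(interleaved)
merged_reversed = merge_highest_version(reversed(interleaved))
```

Because the comparison depends only on key and version, the result does not depend on arrival order, which is the property you want when snapshot and streaming reads overlap. Deletes are left out of this sketch; they need a versioned delete marker so that a stale snapshot row cannot bring a deleted key back.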
Tombstones, compaction, and deletions
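- Tombstones are required for compaction to fully remove keys; set delete.retention.ms to a non-zero value that exceeds the longest time a consumer may lag or stay offline, so every consumer sees the tombstone before it is purged.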
- Soft‑delete models still need physical deletes or a retention strategy in the sink to avoid data resurrection.
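To illustrate why the tombstone retention window matters, here is a deliberately simplified model of log compaction. It is not how a broker implements cleaning (real timing depends on segment cleaning, among other things), and the record shape and function names are assumptions for this sketch.

```python
from typing import Any, Optional

Record = dict[str, Any]  # {"offset": int, "key": str, "value": Optional[str], "ts_ms": int}

def compact(log: list[Record], now_ms: int, delete_retention_ms: int) -> list[Record]:
    """Toy compaction: keep the last record per key, drop tombstones past retention."""
    latest: dict[str, Record] = {}
    for rec in log:
        latest[rec["key"]] = rec
    survivors = sorted(latest.values(), key=lambda r: r["offset"])
    return [
        rec for rec in survivors
        if rec["value"] is not None or now_ms - rec["ts_ms"] <= delete_retention_ms
    ]

def consume(state: dict[str, Optional[str]], records: list[Record], from_offset: int) -> dict:
    """Resume from an offset; a None value (tombstone) deletes the key."""
    for rec in records:
        if rec["offset"] < from_offset:
            continue
        if rec["value"] is None:
            state.pop(rec["key"], None)
        else:
            state[rec["key"]] = rec["value"]
    return state

log = [
    {"offset": 0, "key": "a", "value": "v1", "ts_ms": 0},
    {"offset": 1, "key": "b", "value": "v1", "ts_ms": 0},
    {"offset": 2, "key": "a", "value": None, "ts_ms": 1_000},  # tombstone for "a"
]

# Both consumers had applied offsets 0 and 1 before pausing.
timely = consume({"a": "v1", "b": "v1"}, compact(log, now_ms=2_000, delete_retention_ms=5_000), 2)
late = consume({"a": "v1", "b": "v1"}, compact(log, now_ms=10_000, delete_retention_ms=5_000), 2)
```

The consumer that resumes within the retention window sees the tombstone and deletes "a". The one that resumes after the window never sees it and keeps a stale row, which is the failure mode a sufficiently long delete.retention.ms is meant to prevent.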
Schema evolution
- Always version payload schemas and keep consumers tolerant to additive changes. No producer should introduce breaking field removals/renames without a migration window.
- Use optional + default values judiciously to avoid “poison” events.
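One way a consumer might stay tolerant to additive changes is sketched below: unknown fields are ignored, newly added fields carry defaults, and a missing required field fails loudly so the event can be routed aside instead of silently defaulted. The `Customer` type and its fields are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional

@dataclass
class Customer:
    id: int
    email: str
    tier: str = "standard"          # added later, with a default for older events
    referrer: Optional[str] = None  # optional, may be absent

def parse_customer(payload: dict[str, Any]) -> Customer:
    """Build a Customer from a payload, ignoring fields this consumer does not know."""
    known = {f.name for f in fields(Customer)}
    return Customer(**{k: v for k, v in payload.items() if k in known})

old_event = parse_customer({"id": 1, "email": "a@example.com"})
new_event = parse_customer({"id": 2, "email": "b@example.com", "tier": "gold", "region": "eu"})

try:
    parse_customer({"id": 3})  # required field removed upstream
    missing_required_rejected = False
except TypeError:
    missing_required_rejected = True  # candidate for a dead-letter topic
```

Defaults are the judgment call here: a default on `tier` is harmless, but a default on a field such as an amount or a key would hide a broken producer rather than surface it.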
Operational guardrails
- Watch lag/throughput and max.queue.size, max.batch.size, poll.interval.ms. Oversizing the queue without heap tuning yields GC stalls.
- Alert on connector task failures, restart loops, and earliest-offset resets.
- Prefer per‑table connectors for isolation when SLAs differ; otherwise group by SLA and update cadence.
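A rough pre-deployment check for the queue settings named above might look like this. The property names come from the text; the average event size, the heap size, and the 25% heap budget are illustrative assumptions, not recommended values.

```python
def check_queue_settings(
    config: dict[str, int],
    avg_event_bytes: int,
    heap_bytes: int,
    max_heap_fraction: float = 0.25,  # assumed budget for the in-memory queue
) -> list[str]:
    """Return human-readable warnings for queue settings that look risky."""
    warnings: list[str] = []
    queue = config["max.queue.size"]
    batch = config["max.batch.size"]
    if queue <= batch:
        warnings.append("max.queue.size should be larger than max.batch.size")
    estimated_bytes = queue * avg_event_bytes
    if estimated_bytes > heap_bytes * max_heap_fraction:
        warnings.append(
            f"a full queue may hold ~{estimated_bytes // 2**20} MiB, "
            f"more than {max_heap_fraction:.0%} of the heap"
        )
    if config["poll.interval.ms"] <= 0:
        warnings.append("poll.interval.ms should be positive")
    return warnings

GIB = 2**30
modest = check_queue_settings(
    {"max.queue.size": 8192, "max.batch.size": 2048, "poll.interval.ms": 500},
    avg_event_bytes=1024, heap_bytes=GIB,
)
oversized = check_queue_settings(
    {"max.queue.size": 1_000_000, "max.batch.size": 2048, "poll.interval.ms": 500},
    avg_event_bytes=2048, heap_bytes=GIB,
)
```

The estimate ignores object overhead and assumes a uniform event size, so treat it as a lower bound on memory pressure and confirm with real heap and GC metrics.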
Recommended dedup ledger
-- event_id is a stable UUID in headers or payload
create table if not exists dw.processed_event(
event_id text primary key,
processed_at timestamptz not null default now()
);
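On ingest, in the same transaction as the sink write:
insert into dw.processed_event(event_id)
values (:event_id)
on conflict (event_id) do nothing;
-- 0 rows inserted means the event was already processed: skip the sink write.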
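The ledger only helps if the ledger insert and the sink write commit together. The sketch below shows that pattern using Python's built-in sqlite3 as a stand-in for the warehouse (it needs SQLite 3.24 or newer for the ON CONFLICT clause); the `account_balance` table and the `apply_once` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table processed_event(
        event_id text primary key,
        processed_at text not null default current_timestamp
    );
    create table account_balance(
        account_id text primary key,
        balance integer not null
    );
""")

def apply_once(conn: sqlite3.Connection, event_id: str, account_id: str, delta: int) -> bool:
    """Record the event and apply its effect atomically; return False for duplicates."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute(
            "insert into processed_event(event_id) values (?) on conflict do nothing",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # already in the ledger: skip the side effect
        # A non-idempotent write (increment) that must not be applied twice.
        conn.execute(
            "insert into account_balance(account_id, balance) values (?, ?) "
            "on conflict(account_id) do update set balance = balance + excluded.balance",
            (account_id, delta),
        )
        return True

results = [
    apply_once(conn, "evt-1", "acct-1", 10),
    apply_once(conn, "evt-1", "acct-1", 10),  # redelivered duplicate
    apply_once(conn, "evt-2", "acct-1", 5),
]
balance = conn.execute(
    "select balance from account_balance where account_id = ?", ("acct-1",)
).fetchone()[0]
```

If the ledger insert and the sink write were committed separately, a crash between the two could either lose the event or apply it twice, so the single transaction is the part worth keeping when adapting this to a real sink.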