Advanced

CDC Nuances & Errata

A living list of precise clarifications, corrections, and gotchas. Delivery guarantees, snapshots, and backfills all hide sharp edges; treat this list as a pre‑flight check before promising exactly‑once delivery or “zero data loss.”

End‑to‑end exactly‑once vs. effectively‑once

  • Source connectors are at‑least‑once in practice: after a crash or restart they can re‑emit events that were already delivered. You can achieve effectively‑once by making sinks idempotent (UPSERT on stable keys) and/or deduplicating using a durable event_id ledger.
  • Transaction boundaries: A source database transaction is emitted as individual change events, often spread across several topics. The sink does not apply them as one transaction unless your pipeline coordinates that explicitly, so don’t assume cross‑topic atomicity.
  • Per‑key ordering only: Ordering is guaranteed per partition key (primary/business key). Cross‑entity ordering is not guaranteed and shouldn’t be relied upon.
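
The idempotent-sink idea above can be sketched as a version-guarded UPSERT. This is a minimal illustration that uses Python's built-in sqlite3 module (SQLite 3.24 or newer for ON CONFLICT) as a stand-in sink. The customer table, its version column, and the apply_change helper are hypothetical names chosen for the example, not part of any connector API.

```python
import sqlite3


def apply_change(conn, key, payload, version):
    """Idempotent UPSERT: replayed or stale events leave the row untouched."""
    conn.execute(
        """
        INSERT INTO customer (id, payload, version)
        VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE
        SET payload = excluded.payload, version = excluded.version
        WHERE excluded.version > customer.version
        """,
        (key, payload, version),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "id INTEGER PRIMARY KEY, payload TEXT, version INTEGER NOT NULL)"
)

# The same key arrives four times: two real changes, one duplicate, one stale replay.
for event in [(1, "a", 1), (1, "b", 2), (1, "b", 2), (1, "a", 1)]:
    apply_change(conn, *event)
conn.commit()

rows = conn.execute("SELECT id, payload, version FROM customer").fetchall()
print(rows)
```

Because the update only fires when the incoming version is strictly higher, applying the same event twice, or applying an older event late, has no effect on the stored row.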

Snapshots, backfills, and replays

  • Initial snapshots can interleave with live changes; design consumers to reconcile using version columns or op_ts.
  • Incremental snapshots (signal‑based) are powerful but can produce duplicates; always merge on keys + highest version and keep idempotent sinks.
  • Backfills are just another snapshot; isolate them on dedicated topics or mark them with headers so downstream aggregates don’t double‑count.
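
The "merge on keys + highest version" rule above can be sketched in a few lines. This is an illustrative in-memory version, assuming each event is a dict with hypothetical key, version, and row fields; a real consumer would do the same comparison against its sink.

```python
def merge_latest(*sources):
    """Merge events from several sources, keeping the highest version per key.

    Duplicates and stale snapshot rows collapse to a single winner per key,
    regardless of the order in which the sources are read.
    """
    latest = {}
    for events in sources:
        for event in events:
            current = latest.get(event["key"])
            if current is None or event["version"] > current["version"]:
                latest[event["key"]] = event
    return latest


# Snapshot rows read at some earlier point in time.
snapshot = [
    {"key": 1, "version": 3, "row": {"name": "old"}},
    {"key": 2, "version": 1, "row": {"name": "bob"}},
]
# Live changes that interleaved with the snapshot, including one duplicate.
live = [
    {"key": 1, "version": 4, "row": {"name": "new"}},
    {"key": 1, "version": 4, "row": {"name": "new"}},
]

merged = merge_latest(snapshot, live)
print(merged[1]["row"], merged[2]["row"])
```

The strict greater-than comparison is what makes the merge order-independent: reading the live stream before the snapshot produces the same result.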

Tombstones, compaction, and deletions

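  • Tombstones are required for compaction to fully remove keys. Set delete.retention.ms to a non‑zero value that exceeds your slowest consumer’s worst‑case lag, so every consumer reads the tombstone before compaction discards it.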
  • Soft‑delete models still need physical deletes or a retention strategy in the sink to avoid data resurrection.
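
On the consumer side, tombstone handling can be sketched as follows. This is a simplified illustration, assuming records arrive as (key, value) pairs and that a value of None represents a tombstone; apply_to_view and the dict-based view are hypothetical stand-ins for a real sink.

```python
def apply_to_view(view, key, value):
    """Apply one record to a materialized view; a None value is a tombstone."""
    if value is None:
        # Deleting a key that is already gone is a no-op, so replays are safe.
        view.pop(key, None)
    else:
        view[key] = value


view = {}
records = [
    ("a", {"n": 1}),
    ("b", {"n": 2}),
    ("a", None),  # tombstone for key "a"
    ("a", None),  # duplicate tombstone from an at-least-once replay
]
for key, value in records:
    apply_to_view(view, key, value)

print(view)
```

A consumer that never sees the tombstone keeps the old value for that key, which is why the tombstone must outlive the slowest consumer's lag.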

Schema evolution

  • Always version payload schemas and keep consumers tolerant to additive changes. No producer should introduce breaking field removals/renames without a migration window.
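  • Add new fields as optional with default values, and do so judiciously, so that an event written under one schema version never becomes an undeserializable “poison” event for a consumer on another.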
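
A consumer that tolerates additive changes might look like the sketch below. The Customer type and its fields are hypothetical; the point is only that unknown fields are ignored and missing optional fields fall back to defaults, while a missing required field still fails loudly.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Customer:
    id: int
    email: str
    tier: str = "standard"       # added later; the default keeps old events readable
    phone: Optional[str] = None  # optional field


_KNOWN_FIELDS = {f.name for f in fields(Customer)}


def parse_customer(payload: dict) -> Customer:
    """Tolerant reader: drop unknown fields, default the missing optional ones."""
    known = {k: v for k, v in payload.items() if k in _KNOWN_FIELDS}
    return Customer(**known)


# An event written before "tier" existed.
old_event = parse_customer({"id": 1, "email": "a@example.com"})
# An event from a newer producer with a field this consumer does not know yet.
new_event = parse_customer(
    {"id": 2, "email": "b@example.com", "tier": "gold", "loyalty_points": 10}
)
print(old_event, new_event)
```

Removing or renaming a required field such as email would raise a TypeError here, which is the kind of breaking change that needs a migration window.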

Operational guardrails

  • Watch lag and throughput alongside max.queue.size, max.batch.size, and poll.interval.ms. Oversizing the queue without raising the heap to match leads to GC stalls.
  • Alert on connector task failures, restart loops, and earliest‑offset resets.
  • Prefer per‑table connectors for isolation when SLAs differ; otherwise group by SLA and update cadence.
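
The queue-versus-heap concern above can be turned into a rough pre-deployment check. This is only a sketch: the property names come from the list above, but check_queue_settings, the average event size, and the 25% heap budget are illustrative assumptions to be replaced with your own measurements.

```python
def check_queue_settings(config, avg_event_bytes, heap_bytes, max_heap_fraction=0.25):
    """Return warnings for queue and batch settings (thresholds are illustrative)."""
    warnings = []
    queue_size = int(config["max.queue.size"])
    batch_size = int(config["max.batch.size"])

    # The queue should be able to hold more than one batch.
    if queue_size <= batch_size:
        warnings.append("max.queue.size should be larger than max.batch.size")

    # Rough worst case: a full queue of average-sized events held in memory.
    if queue_size * avg_event_bytes > heap_bytes * max_heap_fraction:
        warnings.append("a full queue may exceed the heap budget; expect GC pressure")

    return warnings


one_gib = 1024 ** 3
modest = {"max.queue.size": "8192", "max.batch.size": "2048"}
oversized = {"max.queue.size": "1000000", "max.batch.size": "2048"}

print(check_queue_settings(modest, avg_event_bytes=2048, heap_bytes=one_gib))
print(check_queue_settings(oversized, avg_event_bytes=2048, heap_bytes=one_gib))
```

The estimate ignores object overhead and serialization buffers, so treat a clean result as a lower bound on memory use, not a guarantee.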

Recommended dedup ledger

-- event_id is a stable UUID in headers or payload
create table if not exists dw.processed_event(
  event_id text primary key,
  processed_at timestamptz not null default now()
);

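On ingest, record the event_id in the same transaction as the sink write. If the insert affects zero rows, the event was already processed and should be skipped:

insert into dw.processed_event(event_id)
values (:event_id)
on conflict do nothing;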
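
The ledger pattern can be sketched end to end as follows. This illustration uses Python's built-in sqlite3 module in place of the warehouse, so the dw schema prefix and timestamptz type are dropped; process_once, the account table, and the credit callback are hypothetical names for the example.

```python
import sqlite3


def process_once(conn, event_id, apply):
    """Run apply(conn) only the first time event_id is seen; return True if applied."""
    # One transaction: the ledger insert and the sink write commit or roll back together.
    with conn:
        cur = conn.execute(
            "INSERT INTO processed_event (event_id) VALUES (?) ON CONFLICT DO NOTHING",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # already in the ledger: duplicate delivery, skip it
        apply(conn)
        return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_event ("
    "event_id TEXT PRIMARY KEY, "
    "processed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
)
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO account VALUES (1, 0)")
conn.commit()


def credit(c):
    # A deliberately non-idempotent side effect, to show why the ledger matters.
    c.execute("UPDATE account SET balance = balance + 10 WHERE id = 1")


results = [process_once(conn, "evt-1", credit), process_once(conn, "evt-1", credit)]
balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
print(results, balance)
```

Keeping both statements in one transaction matters: if the sink write fails, the ledger row rolls back with it, so a retry of the same event is not mistaken for a duplicate.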