Intermediate

The First Hurdle: Snapshotting

Initial loads are where most CDC rollouts stumble. Follow this guided playbook to capture a consistent snapshot, hold the boundary, and merge the stream without gaps.

Start with a clean handoff

80% of CDC escalations originate during the initial load. Teams rush the boundary, purge logs, and spend weeks replaying tables.

Use these three moves to align engineering, operations, and analytics before the stream goes live.

Prove it in 15 minutes β†’

High-watermark boundary handoff

  1. Mark high-watermark Record LSN/SCN boundary
  2. Consistent read Snapshot without gaps
  3. Stream after boundary Apply idempotently

Snapshot β†’ high-watermark β†’ stream. Preserve per-key order and use idempotent upserts/deletes at the sink.

  • Boundary & retention.

    Set the log high-watermark, confirm retention covers the run, and name an owner for recovery points.

  • Execution made safe.

    Pick the right snapshot mode, chunk large tables, and cap parallelism to respect OLTP workloads.

  • Proof & recovery.

    Run the quick lab, monitor offsets and lag, and document restart checkpoints.

Pre-flight checklist

Confirm these guardrails before the snapshot starts rolling.

View checklist
  • Retention window covers full snapshot duration (WAL/binlog/T-log/UNDO).
  • Privileges for log reading/replication are granted.
  • Consistent-read isolation is enabled where applicable (REPEATABLE READ/SNAPSHOT).
  • Sink MERGE/UPSERT implemented; delete/tombstone semantics defined.
  • Offsets are durable and backed up/versioned.
  • DDL freeze or schema-compatibility policy in place.

Why the boundary wins

Run style What happens mid-snapshot Recovery story
Naive initial load Changes land twice or not at all; log position is unknown. Full re-run of large tables; risky ad-hoc dedupe scripts.
Snapshot with boundary Log handoff preserves ordering; replays apply idempotently. Resume from last chunk + saved offset; no target surgery.
Incremental snapshot Primary-key ranges throttle load while respecting OLTP. Retries pick up the next range; log anchor stays valid.

A CDC pipeline is designed to stream ongoing changes, but it must first be initialized with the data that already exists in the source tables. This process of performing an initial, full copy of the data is known as snapshotting or the initial load.

Why snapshots break

The central problem is consistency. Without a boundary, writes that land mid-run show up twice (snapshot + stream) or disappear entirely. Retention clocks keep ticking and, if the log purges, the only fix is to start over.

Picture a payments table: you start scanning, finance posts refunds, and analytics reconciles against a different truth hours later. The fallout is expensive clean-up, tense incident calls, and a loss of trust in the pipeline.

A staged handoff

Modern log-based CDC tools coordinate a point-in-time snapshot with the transaction log so the handoff to streaming has no gaps or double-counts. Treat the run in three phases:

Snapshot flow diagram showing three phases: marking high watermark, performing consistent read snapshot, and streaming changes from the boundary
The three-phase handoff: mark boundary β†’ snapshot data β†’ stream from boundary

Before snapshot: mark the high-watermark

Read and persist the current log position (LSN/SCN). Share it in the runbook and verify retention outlives the planned duration.

  • Capture boundary metadata alongside connector configuration.
  • Alert on log growth so retention doesn’t expire mid-run.

During snapshot: hold a consistent read

Use snapshot/consistent-read isolation and scan tables in deterministic chunks. The goal is predictable load that won’t pin OLTP locks or exhaust undo segments.

  • Chunk big tables in primary-key order and cap parallelism.
  • Read from a replica when possible; monitor lag to stay honest.

After snapshot: stream from the boundary

Start consuming the log from the recorded position. Any rows that change post-snapshot must apply idempotently so retries are safe.

  • Implement UPSERT/MERGE semantics and explicit delete handling.
  • Version and store offsets so restart is a click, not a rebuild.

Snapshot Lifecycle

Run it on your database

Grab the snippet for your engine to capture the boundary and hold a consistent read while Debezium (or another connector) walks the tables.

PostgreSQL
-- Requirements (once/admin):
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 10;  -- adjust
SELECT pg_reload_conf();

-- Boundary + consistent read (within one tx while snapshotting):
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_current_wal_lsn();  -- record this
-- Keep transaction open during table scans; Debezium coordinates via logical slot.
MySQL (InnoDB)
-- Requirements:
-- binlog_format=ROW; binlog_row_image=FULL; GTID (preferred)

-- Boundary + consistent snapshot:
SHOW MASTER STATUS;                  -- or SELECT @@global.gtid_executed;
START TRANSACTION WITH CONSISTENT SNAPSHOT;
-- Read tables in PK order; Debezium avoids FTWRL in OLTP paths.
SQL Server
-- Requirements (once/admin):
ALTER DATABASE YourDb SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;
EXEC sys.sp_cdc_enable_db;  -- if using native CDC

-- Boundary + snapshot:
SELECT sys.fn_cdc_get_max_lsn();     -- record
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
-- Read tables; keep tx open during snapshot.
Oracle
-- Requirements (once/admin):
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;  -- minimal or ALL columns for PK-less

-- Boundary:
SELECT CURRENT_SCN FROM V$DATABASE;        -- record
-- Consistent reads use UNDO/flashback; ensure UNDO retention covers snapshot.
How consistent reads behave
  • PostgreSQL: REPEATABLE READ + pg_export_snapshot() keeps scans stable; long scans can delay VACUUM, so watch table bloat alerts.
  • MySQL/InnoDB: START TRANSACTION WITH CONSISTENT SNAPSHOT anchors to a binlog positionβ€”no FLUSH TABLES WITH READ LOCK required for OLTP safety.
  • SQL Server: CDC tables + max LSN establish the boundary; snapshot isolation or read committed snapshot keeps readers from blocking writers.
  • Oracle: A saved SCN plus UNDO retention gives you flashback reads; make sure UNDO window covers the run.

Snapshot modes (choose intentionally)

  • initial: Do a snapshot then continue streaming (default for first run).
  • initial_only: One-time bulk load, no streaming (bootstraps batch targets).
  • schema_only / never: Skip data snapshot (for pre-seeded targets or special cases).
  • incremental snapshot: Chunk tables online (no global pause) using low/high watermarks and primary-key ranges; supports resuming mid-table.

Performance: chunking & throughput

  • Chunking: Read large tables in PK-ordered chunks (5k–200k rows) to bound memory and reduce long transactions. Prefer monotonic PKs; otherwise use synthetic chunking keys.
  • Parallelism: Limit concurrent table snapshots to protect OLTP (1–3 in parallel).
  • Throttling: Use fetch size / max rows per poll to keep steady pressure, not bursts.
  • Replica reads: For MySQL/PG, consider reading from a replica to offload the primary (ensure replication lag is acceptable and log positions still align).

Operational guardrails (day-one checks)

  • Log retention: Ensure WAL/binlog/T-log won’t be purged before snapshot finishes. Alert on backlog age and size.
  • Privileges: Replication/log-read permissions; publication/slot creation where applicable.
  • DDL during snapshot: Prefer compatibility rules (schema registry) and deploy windows; avoid table rebuilds that invalidate consistent reads.
  • Sink idempotency: Upsert/delete semantics in the warehouse/search index must be idempotent so replays (post-snapshot) don’t duplicate data.
  • Resume safely: Snapshots should be restartable: resume from last completed chunk and the recorded high-watermark.

Idempotent sink patterns

Use UPSERT/MERGE semantics and explicit delete handling.

Warehouse MERGE example
MERGE INTO dw.customers AS t
USING staging.customers AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET
  name = s.name,
  email = s.email,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, name, email, updated_at)
VALUES (s.customer_id, s.name, s.email, s.updated_at);

Observability: what to track

  • Source log lag (bytes/time behind) and slot/binlog growth.
  • Snapshot throughput (rows/s), chunk duration, ETA by table.
  • Sink merge latency and deadlocks.
  • Error budget: % chunks retried; time to resume after failure.

Failure modes β†’ actions

  • Long snapshots block VACUUM/UNDO: shrink chunk size; run on a replica; pause between chunks.
  • Log purged mid-snapshot: restart from last completed chunk; increase retention; consider replica.
  • No primary key: define synthetic key or dedupe in sink (hash + latest timestamp).
  • DDL during snapshot: retry chunk with stable schema; versioned views; schema-compat rules.
  • Replica lag: validate boundary alignment; if laggy, snapshot from primary or temporarily gate OLTP.

Hands-on: prove no gaps/dupes (β‰ˆ15 min)

  1. Pick your database and run the β€œboundary + consistent read” snippet.
  2. While snapshot runs, perform writes in a second session: insert (id=3), update (id=2).
  3. After handoff, insert (id=4). Verify sink via MERGE and run a dup check.
Dup check query
SELECT id, COUNT(*) AS c
FROM dw.orders
GROUP BY id
HAVING COUNT(*) > 1;

Common gotchas

Snapshotting Knowledge Check

Test your understanding of CDC snapshotting concepts and best practices.

Q1

What is the primary challenge when performing an initial snapshot in CDC?

Q2

What is a 'high-watermark' in the context of CDC snapshotting?

Q3

Why is it important to set log retention before starting a snapshot?

Q4

What is the recommended approach for snapshotting very large tables?

Q5

What happens after the snapshot completes in a well-designed CDC pipeline?

0/5 correct
Progress 0% No progress yet
Progress is stored locally in this browser.