Snapshotting (Initial Load) | CDC: The Missing Manual

Run style	What happens mid-snapshot	Recovery story
Naive initial load	Changes land twice or not at all; log position is unknown.	Full re-run of large tables; risky ad-hoc dedupe scripts.
Snapshot with boundary	Log handoff preserves ordering; replays apply idempotently.	Resume from last chunk + saved offset; no target surgery.
Incremental snapshot	Primary-key ranges throttle load while respecting OLTP.	Retries pick up the next range; log anchor stays valid.

A CDC pipeline is designed to stream ongoing changes, but it must first be initialized with the data that already exists in the source tables. This process of performing an initial, full copy of the data is known as snapshotting or the initial load.

Why snapshots break

The central problem is consistency. Without a boundary, writes that land mid-run show up twice (snapshot + stream) or disappear entirely. Retention clocks keep ticking and, if the log purges, the only fix is to start over.

Picture a payments table: you start scanning, finance posts refunds, and analytics reconciles against a different truth hours later. The fallout is expensive clean-up, tense incident calls, and a loss of trust in the pipeline.

A staged handoff

Modern log-based CDC tools coordinate a point-in-time snapshot with the transaction log so the handoff to streaming has no gaps or double-counts. Treat the run in three phases:

Snapshot flow diagram showing three phases: marking high watermark, performing consistent read snapshot, and streaming changes from the boundary — The three-phase handoff: mark boundary → snapshot data → stream from boundary

Before snapshot: mark the high-watermark

Read and persist the current log position (LSN/SCN). Share it in the runbook and verify retention outlives the planned duration.

Capture boundary metadata alongside connector configuration.
Alert on log growth so retention doesn’t expire mid-run.

During snapshot: hold a consistent read

Use snapshot/consistent-read isolation and scan tables in deterministic chunks. The goal is predictable load that won’t pin OLTP locks or exhaust undo segments.

Chunk big tables in primary-key order and cap parallelism.
Read from a replica when possible; monitor lag to stay honest.

After snapshot: stream from the boundary

Start consuming the log from the recorded position. Any rows that change post-snapshot must apply idempotently so retries are safe.

Implement UPSERT/MERGE semantics and explicit delete handling.
Version and store offsets so restart is a click, not a rebuild.

Snapshot Lifecycle

Plan & Pre-flight

Prepare

Verify retention, privileges, and isolation settings before starting.

Mark High-Watermark

Boundary

Capture the current log position (LSN/SCN) as the snapshot boundary.

Consistent Read

Snapshot

Scan tables in deterministic chunks under snapshot isolation.

Stream from Boundary

Handoff

Begin consuming the transaction log from the recorded high-watermark.

Validate & Monitor

Verify

Compare row counts, check lag metrics, and set up ongoing monitoring.

Run it on your database

Grab the snippet for your engine to capture the boundary and hold a consistent read while Debezium (or another connector) walks the tables.

PostgreSQL

-- Requirements (once/admin):
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 10;  -- adjust
SELECT pg_reload_conf();

-- Boundary + consistent read (within one tx while snapshotting):
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_current_wal_lsn();  -- record this
-- Keep transaction open during table scans; Debezium coordinates via logical slot.

MySQL (InnoDB)

-- Requirements:
-- binlog_format=ROW; binlog_row_image=FULL; GTID (preferred)

-- Boundary + consistent snapshot:
SHOW MASTER STATUS;                  -- or SELECT @@global.gtid_executed;
START TRANSACTION WITH CONSISTENT SNAPSHOT;
-- Read tables in PK order; Debezium avoids FTWRL in OLTP paths.

SQL Server

-- Requirements (once/admin):
ALTER DATABASE YourDb SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;
EXEC sys.sp_cdc_enable_db;  -- if using native CDC

-- Boundary + snapshot:
SELECT sys.fn_cdc_get_max_lsn();     -- record
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
-- Read tables; keep tx open during snapshot.

Oracle

-- Requirements (once/admin):
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;  -- minimal or ALL columns for PK-less

-- Boundary:
SELECT CURRENT_SCN FROM V$DATABASE;        -- record
-- Consistent reads use UNDO/flashback; ensure UNDO retention covers snapshot.

How consistent reads behave

PostgreSQL: REPEATABLE READ + pg_export_snapshot() keeps scans stable; long scans can delay VACUUM, so watch table bloat alerts.
MySQL/InnoDB: START TRANSACTION WITH CONSISTENT SNAPSHOT anchors to a binlog position—no FLUSH TABLES WITH READ LOCK required for OLTP safety.
SQL Server: CDC tables + max LSN establish the boundary; snapshot isolation or read committed snapshot keeps readers from blocking writers.
Oracle: A saved SCN plus UNDO retention gives you flashback reads; make sure UNDO window covers the run.

Snapshot modes (choose intentionally)

initial: Do a snapshot then continue streaming (default for first run).
initial_only: One-time bulk load, no streaming (bootstraps batch targets).
schema_only / never: Skip data snapshot (for pre-seeded targets or special cases).
incremental snapshot: Chunk tables online (no global pause) using low/high watermarks and primary-key ranges; supports resuming mid-table.

Performance: chunking & throughput

Chunking: Read large tables in PK-ordered chunks (5k–200k rows) to bound memory and reduce long transactions. Prefer monotonic PKs; otherwise use synthetic chunking keys.
Parallelism: Limit concurrent table snapshots to protect OLTP (1–3 in parallel).
Throttling: Use fetch size / max rows per poll to keep steady pressure, not bursts.
Replica reads: For MySQL/PG, consider reading from a replica to offload the primary (ensure replication lag is acceptable and log positions still align).

Operational guardrails (day-one checks)

Log retention: Ensure WAL/binlog/T-log won’t be purged before snapshot finishes. Alert on backlog age and size.
Privileges: Replication/log-read permissions; publication/slot creation where applicable.
DDL during snapshot: Prefer compatibility rules (schema registry) and deploy windows; avoid table rebuilds that invalidate consistent reads.
Sink idempotency: Upsert/delete semantics in the warehouse/search index must be idempotent so replays (post-snapshot) don’t duplicate data.
Resume safely: Snapshots should be restartable: resume from last completed chunk and the recorded high-watermark.

Idempotent sink patterns

Use UPSERT/MERGE semantics and explicit delete handling.

Warehouse MERGE example

MERGE INTO dw.customers AS t
USING staging.customers AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET
  name = s.name,
  email = s.email,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, name, email, updated_at)
VALUES (s.customer_id, s.name, s.email, s.updated_at);

Gotcha: Handle deletes explicitly (tombstones or a __op field). Otherwise replays produce resurrected rows.

Observability: what to track

Source log lag (bytes/time behind) and slot/binlog growth.
Snapshot throughput (rows/s), chunk duration, ETA by table.
Sink merge latency and deadlocks.
Error budget: % chunks retried; time to resume after failure.

Failure modes → actions

Long snapshots block VACUUM/UNDO: shrink chunk size; run on a replica; pause between chunks.
Log purged mid-snapshot: restart from last completed chunk; increase retention; consider replica.
No primary key: define synthetic key or dedupe in sink (hash + latest timestamp).
DDL during snapshot: retry chunk with stable schema; versioned views; schema-compat rules.
Replica lag: validate boundary alignment; if laggy, snapshot from primary or temporarily gate OLTP.

Hands-on: prove no gaps/dupes (≈15 min)

Pick your database and run the “boundary + consistent read” snippet.
While snapshot runs, perform writes in a second session: insert (id=3), update (id=2).
After handoff, insert (id=4). Verify sink via MERGE and run a dup check.

Dup check query

SELECT id, COUNT(*) AS c
FROM dw.orders
GROUP BY id
HAVING COUNT(*) > 1;

The First Hurdle: Snapshotting

Start with a clean handoff

Pre-flight checklist

Why the boundary wins

Why snapshots break

A staged handoff

Before snapshot: mark the high-watermark

During snapshot: hold a consistent read

After snapshot: stream from the boundary

Snapshot Lifecycle

Plan & Pre-flight

Mark High-Watermark

Consistent Read

Stream from Boundary

Validate & Monitor

Run it on your database

Snapshot modes (choose intentionally)

Performance: chunking & throughput

Operational guardrails (day-one checks)

Idempotent sink patterns

Observability: what to track

Failure modes → actions

Hands-on: prove no gaps/dupes (≈15 min)

Common gotchas

Snapshotting Knowledge Check

What is the primary challenge when performing an initial snapshot in CDC?

What is a 'high-watermark' in the context of CDC snapshotting?

Why is it important to set log retention before starting a snapshot?

What is the recommended approach for snapshotting very large tables?

What happens after the snapshot completes in a well-designed CDC pipeline?

Let’s Talk CDC Interactive Dashboard

The First Hurdle: Snapshotting

Start with a clean handoff

High-watermark boundary handoff

Pre-flight checklist

Why the boundary wins

Why snapshots break

A staged handoff

Before snapshot: mark the high-watermark

During snapshot: hold a consistent read

After snapshot: stream from the boundary

Snapshot Lifecycle

Plan & Pre-flight

Mark High-Watermark

Consistent Read

Stream from Boundary

Validate & Monitor

Run it on your database

Snapshot modes (choose intentionally)

Performance: chunking & throughput

Operational guardrails (day-one checks)

Idempotent sink patterns

Observability: what to track

Failure modes → actions

Hands-on: prove no gaps/dupes (≈15 min)

Common gotchas

What is the primary challenge when performing an initial snapshot in CDC?

What is a 'high-watermark' in the context of CDC snapshotting?

Why is it important to set log retention before starting a snapshot?

What is the recommended approach for snapshotting very large tables?

What happens after the snapshot completes in a well-designed CDC pipeline?