A CDC pipeline is designed to stream ongoing changes, but it must first be initialized with the data that already exists in the source tables. This process of performing an initial, full copy of the data is known as snapshotting or the initial load.
Why snapshots break
The central problem is consistency. Without a boundary, writes that land mid-run show up twice (snapshot + stream) or disappear entirely. Retention clocks keep ticking and, if the log purges, the only fix is to start over.
Picture a payments table: you start scanning, finance posts refunds, and analytics reconciles against a different truth hours later. The fallout is expensive clean-up, tense incident calls, and a loss of trust in the pipeline.
A staged handoff
Modern log-based CDC tools coordinate a point-in-time snapshot with the transaction log so the handoff to streaming has no gaps or double-counts. Treat the run in three phases:
Before snapshot: mark the high-watermark
Read and persist the current log position (LSN/SCN). Share it in the runbook and verify retention outlives the planned duration.
- Capture boundary metadata alongside connector configuration.
- Alert on log growth so retention doesnβt expire mid-run.
During snapshot: hold a consistent read
Use snapshot/consistent-read isolation and scan tables in deterministic chunks. The goal is predictable load that wonβt pin OLTP locks or exhaust undo segments.
- Chunk big tables in primary-key order and cap parallelism.
- Read from a replica when possible; monitor lag to stay honest.
After snapshot: stream from the boundary
Start consuming the log from the recorded position. Any rows that change post-snapshot must apply idempotently so retries are safe.
- Implement UPSERT/MERGE semantics and explicit delete handling.
- Version and store offsets so restart is a click, not a rebuild.
Snapshot Lifecycle
Plan & Pre-flight
Verify retention, privileges, and isolation settings before starting.
Confirm log retention covers the expected snapshot duration. Grant replication privileges. Enable consistent-read isolation (REPEATABLE READ or SNAPSHOT). Document rollback steps in the runbook.
Mark High-Watermark
Capture the current log position (LSN/SCN) as the snapshot boundary.
Record the exact log sequence number before scanning begins. This position marks where the snapshot ends and streaming will start. Store it alongside connector configuration for recovery.
Consistent Read
Scan tables in deterministic chunks under snapshot isolation.
Read tables using primary-key ordering. Chunk large tables to cap memory and lock duration. Use a replica if possible and monitor replication lag. Each chunk can be retried independently on failure.
Stream from Boundary
Begin consuming the transaction log from the recorded high-watermark.
The connector transitions seamlessly from snapshot to streaming mode. Changes are applied idempotently with UPSERT/MERGE semantics. Per-key ordering ensures consistency even during retries.
Validate & Monitor
Compare row counts, check lag metrics, and set up ongoing monitoring.
Run row-count reconciliation between source and target. Monitor consumer lag and offset progression. Set alerts for lag spikes or stalled offsets. Document the recovery checkpoint for future restarts.
Run it on your database
Grab the snippet for your engine to capture the boundary and hold a consistent read while Debezium (or another connector) walks the tables.
PostgreSQL
-- Requirements (once/admin):
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 10; -- adjust
SELECT pg_reload_conf();
-- Boundary + consistent read (within one tx while snapshotting):
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_current_wal_lsn(); -- record this
-- Keep transaction open during table scans; Debezium coordinates via logical slot.
MySQL (InnoDB)
-- Requirements:
-- binlog_format=ROW; binlog_row_image=FULL; GTID (preferred)
-- Boundary + consistent snapshot:
SHOW MASTER STATUS; -- or SELECT @@global.gtid_executed;
START TRANSACTION WITH CONSISTENT SNAPSHOT;
-- Read tables in PK order; Debezium avoids FTWRL in OLTP paths.
SQL Server
-- Requirements (once/admin):
ALTER DATABASE YourDb SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;
EXEC sys.sp_cdc_enable_db; -- if using native CDC
-- Boundary + snapshot:
SELECT sys.fn_cdc_get_max_lsn(); -- record
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
-- Read tables; keep tx open during snapshot.
Oracle
-- Requirements (once/admin):
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA; -- minimal or ALL columns for PK-less
-- Boundary:
SELECT CURRENT_SCN FROM V$DATABASE; -- record
-- Consistent reads use UNDO/flashback; ensure UNDO retention covers snapshot.
How consistent reads behave
-
PostgreSQL: REPEATABLE READ +
pg_export_snapshot()keeps scans stable; long scans can delay VACUUM, so watch table bloat alerts. -
MySQL/InnoDB:
START TRANSACTION WITH CONSISTENT SNAPSHOTanchors to a binlog positionβnoFLUSH TABLES WITH READ LOCKrequired for OLTP safety. - SQL Server: CDC tables + max LSN establish the boundary; snapshot isolation or read committed snapshot keeps readers from blocking writers.
- Oracle: A saved SCN plus UNDO retention gives you flashback reads; make sure UNDO window covers the run.
Snapshot modes (choose intentionally)
- initial: Do a snapshot then continue streaming (default for first run).
- initial_only: One-time bulk load, no streaming (bootstraps batch targets).
- schema_only / never: Skip data snapshot (for pre-seeded targets or special cases).
- incremental snapshot: Chunk tables online (no global pause) using low/high watermarks and primary-key ranges; supports resuming mid-table.
Performance: chunking & throughput
- Chunking: Read large tables in PK-ordered chunks (5kβ200k rows) to bound memory and reduce long transactions. Prefer monotonic PKs; otherwise use synthetic chunking keys.
- Parallelism: Limit concurrent table snapshots to protect OLTP (1β3 in parallel).
- Throttling: Use fetch size / max rows per poll to keep steady pressure, not bursts.
- Replica reads: For MySQL/PG, consider reading from a replica to offload the primary (ensure replication lag is acceptable and log positions still align).
Operational guardrails (day-one checks)
- Log retention: Ensure WAL/binlog/T-log wonβt be purged before snapshot finishes. Alert on backlog age and size.
- Privileges: Replication/log-read permissions; publication/slot creation where applicable.
- DDL during snapshot: Prefer compatibility rules (schema registry) and deploy windows; avoid table rebuilds that invalidate consistent reads.
- Sink idempotency: Upsert/delete semantics in the warehouse/search index must be idempotent so replays (post-snapshot) donβt duplicate data.
- Resume safely: Snapshots should be restartable: resume from last completed chunk and the recorded high-watermark.
Idempotent sink patterns
Use UPSERT/MERGE semantics and explicit delete handling.
Warehouse MERGE example
MERGE INTO dw.customers AS t
USING staging.customers AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET
name = s.name,
email = s.email,
updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, name, email, updated_at)
VALUES (s.customer_id, s.name, s.email, s.updated_at);
Observability: what to track
- Source log lag (bytes/time behind) and slot/binlog growth.
- Snapshot throughput (rows/s), chunk duration, ETA by table.
- Sink merge latency and deadlocks.
- Error budget: % chunks retried; time to resume after failure.
Failure modes β actions
- Long snapshots block VACUUM/UNDO: shrink chunk size; run on a replica; pause between chunks.
- Log purged mid-snapshot: restart from last completed chunk; increase retention; consider replica.
- No primary key: define synthetic key or dedupe in sink (hash + latest timestamp).
- DDL during snapshot: retry chunk with stable schema; versioned views; schema-compat rules.
- Replica lag: validate boundary alignment; if laggy, snapshot from primary or temporarily gate OLTP.
Hands-on: prove no gaps/dupes (β15 min)
- Pick your database and run the βboundary + consistent readβ snippet.
- While snapshot runs, perform writes in a second session: insert (id=3), update (id=2).
- After handoff, insert (id=4). Verify sink via MERGE and run a dup check.
Dup check query
SELECT id, COUNT(*) AS c
FROM dw.orders
GROUP BY id
HAVING COUNT(*) > 1;
Common gotchas
Snapshotting Knowledge Check
Test your understanding of CDC snapshotting concepts and best practices.
What is the primary challenge when performing an initial snapshot in CDC?
The key challenge is coordinating the snapshot with the transaction log position (LSN/SCN) to ensure a clean handoff. The snapshot must capture the state at a specific point, and streaming must begin from exactly that point to avoid gaps or duplicate records.
Review the correct answer and explanation.
What is a 'high-watermark' in the context of CDC snapshotting?
The high-watermark is a specific log position (LSN for PostgreSQL, SCN for Oracle, etc.) that marks the boundary. All changes up to this point are included in the snapshot, and streaming begins from this position forward. This ensures no data is lost or duplicated during the handoff.
Review the correct answer and explanation.
Why is it important to set log retention before starting a snapshot?
If the snapshot takes hours or days, the database might purge old transaction logs before the snapshot completes. Setting adequate log retention ensures that when streaming starts from the high-watermark, all necessary log entries are still available.
Review the correct answer and explanation.
What is the recommended approach for snapshotting very large tables?
For large tables, chunking (reading in batches based on primary key ranges) reduces memory pressure, allows progress to be saved incrementally, and respects production database workloads. This approach also enables parallelization and retry of failed chunks.
Review the correct answer and explanation.
What happens after the snapshot completes in a well-designed CDC pipeline?
After the snapshot completes, the CDC connector seamlessly transitions to streaming mode, reading changes from the transaction log starting at the high-watermark position. This creates a continuous flow of data without gaps, combining the snapshot with real-time changes.
Review the correct answer and explanation.