First 15 Minutes: CDC Troubleshooting
A repeatable triage to stabilize incidents fast. Keep it pragmatic, measurable, and reversible.
Quick Triage (5–8 minutes)
- Freeze changes that cause state drift: pause backfills, schema changes, and connector restarts.
- Define the failure mode: stuck (no progress), slow (lag growing), wrong (duplicates, missing rows), or crashing.
- Bound the blast radius: single table/tenant vs global; snapshot vs stream.
- Record evidence (timestamps, connector name, offsets, group id). When you tail logs, write down the first error lines you see before they scroll away.
- Pick a safety net: if source logs are at risk of expiring, extend retention (WAL/binlog/redo) before touching the pipeline.
Goal: stop data loss, capture proof, buy time. Detailed checks below.
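One way to make the evidence step repeatable is a small helper that opens a timestamped folder with a notes file to paste into. This is a minimal sketch: the function name `start_evidence`, the `cdc-incident-` prefix, and the `notes.txt` layout are illustrative assumptions, not part of any tool.

```shell
#!/usr/bin/env bash
# start_evidence BASE_DIR CONNECTOR GROUP_ID
# Hypothetical helper: creates BASE_DIR/cdc-incident-<UTC timestamp>/notes.txt
# and prints the new directory path.
start_evidence() {
  local base=$1 connector=$2 group_id=$3
  local ts dir
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  dir="$base/cdc-incident-$ts"
  mkdir -p "$dir" || return 1
  {
    printf 'captured_utc: %s\n' "$ts"
    printf 'connector: %s\n' "$connector"
    printf 'group_id: %s\n' "$group_id"
    printf 'failure_mode: \n'   # fill in: stuck | slow | wrong | crashing
    printf 'offsets: \n'        # fill in: last committed LSN/GTID/SCN
  } > "$dir/notes.txt"
  printf '%s\n' "$dir"
}

# Usage: dir=$(start_evidence /tmp <name> <your-sink-group>)
```

Command output and the first error lines can then be saved next to `notes.txt` in the same folder.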
Artifacts to Collect (Copy/Paste)
- Versions: source db, connector (Debezium), broker (Kafka/Redpanda), sink
- Connector config (sanitized): include snapshot mode, includes/excludes, heartbeat
- Exact error lines: 20–50 lines around the first failure
- Lag/offsets: consumer group lag and last committed LSN/GTID/SCN
- Log retention settings: WAL/binlog/redo retention and current oldest log
Commands
Kafka (Consumer Lag)
kafka-consumer-groups --bootstrap-server <host:port> \
--describe --group <your-sink-group>
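To track lag as a single number, the describe output above can be piped through a small filter. This is a sketch: `total_lag` is a hypothetical helper name, and it assumes the output contains a header row with a column named LAG. It finds that column by name because the column order differs between Kafka versions.

```shell
#!/usr/bin/env bash
# total_lag: sum the LAG column of `kafka-consumer-groups --describe` output
# read from stdin. Non-numeric values such as "-" are skipped.
total_lag() {
  awk '
    col == 0 {
      # still looking for the header row: remember where LAG sits
      for (i = 1; i <= NF; i++) if ($i == "LAG") col = i
      next
    }
    $col ~ /^[0-9]+$/ { sum += $col }
    END { print sum + 0 }
  '
}

# Usage:
#   kafka-consumer-groups --bootstrap-server <host:port> \
#     --describe --group <your-sink-group> | total_lag
```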
Connect (Connector Status)
curl -s http://<connect-host:8083>/connectors | jq
curl -s http://<connect-host:8083>/connectors/<name>/status | jq
Stabilize First
- Ensure source log retention ≥ expected time to fix + snapshot duration
- Enable DLQ (or equivalent) to prevent silent drops
- Reduce parallelism if source is choking (snapshot & stream task counts)
- If duplicates are appearing, convert sink writes to idempotent upserts (MERGE/UPSERT) immediately
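The retention rule at the top of this list is simple arithmetic, sketched below. The function names and the 25% safety margin are assumptions added for illustration, and hours must be whole numbers because shell arithmetic is integer-only.

```shell
#!/usr/bin/env bash
# required_retention_seconds FIX_HOURS SNAPSHOT_HOURS [MARGIN_PCT]
# Prints (fix + snapshot) in seconds, padded by MARGIN_PCT.
# The 25% default margin is an assumption, not a rule from the source.
required_retention_seconds() {
  local fix_hours=$1 snapshot_hours=$2 margin_pct=${3:-25}
  echo $(( (fix_hours + snapshot_hours) * 3600 * (100 + margin_pct) / 100 ))
}

# retention_ok CURRENT_SECONDS FIX_HOURS SNAPSHOT_HOURS [MARGIN_PCT]
# Succeeds when the current retention covers the requirement.
retention_ok() {
  local current=$1
  shift
  [ "$current" -ge "$(required_retention_seconds "$@")" ]
}

# Usage: retention_ok 86400 4 8 && echo "retention covers the incident window"
```

On MySQL 8.0+ the value to compare against is `binlog_expire_logs_seconds`. Postgres retains WAL by slot position and size rather than by time, so there the check is free disk against WAL growth rather than a seconds comparison.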
Postgres Quick Checks
Configuration + Health
-- WAL & slots
SHOW wal_level; -- should be 'logical'
SHOW max_wal_senders; -- >= number of replication clients
SHOW max_replication_slots; -- >= slots in use
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;
-- retention & pressure
SHOW wal_keep_size; -- minimum WAL kept for standbys (PG 13+; wal_keep_segments before that)
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay; -- on standby
-- lag (if using logical slot)
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_from_slot
FROM pg_replication_slots WHERE slot_name='<slot>';
Common Symptoms → First Aid
- Snapshot stalls: reduce fetch size; include fewer tables; ensure long-running transactions aren’t blocking.
- Slot disk pressure: the sink has fallen behind → raise sink throughput or pause the snapshot; never drop the slot without a plan.
- Missing updates: replica identity not full on keyless tables → set REPLICA IDENTITY FULL where needed.
MySQL Quick Checks
Configuration + Health
-- binlog mode
SHOW VARIABLES LIKE 'binlog_format'; -- should be ROW
SHOW VARIABLES LIKE 'binlog_row_image'; -- FULL or MINIMAL (know your connector's needs)
SHOW VARIABLES LIKE 'gtid_mode'; -- ON preferred
SHOW MASTER STATUS; -- file/pos + executed GTID set (SHOW BINARY LOG STATUS on MySQL 8.4+)
-- retention (8.0+)
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
-- binlog cache pressure (transactions that spilled to a temp file)
SHOW GLOBAL STATUS LIKE 'Binlog_cache_disk_use';
Common Symptoms → First Aid
- Duplicates after restart: sink not idempotent → change inserts to MERGE/UPSERT with stable pk + version.
- Lost history: binlogs expired during snapshot → extend expire window; restart from fresh snapshot.
- DDL breakage: add include.schema.changes (if supported) and verify sink DDL policy.
Oracle Quick Checks
Configuration + Health
-- supplemental logging
SELECT supplemental_log_data_min, supplemental_log_data_all FROM v$database;
-- database log mode
SELECT log_mode FROM v$database; -- ARCHIVELOG recommended for CDC
-- redo switch & archive status
SELECT sequence#, archived, status FROM v$log ORDER BY first_time DESC FETCH FIRST 5 ROWS ONLY;
-- confirm key columns logging for keyless tables (optional)
SELECT * FROM dba_log_groups WHERE LOG_GROUP_TYPE IN ('ALL COLUMN LOGGING','PRIMARY KEY LOGGING');
Common Symptoms → First Aid
- High redo churn: throttle snapshot/table set; ensure filters aren’t too broad.
- Missed updates: missing supplemental logging for key columns → add minimal or table-level logging, resnapshot.
- Catalog mode mismatch after upgrade: re-check connector’s catalog strategy defaults before restart.
Kafka / Connect Quick Checks
Status & Lag
# list & inspect
curl -s http://<connect:8083>/connectors | jq
curl -s http://<connect:8083>/connectors/<name>/status | jq
# consumer lag (sink)
kafka-consumer-groups --bootstrap-server <host:port> \
--describe --group <your-sink-group>
# safe restart of a sick task
curl -s -XPOST http://<connect:8083>/connectors/<name>/tasks/0/restart
Common Symptoms → First Aid
- Connector crash loops: identify first error, enable DLQ (errors.tolerance=all + DLQ topic), then fix offending table.
- Slow ingestion: increase tasks up to source limits; ensure topic partitions ≥ parallelism; watch sink bottlenecks first.
- Out-of-order within key: ensure producer uses key-aware partitioner; keep per-key ordering on the sink merge path.
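For the crash-loop case, the DLQ settings can be generated as a properties fragment and merged into the connector config. This is a sketch: `dlq_properties` is a hypothetical helper and the topic name is yours to choose. The property names are standard Kafka Connect error-handling settings, and Connect applies the dead letter queue to sink connectors only.

```shell
#!/usr/bin/env bash
# dlq_properties DLQ_TOPIC
# Prints sink-connector error-handling settings that route bad records to
# DLQ_TOPIC instead of failing the task.
dlq_properties() {
  local topic=$1
  cat <<EOF
errors.tolerance=all
errors.deadletterqueue.topic.name=$topic
errors.deadletterqueue.context.headers.enable=true
errors.log.enable=true
EOF
}

# Usage: dlq_properties <dlq-topic>
```

With context headers enabled, each DLQ record carries the failure reason, which makes it easier to find the offending table.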
Sink Verification (Duplicates / Missing Rows)
Duplicate Primary Keys
-- generic template (replace table/pk)
SELECT COUNT(*) AS rows,
COUNT(DISTINCT pk) AS distinct_keys
FROM target_table;
Latest-Wins Check
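-- per business key, do we keep the latest op_ts/version?
-- lists keys that still hold more than one event after the settle window
SELECT key_col, MAX(op_ts) AS last_ts, COUNT(*) AS events
FROM staging_or_history
GROUP BY key_col
HAVING COUNT(*) > 1
AND MAX(op_ts) < NOW() - INTERVAL '0 seconds'; -- raise the interval to skip keys still receiving events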
If duplicates exist, convert the sink write to an idempotent MERGE keyed on a stable id + version/op_ts and re-run the last N events.
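The reason a MERGE keyed on id + version is safe to replay can be shown on an exported event file. The sketch below models latest-wins in shell rather than in the sink's SQL, purely as an illustration. The helper name `latest_wins` and the tab-separated `key, version, payload` layout are assumptions, and versions are treated as numbers.

```shell
#!/usr/bin/env bash
# latest_wins: read "key<TAB>version<TAB>payload" lines on stdin and print one
# line per key, holding the highest version seen. A row only replaces the held
# row when its version is strictly greater, so replaying events that were
# already applied changes nothing.
latest_wins() {
  awk -F '\t' '
    !($1 in ver) || ($2 + 0) > ver[$1] { ver[$1] = $2 + 0; row[$1] = $0 }
    END { for (k in row) print row[k] }
  ' | LC_ALL=C sort
}

# Usage: latest_wins < events.tsv > state_once.tsv
#        cat events.tsv events.tsv | latest_wins > state_twice.tsv
#        cmp state_once.tsv state_twice.tsv   # identical when the merge is idempotent
```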
When to Escalate
- Source log gap detected (WAL/binlog/redo missing) and you can’t reconstruct from another source → plan a clean resnapshot.
- Schema/key change incompatible with current sink merge logic → schedule maintenance window for transform + backfill.
- Security/privileges prevent enabling required logging → involve DBAs (don’t keep retrying the connector).
Build Troubleshooting Muscle Memory
Want to practice these scenarios in a safe lab environment? Check out our Failure Scenario Drills — hands-on exercises that let you intentionally break your CDC pipeline and learn how to fix it.
- Backpressure simulation: Shut down sink, observe lag metrics
- DLQ exercise: Inject bad data, trigger serialization errors
- Schema drift: Test incompatible column changes
- Offset replay: Wipe offsets, observe snapshot + stream behavior
Running these drills quarterly keeps your incident response sharp and validates your monitoring setup.
Acceptance for “Stabilized”
- Consumer lag is flat or shrinking for 30 minutes
- Latest offsets/LSN/GTID/SCN are advancing
- DLQ is empty or only contains triaged, expected errors
- Duplicate-PK query shows rows == distinct_keys
- A controlled connector restart does not create new duplicates
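The first criterion can be checked mechanically from a series of lag readings. This is a sketch under a strict assumption: "flat or shrinking" is read as "no reading exceeds the one before it", which may be too strict for noisy lag. The helper name `lag_flat_or_shrinking` is hypothetical.

```shell
#!/usr/bin/env bash
# lag_flat_or_shrinking: read one total-lag reading per line (oldest first)
# from stdin. Succeeds when there are at least two readings and none is
# larger than the reading before it.
lag_flat_or_shrinking() {
  awk '
    !/^[0-9]+$/ { next }                      # ignore blank or non-numeric lines
    seen && ($1 + 0) > prev { grew = 1 }
    { prev = $1 + 0; seen = 1; n++ }
    END { exit ((n < 2 || grew) ? 1 : 0) }
  '
}

# Usage: collect one reading per minute for 30 minutes into lag.txt, then
#   lag_flat_or_shrinking < lag.txt && echo "lag criterion met"
```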