First 15 Minutes: CDC Troubleshooting

A repeatable triage to stabilize incidents fast. Keep it pragmatic, measurable, and reversible.

Quick Triage (5–8 minutes)

  1. Freeze changes that cause state drift: pause backfills, schema changes, and connector restarts.
  2. Define the failure mode: stuck (no progress), slow (lag growing), wrong (duplicates, missing rows), or crashing.
  3. Bound the blast radius: single table/tenant vs global; snapshot vs stream.
  4. Record evidence (timestamps, connector name, offsets, group id). Don’t just tail logs: write down the first error lines you see.
  5. Pick a safety margin: if source logs are at risk of rotating out, extend retention (WAL/binlog/redo) before touching the pipeline.

Goal: stop data loss, capture proof, buy time. Detailed checks below.
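
The failure modes in step 2 can usually be told apart from a handful of samples. The sketch below is one way to do that, assuming you have noted the task state plus a few committed offsets and lag readings over time. `classify_failure_mode` and its parameters are hypothetical names, not part of any connector API, and the "wrong" mode is left out because it needs a data comparison at the sink.

```python
from typing import Sequence


def classify_failure_mode(
    task_state: str,
    committed_offsets: Sequence[int],
    lag_samples: Sequence[int],
) -> str:
    """Classify an incident as crashing, stuck, slow, or healthy.

    task_state: state string reported for the task, e.g. "RUNNING" or "FAILED".
    committed_offsets: committed positions sampled over time, oldest first.
    lag_samples: consumer lag sampled at the same times, oldest first.
    """
    if task_state.upper() == "FAILED":
        return "crashing"
    # Stuck: the committed position never moved while work is still queued.
    stalled = len(committed_offsets) > 1 and len(set(committed_offsets)) == 1
    if stalled and lag_samples and lag_samples[-1] > 0:
        return "stuck"
    # Slow: progress is being made, but lag rose at every sample.
    growing = all(b > a for a, b in zip(lag_samples, lag_samples[1:]))
    if len(lag_samples) > 1 and growing:
        return "slow"
    return "healthy"
```

Three or more samples a minute apart are usually enough to separate "stuck" from "slow"; a single sample cannot.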

Artifacts to Collect (Copy/Paste)

  • Versions: source db, connector (Debezium), broker (Kafka/Redpanda), sink
  • Connector config (sanitized): include snapshot mode, includes/excludes, heartbeat
  • Exact error lines: 20–50 lines around the first failure
  • Lag/offsets: consumer group lag and last committed LSN/GTID/SCN
  • Log retention settings: WAL/binlog/redo retention and current oldest log

Commands

Kafka (Consumer Lag)

kafka-consumer-groups --bootstrap-server <host:port> \
  --describe --group <your-sink-group>
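
To record lag as evidence rather than eyeballing it, the text output of the command above can be summed per topic. This is a minimal sketch that assumes the usual tabular `--describe` output with a header row containing `TOPIC` and `LAG` columns; `lag_by_topic` is a hypothetical helper, and column layouts can differ between Kafka versions.

```python
from collections import defaultdict
from typing import Dict


def lag_by_topic(describe_output: str) -> Dict[str, int]:
    """Sum the LAG column per topic from `--describe` text output."""
    totals: Dict[str, int] = defaultdict(int)
    header = None
    for line in describe_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "TOPIC" in fields and "LAG" in fields:
            header = fields  # column order is taken from the header row
            continue
        if header is None or len(fields) < len(header):
            continue  # skip warnings and anything before the header
        row = dict(zip(header, fields))
        if row["LAG"].isdigit():  # "-" means no committed offset yet
            totals[row["TOPIC"]] += int(row["LAG"])
    return dict(totals)
```

Partitions that report `-` for lag are skipped rather than counted as zero, so a topic with no committed offsets does not show up as healthy.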

Connect (Connector Status)

curl -s http://<connect-host:8083>/connectors | jq
curl -s http://<connect-host:8083>/connectors/<name>/status | jq
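
The status response is easier to paste into an incident note once it is reduced to the components that are not running. The sketch below assumes the status JSON has the usual `connector` object and `tasks` array, with a `trace` field on failed tasks; `non_running_components` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def non_running_components(status: Dict[str, Any]) -> List[str]:
    """Return one readable line per connector/task that is not RUNNING."""
    problems: List[str] = []
    connector = status.get("connector", {})
    if connector.get("state") != "RUNNING":
        problems.append(f"connector: {connector.get('state')}")
    for task in status.get("tasks", []):
        if task.get("state") != "RUNNING":
            # Failed tasks carry a stack trace; keep only its first line.
            trace = task.get("trace", "")
            first_line = trace.splitlines()[0] if trace else ""
            line = f"task {task.get('id')}: {task.get('state')} {first_line}"
            problems.append(line.rstrip())
    return problems
```

Feed it the parsed JSON from the second curl call (for example via `json.loads`). Paused components are reported as well, since they also stop progress.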

Stabilize First

  • Ensure source logs are retained for ≥ time to fix + snapshot duration
  • Enable DLQ (or equivalent) to prevent silent drops
  • Reduce parallelism if source is choking (snapshot & stream task counts)
  • If duplicates are appearing, convert sink writes to idempotent upserts (MERGE/UPSERT) immediately
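
The retention rule in the first bullet is simple arithmetic, sketched here so it can be checked quickly during an incident. The `safety_factor` multiplier is an assumption added for illustration, not something the rule itself prescribes, and both function names are hypothetical.

```python
from datetime import timedelta


def required_retention(
    time_to_fix: timedelta,
    snapshot_duration: timedelta,
    safety_factor: float = 1.5,  # assumed headroom; tune to your risk appetite
) -> timedelta:
    """Minimum log retention: (time to fix + snapshot duration) with headroom."""
    return (time_to_fix + snapshot_duration) * safety_factor


def retention_is_sufficient(
    current_retention: timedelta,
    time_to_fix: timedelta,
    snapshot_duration: timedelta,
    safety_factor: float = 1.5,
) -> bool:
    """True if the configured retention covers the estimated recovery window."""
    needed = required_retention(time_to_fix, snapshot_duration, safety_factor)
    return current_retention >= needed
```

For example, a 4-hour fix estimate plus an 8-hour snapshot needs 18 hours of retained logs at the default factor, so a 12-hour retention window would not be enough.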

Postgres Quick Checks

Configuration + Health
-- WAL & slots
SHOW wal_level;                -- should be 'logical'
SHOW max_wal_senders;          -- >= number of replication clients
SHOW max_replication_slots;    -- >= slots in use
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;

-- retention & pressure
SHOW wal_keep_size;            -- minimum WAL kept for standbys (PG 13+; wal_keep_segments earlier)
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay; -- on standby

-- lag (if using logical slot)
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_from_slot
FROM pg_replication_slots WHERE slot_name='<slot>';

Common Symptoms → First Aid
  • Snapshot stalls: reduce the fetch size; include fewer tables; check that long-running transactions aren’t blocking.
  • Slot disk pressure: the sink has fallen behind → raise sink throughput or pause the snapshot; never drop the slot without a plan.
  • Missing updates: keyless tables without REPLICA IDENTITY FULL → set REPLICA IDENTITY FULL where needed.
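
If you only have LSN values saved in your incident notes, the retained-WAL figure can be recomputed offline. A pg_lsn is written as two hexadecimal numbers separated by a slash (high and low 32 bits of a byte position), so the difference is plain integer arithmetic. The helper names below are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between a slot's restart_lsn and the current WAL position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)
```

Comparing this number across two timestamps shows whether the slot is catching up or falling further behind.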

MySQL Quick Checks

Configuration + Health
-- binlog mode
SHOW VARIABLES LIKE 'binlog_format';       -- should be ROW
SHOW VARIABLES LIKE 'binlog_row_image';    -- FULL or MINIMAL (know your connector's needs)
SHOW VARIABLES LIKE 'gtid_mode';           -- ON preferred
SHOW MASTER STATUS;                         -- file/pos + executed GTID set; use SHOW BINARY LOG STATUS on 8.2+ (SHOW MASTER STATUS was removed in 8.4)

-- retention (8.0+)
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';

-- binlog cache pressure: transactions that spilled the binlog cache to disk
SHOW GLOBAL STATUS LIKE 'Binlog_cache_disk_use';

Common Symptoms → First Aid
  • Duplicates after restart: sink not idempotent → change inserts to MERGE/UPSERT keyed on a stable pk + version.
  • Lost history: binlogs expired during snapshot → extend the expiry window, then restart from a fresh snapshot.
  • DDL breakage: enable include.schema.changes (if supported) and verify the sink’s DDL policy.
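
The MySQL variable checks above can be turned into a quick pass/fail list. This sketch assumes the `SHOW VARIABLES` results have been collected into a dict of name to string value; `check_mysql_cdc_settings` and `snapshot_seconds` are hypothetical names, and the expiry check treats a value of 0 as "no automatic purge".

```python
from typing import Dict, List


def check_mysql_cdc_settings(variables: Dict[str, str], snapshot_seconds: int) -> List[str]:
    """Return human-readable problems found in CDC-relevant MySQL variables."""
    problems: List[str] = []
    if variables.get("binlog_format", "").upper() != "ROW":
        problems.append("binlog_format should be ROW")
    if variables.get("binlog_row_image", "").upper() not in {"FULL", "MINIMAL"}:
        problems.append("binlog_row_image should be FULL or MINIMAL")
    if variables.get("gtid_mode", "").upper() != "ON":
        problems.append("gtid_mode is not ON (preferred, not required)")
    # Binlogs that expire before the snapshot finishes mean lost history.
    expire = int(variables.get("binlog_expire_logs_seconds", "0"))
    if 0 < expire < snapshot_seconds:
        problems.append("binlog_expire_logs_seconds is shorter than the snapshot")
    return problems
```

An empty list means the basics are in place; it does not prove the connector's own requirements (for example a specific row image) are met.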

Oracle Quick Checks

Configuration + Health
-- supplemental logging
SELECT supplemental_log_data_min, supplemental_log_data_all FROM v$database;
-- database log mode
SELECT log_mode FROM v$database;            -- ARCHIVELOG recommended for CDC
-- redo switch & archive status
SELECT sequence#, archived, status FROM v$log ORDER BY first_time DESC FETCH FIRST 5 ROWS ONLY;
-- confirm key columns logging for keyless tables (optional)
SELECT * FROM dba_log_groups WHERE LOG_GROUP_TYPE IN ('ALL COLUMN LOGGING','PRIMARY KEY LOGGING');

Common Symptoms → First Aid
  • High redo churn: throttle snapshot/table set; ensure filters aren’t too broad.
  • Missed updates: supplemental logging missing for key columns → add minimal or table-level logging, then re-snapshot.
  • Catalog mode mismatch after upgrade: re-check connector’s catalog strategy defaults before restart.

Kafka / Connect Quick Checks

Status & Lag
# list & inspect
curl -s http://<connect:8083>/connectors | jq
curl -s http://<connect:8083>/connectors/<name>/status | jq

# consumer lag (sink)
kafka-consumer-groups --bootstrap-server <host:port> \
  --describe --group <your-sink-group>

# safe restart of a sick task
curl -s -XPOST http://<connect:8083>/connectors/<name>/tasks/0/restart

Common Symptoms → First Aid
  • Connector crash loops: identify the first error, enable a DLQ on sink connectors (errors.tolerance=all plus errors.deadletterqueue.topic.name), then fix the offending table.
  • Slow ingestion: increase tasks up to source limits; ensure topic partitions ≥ parallelism; rule out sink bottlenecks first.
  • Out-of-order within key: ensure the producer uses a key-aware partitioner; keep per-key ordering on the sink merge path.
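
One quick way to test the out-of-order hypothesis is to see whether any key shows up in more than one partition, since Kafka only orders records within a partition. The sketch assumes you have sampled (key, partition) pairs from the topic by some means; `keys_spanning_partitions` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def keys_spanning_partitions(records: Iterable[Tuple[str, int]]) -> Dict[str, Set[int]]:
    """Return keys that were observed in more than one partition."""
    seen: Dict[str, Set[int]] = defaultdict(set)
    for key, partition in records:
        seen[key].add(partition)
    return {key: parts for key, parts in seen.items() if len(parts) > 1}
```

A non-empty result is a lead, not proof: a key can also move partitions legitimately if the topic's partition count was changed.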

Sink Verification (Duplicates / Missing Rows)

Duplicate Primary Keys

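-- generic template (replace table/pk); duplicates exist when total_rows > distinct_keys
-- (alias is total_rows because ROWS is a reserved word in some engines, e.g. MySQL 8)
SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT pk) AS distinct_keys
FROM target_table;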

Latest-Wins Check

-- per business key: list keys with more than one event, with their latest op_ts,
-- so the surviving row in the target can be compared against last_ts
SELECT key_col, MAX(op_ts) AS last_ts, COUNT(*) AS events
FROM staging_or_history
GROUP BY key_col
HAVING COUNT(*) > 1;

If duplicates exist, convert the sink write to an idempotent MERGE keyed on a stable id + version/op_ts and re-run the last N events.
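
As an illustration of what "idempotent" means here, the sketch below replays the same events twice against an in-memory SQLite table and ends with one row per key holding the latest version. SQLite stands in for the real sink, and the table and column names are made up. MERGE syntax differs per warehouse, so treat this as the shape of the logic rather than a statement to copy.

```python
import sqlite3

# Upsert keyed on id; only overwrite when the incoming event is newer.
UPSERT = """
INSERT INTO target (id, payload, op_ts) VALUES (?, ?, ?)
ON CONFLICT(id) DO UPDATE SET
    payload = excluded.payload,
    op_ts   = excluded.op_ts
WHERE excluded.op_ts > target.op_ts
"""


def apply_events(conn: sqlite3.Connection, events) -> None:
    """Apply (id, payload, op_ts) events; safe to call again with the same events."""
    conn.executemany(UPSERT, events)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, payload TEXT, op_ts INTEGER)")

events = [(1, "v1", 100), (1, "v2", 200), (2, "a", 150)]
apply_events(conn, events)
apply_events(conn, events)                # replay: no duplicates, no regression
apply_events(conn, [(1, "stale", 50)])    # older out-of-order event is ignored

rows = conn.execute("SELECT id, payload, op_ts FROM target ORDER BY id").fetchall()
print(rows)
```

The strict `>` comparison means events with equal op_ts keep the first write; if your source can emit ties, add a tie-breaker such as a sequence number to the version.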

When to Escalate

Build Troubleshooting Muscle Memory

Want to practice these scenarios in a safe lab environment? Check out our Failure Scenario Drills — hands-on exercises that let you intentionally break your CDC pipeline and learn how to fix it.

Running these drills quarterly keeps your incident response sharp and validates your monitoring setup.

Acceptance for “Stabilized”