First 15 Minutes: CDC Troubleshooting

A repeatable triage to stabilize incidents fast. Keep it pragmatic, measurable, and reversible.

Quick Triage (5–8 minutes)

Freeze changes that make state drift: pause backfills, schema changes, and connector restarts.
Define the failure mode: stuck (no progress), slow (lag growing), wrong (duplicates, missing rows), or crashing.
Bound the blast radius: single table/tenant vs global; snapshot vs stream.
Checkpoint evidence (timestamps, connector name, offsets, group id). Write down the first error lines as you see them; tailing logs without notes loses evidence.
Choose a safety net: if source logs are at risk of rotating out, extend retention (WAL/binlog/redo) before touching the pipeline.

Goal: stop data loss, capture proof, buy time. Detailed checks below.
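
As a rough illustration of the "define the failure mode" step, the sketch below classifies a short series of observations as crashing, stuck, slow, or progressing. The `Sample` fields and the `classify` function are hypothetical names, not part of any connector API, and the "wrong" mode (duplicates, missing rows) is out of scope here because it needs sink-side queries.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One triage observation (field names are illustrative assumptions)."""
    committed_offset: int      # last committed offset, or LSN/GTID/SCN mapped to an int
    lag: int                   # consumer-group lag observed at the same moment
    task_failed: bool = False  # True if the connector or a task reported FAILED


def classify(samples: list[Sample]) -> str:
    """Return 'crashing', 'stuck', 'slow', or 'progressing' for time-ordered samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to see a trend")
    if any(s.task_failed for s in samples):
        return "crashing"
    first, last = samples[0], samples[-1]
    if last.committed_offset == first.committed_offset:
        return "stuck"        # no progress at all
    if last.lag > first.lag:
        return "slow"         # moving, but falling further behind
    return "progressing"      # moving, and lag is flat or shrinking
```

Two samples a minute or two apart are usually enough to separate stuck from slow; take more if the lag is noisy.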

Artifacts to Collect (Copy/Paste)

Versions: source db, connector (Debezium), broker (Kafka/Redpanda), sink
Connector config (sanitized): include snapshot mode, includes/excludes, heartbeat
Exact error lines: 20–50 lines around the first failure
Lag/offsets: consumer group lag and last committed LSN/GTID/SCN
Log retention settings: WAL/binlog/redo retention and current oldest log

Commands

Kafka (Consumer Lag)

kafka-consumer-groups --bootstrap-server <host:port> \
  --describe --group <your-sink-group>

Connect (Connector Status)

curl -s http://<connect-host:8083>/connectors | jq
curl -s http://<connect-host:8083>/connectors/<name>/status | jq

Stabilize First

Ensure source logs are retained for at least (time to fix + snapshot duration)
Enable a DLQ (or equivalent) to prevent silent drops; in Kafka Connect, the built-in DLQ applies to sink connectors only
Reduce parallelism if source is choking (snapshot & stream task counts)
If duplicates are appearing, convert sink writes to idempotent upserts (MERGE/UPSERT) immediately
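
The retention rule above (retained logs ≥ time to fix + snapshot duration) is simple arithmetic. A minimal sketch follows, with a `safety_factor` added as an assumption of this example rather than something the rule itself requires.

```python
from datetime import timedelta


def required_retention(time_to_fix: timedelta,
                       snapshot_duration: timedelta,
                       safety_factor: float = 1.5) -> timedelta:
    """Minimum WAL/binlog/redo retention; safety_factor is an illustrative margin."""
    if safety_factor < 1:
        raise ValueError("safety_factor must be >= 1")
    return (time_to_fix + snapshot_duration) * safety_factor


def retention_is_safe(current_retention: timedelta,
                      time_to_fix: timedelta,
                      snapshot_duration: timedelta,
                      safety_factor: float = 1.5) -> bool:
    """True if the configured retention covers the fix plus a full resnapshot."""
    needed = required_retention(time_to_fix, snapshot_duration, safety_factor)
    return current_retention >= needed
```

For example, a 2-hour fix plus a 6-hour snapshot with the default margin calls for 12 hours of retained logs.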

Postgres Quick Checks

Configuration + Health

-- WAL & slots
SHOW wal_level;                -- should be 'logical'
SHOW max_wal_senders;          -- >= number of replication clients
SHOW max_replication_slots;    -- >= slots in use
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;

-- retention & pressure
SHOW wal_keep_size;            -- minimum WAL kept for standbys (PG 13+; wal_keep_segments before that)
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay; -- on standby

-- lag (if using logical slot)
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_from_slot
FROM pg_replication_slots WHERE slot_name='<slot>';
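
If you only have the raw LSN strings (for example, copied into an incident note), the same byte distance can be computed offline. This sketch assumes the usual `X/Y` hexadecimal LSN text format, where the part before the slash is the high 32 bits; the function names are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '0/16B3748' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between the slot's restart_lsn and the current write position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)
```

Comparing two readings of `retained_bytes` taken a few minutes apart shows whether the slot is catching up or falling further behind.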

Common Symptoms → First Aid

Snapshot stalls: reduce fetch size; include fewer tables; ensure long-running transactions aren’t blocking.
Slot disk pressure: sink fell behind → raise sink throughput or pause the snapshot; never drop the slot without a plan.
Missing updates: replica identity not full on keyless tables → set REPLICA IDENTITY FULL where needed.

MySQL Quick Checks

Configuration + Health

-- binlog mode
SHOW VARIABLES LIKE 'binlog_format';       -- should be ROW
SHOW VARIABLES LIKE 'binlog_row_image';    -- FULL or MINIMAL (know your connector's needs; Debezium expects FULL)
SHOW VARIABLES LIKE 'gtid_mode';           -- ON preferred
SHOW MASTER STATUS;                         -- file/pos + executed GTID set (SHOW BINARY LOG STATUS on MySQL 8.4+)

-- retention (8.0+)
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';

-- binlog cache pressure (transactions that spilled to a temp file)
SHOW GLOBAL STATUS LIKE 'Binlog_cache_disk_use';

Common Symptoms → First Aid

Duplicates after restart: sink not idempotent → change inserts to MERGE/UPSERT with stable pk + version.
Lost history: binlogs expired during snapshot → extend expire window; restart from fresh snapshot.
DDL breakage: confirm include.schema.changes is enabled (if supported) and verify sink DDL policy.

Oracle Quick Checks

Configuration + Health

-- supplemental logging
SELECT supplemental_log_data_min, supplemental_log_data_all FROM v$database;
-- database log mode
SELECT log_mode FROM v$database;            -- ARCHIVELOG (typically required for log-based CDC such as LogMiner)
-- redo switch & archive status
SELECT sequence#, archived, status FROM v$log ORDER BY first_time DESC FETCH FIRST 5 ROWS ONLY;
-- confirm key columns logging for keyless tables (optional)
SELECT * FROM dba_log_groups WHERE LOG_GROUP_TYPE IN ('ALL COLUMN LOGGING','PRIMARY KEY LOGGING');

Common Symptoms → First Aid

High redo churn: throttle snapshot/table set; ensure filters aren’t too broad.
Missed updates: missing supplemental logging for key columns → add minimal or table-level logging, resnapshot.
Catalog mode mismatch after upgrade: re-check connector’s catalog strategy defaults before restart.

Kafka / Connect Quick Checks

Status & Lag

# list & inspect
curl -s http://<connect:8083>/connectors | jq
curl -s http://<connect:8083>/connectors/<name>/status | jq

# consumer lag (sink)
kafka-consumer-groups --bootstrap-server <host:port> \
  --describe --group <your-sink-group>

# safe restart of a sick task
curl -s -XPOST http://<connect:8083>/connectors/<name>/tasks/0/restart
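
To decide which task to restart, the status JSON can be filtered for failed tasks. The sketch below assumes the status response shape as commonly returned by the Kafka Connect REST API (a `tasks` array whose entries carry `id`, `state`, and, for failures, `trace`); verify the shape against your Connect version. The function names are hypothetical.

```python
import json


def failed_tasks(status_json: str) -> list[int]:
    """Return ids of tasks whose state is FAILED in a connector status payload."""
    status = json.loads(status_json)
    return [t["id"] for t in status.get("tasks", []) if t.get("state") == "FAILED"]


def first_trace_line(status_json: str, task_id: int) -> str:
    """Return the first line of a task's stack trace, or '' if there is none."""
    for task in json.loads(status_json).get("tasks", []):
        if task.get("id") == task_id:
            trace = task.get("trace", "")
            return trace.splitlines()[0] if trace else ""
    return ""
```

Feed it the output of the status call above, record the first trace line as evidence, and restart only the task ids it reports.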

Common Symptoms → First Aid

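Connector crash loops: identify the first error; on sink connectors, enable a DLQ (errors.tolerance=all + errors.deadletterqueue.topic.name); then fix the offending table. Source connectors have no built-in DLQ, so fix or exclude the failing table instead.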
Slow ingestion: increase tasks up to source limits; ensure topic partitions ≥ parallelism; watch sink bottlenecks first.
Out-of-order within key: ensure producer uses key-aware partitioner; keep per-key ordering on the sink merge path.

Sink Verification (Duplicates / Missing Rows)

Duplicate Primary Keys

-- generic template (replace table/pk); alias avoids ROWS, which is reserved in MySQL 8
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT pk) AS distinct_keys
FROM target_table;

Latest-Wins Check

-- keys with more than one event in staging/history; for each, the target row should carry last_ts
SELECT key_col, MAX(op_ts) AS last_ts, COUNT(*) AS events
FROM staging_or_history
GROUP BY key_col
HAVING COUNT(*) > 1;

If duplicates exist, convert the sink write to an idempotent MERGE keyed on a stable id + version/op_ts and re-run the last N events.
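
The idempotent latest-wins rule can be sketched independently of any database. In this minimal example, `apply_events` and the `(key, op_ts, payload)` event shape are assumptions made for illustration; a real sink would express the same comparison in a MERGE/UPSERT condition.

```python
def apply_events(target: dict, events) -> dict:
    """Apply (key, op_ts, payload) events so that only the newest op_ts per key survives.

    target maps key -> (op_ts, payload). Replaying the same events, or receiving
    older events late, leaves the target unchanged.
    """
    for key, op_ts, payload in events:
        current = target.get(key)
        if current is None or op_ts > current[0]:
            target[key] = (op_ts, payload)
    return target


# Usage: replaying the batch a second time does not change the result.
events = [("a", 1, "v1"), ("a", 3, "v3"), ("a", 2, "v2"), ("b", 5, "x")]
state = apply_events({}, events)
replayed = apply_events(dict(state), events)
```

Because an event with an equal or older `op_ts` never overwrites the stored row, re-running the last N events after a restart is safe.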

When to Escalate

Source log gap detected (WAL/binlog/redo missing) and you can’t reconstruct from another source → plan a clean resnapshot.
Schema/key change incompatible with current sink merge logic → schedule maintenance window for transform + backfill.
Security/privileges prevent enabling required logging → involve DBAs (don’t keep retrying the connector).

Build Troubleshooting Muscle Memory

Want to practice these scenarios in a safe lab environment? Check out our Failure Scenario Drills — hands-on exercises that let you intentionally break your CDC pipeline and learn how to fix it.

Backpressure simulation: Shut down sink, observe lag metrics
DLQ exercise: Inject bad data, trigger serialization errors
Schema drift: Test incompatible column changes
Offset replay: Wipe offsets, observe snapshot + stream behavior

Running these drills quarterly keeps your incident response sharp and validates your monitoring setup.

Acceptance for “Stabilized”

Consumer lag is flat or shrinking for 30 minutes
Latest offsets/LSN/GTID/SCN are advancing
DLQ is empty or only contains triaged, expected errors
Duplicate-PK query shows the row count equal to distinct_keys
A controlled connector restart does not create new duplicates
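
Some of these criteria can be checked mechanically. The sketch below is one possible reading of "flat or shrinking for 30 minutes": no lag sample within the window is higher than the one before it. The function names, the `(epoch_seconds, lag)` sample shape, and the `untriaged_dlq` counter are all assumptions of this example.

```python
def lag_flat_or_shrinking(samples, window_s: int = 30 * 60) -> bool:
    """samples: time-ordered list of (epoch_seconds, lag) pairs."""
    if not samples:
        return False
    end_ts = samples[-1][0]
    start_ts = end_ts - window_s
    if samples[0][0] > start_ts:
        return False  # history does not cover the full window yet
    window = [lag for ts, lag in samples if ts >= start_ts]
    return all(later <= earlier for earlier, later in zip(window, window[1:]))


def is_stabilized(samples, row_count: int, distinct_keys: int,
                  untriaged_dlq: int = 0) -> bool:
    """Combine the lag trend, the duplicate-PK result, and the DLQ triage state."""
    return (lag_flat_or_shrinking(samples)
            and row_count == distinct_keys
            and untriaged_dlq == 0)
```

The advancing-offsets and controlled-restart criteria still need a manual check; this only covers the parts that reduce to numbers.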