Failure Scenario Drills — CDC Troubleshooting Practice

📦 Download Drill Bundle

Get all four failure drills as self-contained shell scripts. No more copy-pasting commands—run complete scenarios with a single script, perfect for workshops and training.

⬇️ Download drill-bundle.zip (12 KB) 📖 View README

📋 Package Details

Version: 1.0.0

SHA-256: 1106a755ccd513cea429aaf5d24e8432437e69e310e10f1d783bb1167be25de5

Contents: 4 drill scripts + README + config examples

Last Updated: 2025-11-07

Why Run Failure Drills?

Build muscle memory: Know what metrics spike before stakeholders notice
Validate monitoring: Confirm your dashboards catch real problems
Test recovery procedures: Practice rewinds and replays without pressure
Document patterns: Create runbooks based on actual behavior, not guesswork

Run these in a dedicated lab environment. Never execute failure drills in production unless part of a controlled chaos engineering exercise with stakeholder approval.

Prerequisites

Working CDC lab setup (see Hands-On Lab)
Docker Compose stack running (Postgres, Kafka, Connect)
Active Debezium connector streaming changes
Access to Kafka CLI tools (kafka-console-consumer, kafka-consumer-groups)
Monitoring dashboard or ability to query connector metrics

Drill #1

Backpressure Simulation: Sink Shutdown

Duration: 15 min

Learning Goals

Observe consumer lag metrics when downstream systems are unavailable
Understand how Kafka buffers changes during sink outages
Identify at what point lag becomes critical
Practice monitoring and alerting thresholds

Setup

Baseline metrics: Record current consumer lag and throughput

kafka-consumer-groups --bootstrap-server localhost:9092 \
  --describe --group sink-group-inventory

Enable DLQ (optional but recommended): Configure your sink connector with DLQ settings

Execute

Stop the sink connector:

# If using JDBC Sink
curl -X DELETE http://localhost:8083/connectors/jdbc-sink-connector

# Or pause it
curl -X PUT http://localhost:8083/connectors/jdbc-sink-connector/pause

Generate source changes: Insert, update, or delete rows in source database

docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "INSERT INTO public.app_customer (name, email) VALUES ('Test User', 'test@example.com');"

Watch lag grow: Monitor consumer group lag every 30 seconds

# Watch lag in real-time (Ctrl+C to stop)
watch -n 5 "kafka-consumer-groups --bootstrap-server localhost:9092 \
  --describe --group sink-group-inventory | grep -v 'Consumer group'"

Observe & Document

Metric	Expected Behavior	Your Observation
Consumer Lag	Increases with each source change
Source Connector Status	Remains RUNNING (still producing to Kafka)
Topic Size	Messages accumulate in topic

Recovery

# Resume or recreate the sink connector
curl -X PUT http://localhost:8083/connectors/jdbc-sink-connector/resume

# Or re-register it
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d @sink-config.json

Validation Checklist

Lag returns to zero within expected time
All accumulated changes applied to sink
No duplicate rows in sink (if using idempotent writes)
Monitoring alerts fired when lag exceeded threshold

Production Implications

In production, backpressure from a slow or unavailable sink is one of the most common failure modes. Key considerations:

Retention limits: Kafka topic retention must exceed maximum expected sink downtime
Disk pressure: Monitor broker disk usage during extended sink outages
Alert tuning: Set lag alerts based on acceptable data freshness SLAs
Circuit breakers: Consider auto-pausing connectors if lag exceeds critical threshold

Drill #2

Dead Letter Queue: Serialization Errors

Duration: 20 min

Learning Goals

Trigger and identify serialization errors in the pipeline
Configure and monitor Dead Letter Queue (DLQ) topics
Practice DLQ triage and message recovery
Understand error tolerance policies

Setup

Enable DLQ on sink connector: Update connector config

{
  "name": "jdbc-sink-connector",
  "config": {
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dlq.sink.inventory",
    "errors.deadletterqueue.context.headers.enable": "true",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true"
  }
}

Create a consumer for DLQ monitoring:

kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic dlq.sink.inventory --from-beginning \
  --property print.headers=true \
  --property print.timestamp=true

Execute

Method 1: Inject incompatible data type

# Assuming sink expects integer but source sends string
docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "INSERT INTO public.problematic_table (id, age) VALUES (999, 'not-a-number');"

Method 2: Create missing target table scenario

# Rename target table to simulate missing destination
# Note: postgres-sink is the sink database container (port 5433)
docker exec -it postgres-sink psql -U sink_user -d warehouse -c \
  "ALTER TABLE public.app_customer RENAME TO app_customer_backup;"

Method 3: Insert oversized payload

# Insert a very large text field that exceeds sink column limit
# (Assuming sink has VARCHAR(255) limit and this creates a 5000 char string)
docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "INSERT INTO public.app_customer (name, email) VALUES (repeat('X', 5000), 'large@example.com');"

Observe & Triage

Check connector status:

curl -s http://localhost:8083/connectors/jdbc-sink-connector/status | jq '.tasks[].trace'

Inspect DLQ messages:

# List messages with error context
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic dlq.sink.inventory --from-beginning \
  --property print.headers=true \
  --max-messages 5

Identify error pattern: Look for headers like:
- __connect.errors.topic — Original topic
- __connect.errors.partition — Partition number
- __connect.errors.offset — Original offset
- __connect.errors.exception.class.name — Error type
- __connect.errors.exception.message — Error details

Recovery Strategies

Option A: Fix and Replay

Fix the root cause (restore table, fix schema)
Re-drive DLQ messages to original topic
Let connector reprocess

# Use kafka-console-producer with headers
# Or use a DLQ replay tool

Option B: Transform and Retry

Write script to transform bad records
Publish corrected versions to bypass DLQ
Archive original DLQ messages

Validation Checklist

DLQ captured poison pill messages
Connector remained running (didn't crash-loop)
Error headers contain actionable diagnostics
Good messages continued processing normally
DLQ alerts fired when messages arrived

Production Best Practices

Retention: Set DLQ topic retention to 7+ days for triage time
Monitoring: Alert when DLQ message count exceeds zero
Compaction: Disable log compaction on DLQ topics to preserve all errors
Partitioning: Match DLQ partition count to source topic for easier correlation
Privacy: Consider PII scrubbing in DLQ error context headers

Drill #3

Schema Drift: Incompatible Column Changes

Duration: 25 min

Learning Goals

Understand how schema changes propagate through CDC pipelines
Identify breaking vs. non-breaking schema changes
Practice schema evolution strategies
Test schema registry integration (if applicable)

Setup

Document current schema:

docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "\d public.app_customer"

Baseline a known-good record:

docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "SELECT * FROM public.app_customer LIMIT 1;"

Execute: Breaking Changes

Scenario A: Drop a column with active data

docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "ALTER TABLE public.app_customer DROP COLUMN email;"

Impact: Sink may fail if it expects email column; Debezium emits event without field

Scenario B: Change column type (incompatible)

# Change integer to text (potentially lossy)
docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "ALTER TABLE public.app_customer ALTER COLUMN id TYPE TEXT USING id::TEXT;"

Impact: Type mismatch may cause sink errors if target schema is strict

Scenario C: Add NOT NULL constraint on existing nullable column

docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "ALTER TABLE public.app_customer ADD COLUMN phone VARCHAR(20) NOT NULL DEFAULT '';"

Impact: Existing rows get default; new inserts must provide value

Execute: Compatible Changes

Scenario D: Add nullable column (forward-compatible)

docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "ALTER TABLE public.app_customer ADD COLUMN loyalty_points INTEGER DEFAULT 0;"

✓ Safe: Sink ignores unknown columns or auto-adds if using schema sync

Scenario E: Increase column width

docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "ALTER TABLE public.app_customer ALTER COLUMN name TYPE VARCHAR(500);"

✓ Safe: Widening columns rarely breaks downstream systems

Observe Behavior

Change Type	Connector Status	Message Schema	Sink Behavior
Drop column
Type change
Add NOT NULL
Add nullable

Recovery Patterns

When Schema Drift Breaks the Pipeline

Stop ingestion: Pause connector to prevent cascading failures
Assess impact: Check how many messages are affected
Choose strategy:
- Rollback: Revert schema change if caught early
- Forward fix: Align sink schema, restart connector
- Transform: Add SMT (Single Message Transform) to handle migration
Validate: Confirm historical data integrity after fix

Validation Checklist

Documented which changes are safe vs. breaking
Tested sink resilience to unknown fields
Verified DLQ captures schema mismatches
Confirmed schema registry versioning (if using)
Created runbook for schema change approval process

Schema Evolution Guidelines

Safe changes (forward-compatible):

Add nullable columns with defaults
Widen column types (VARCHAR(50) → VARCHAR(200))
Add indexes (no CDC impact)
Add optional columns to sink (if sink auto-adapts)

Risky changes (require coordination):

Drop columns (data loss; sink may error)
Rename columns (appears as drop + add)
Change types (requires transformation logic)
Add NOT NULL without defaults on high-volume tables

Best practice: Use a schema change approval process with CDC impact assessment before DDL execution.

Drill #4

Offset Wipe & Replay: Snapshot + Stream Behavior

Duration: 30 min

Learning Goals

Understand how CDC connectors manage offsets and state
Observe snapshot vs. streaming mode behavior
Practice safe offset resets and replays
Validate idempotency of sink writes during replays

Setup

Ensure idempotent sink: Verify sink uses UPSERT/MERGE, not INSERT

# Check sink connector config
curl -s http://localhost:8083/connectors/jdbc-sink-connector/config | \
  jq '.["insert.mode"]'  # should be "upsert"

Backup current state:

# Export connector offsets (Kafka Connect stores in internal topic)
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic my_connect_offsets --from-beginning --timeout-ms 5000 > offsets-backup.json

Record sink row count:

# Note: postgres-sink is the sink database container (port 5433)
docker exec -it postgres-sink psql -U sink_user -d warehouse -c \
  "SELECT COUNT(*) FROM public.app_customer;"

Execute: Offset Reset

Stop connector gracefully:

curl -X DELETE http://localhost:8083/connectors/inventory-connector
curl -X DELETE http://localhost:8083/connectors/jdbc-sink-connector

Wipe connector offsets:

# Delete Postgres replication slot (forces new snapshot)
docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "SELECT pg_drop_replication_slot('debezium');"

# Delete Connect offsets for this connector
# (In production, use offset management APIs; for lab, recreate with new name)

Re-register source connector with fresh snapshot:

curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
  "name": "inventory-connector-v2",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "snapshot.mode": "initial",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "start_data_engineer",
    "database.password": "password",
    "database.dbname": "inventory",
    "topic.prefix": "server2",
    "table.include.list": "public.app_customer",
    "slot.name": "debezium_v2"
  }
}'

Re-register sink connector to consume from new topic:

curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
  "name": "jdbc-sink-connector-v2",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "server2.public.app_customer",
    "connection.url": "jdbc:postgresql://postgres-sink:5432/warehouse",
    "connection.user": "sink_user",
    "connection.password": "password",
    "auto.create": "false",
    "auto.evolve": "true",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id",
    "table.name.format": "app_customer",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.unwrap.delete.handling.mode": "rewrite"
  }
}'

Note: The sink connector now consumes from server2.public.app_customer (matching the new source connector's topic). The insert.mode=upsert ensures idempotent writes during replay validation.

Observe Snapshot Phase

Watch connector task state:

watch -n 2 "curl -s http://localhost:8083/connectors/inventory-connector-v2/status | jq '.tasks[0].state'"

Monitor snapshot progress:

# Check connector logs for snapshot messages
docker logs -f connect 2>&1 | grep -i snapshot

Count messages during snapshot:

kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic server2.public.app_customer --from-beginning \
  --timeout-ms 10000 | wc -l

Observe Streaming Phase

Make a live change:

docker exec -it postgres psql -U start_data_engineer -d inventory -c \
  "UPDATE public.app_customer SET name = 'Updated After Snapshot' WHERE id = 1;"

Verify streaming event arrives:

kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic server2.public.app_customer \
  --offset latest --partition 0 --timeout-ms 5000

Look for "op":"u" in the JSON payload for update operation

Validate Idempotency

Check	Expected Result	Actual Result
Sink row count after replay	Same as before (no duplicates)
Duplicate PK query	`SELECT COUNT(*), COUNT(DISTINCT id)` → equal
Latest values preserved	Most recent updates not overwritten by snapshot

Duplicate Detection Query

-- Run on sink database
SELECT id, COUNT(*) AS occurrences
FROM public.app_customer
GROUP BY id
HAVING COUNT(*) > 1;

Expected: Zero rows (no duplicates)

Validation Checklist

Connector completed snapshot phase (check logs)
Streaming mode activated after snapshot
Sink has no duplicate primary keys
Latest updates preserved (not overwritten by snapshot data)
Offset reset procedure documented in runbook

Production Offset Management

When to reset offsets:

Sink data corruption requires full reload
Connector offset corruption (rare)
Major schema migration requiring resnapshot
Testing disaster recovery procedures

Safeguards:

Backup: Always export offsets before deletion
Downtime window: Coordinate with stakeholders
Idempotency: Ensure sink writes are idempotent (UPSERT/MERGE)
Validation: Run duplicate-PK queries post-replay
Monitoring: Watch lag and error rates during replay

Alternatives to full reset:

Partial offset rewind (set offset to specific timestamp/LSN)
Incremental resync (backfill only affected partitions)
Shadow sync (parallel connector for validation before cutover)

Drill Progression & Mastery

Track your troubleshooting readiness across all four drills:

Drill	First Run	Independent Repeat	Production Ready
Backpressure (Sink Shutdown)
DLQ (Serialization Errors)
Schema Drift
Offset Wipe & Replay

Mastery criteria: Execute each drill independently, document observations, and create runbooks without referring to this guide.

Next Steps

Create runbooks: Document your observations and recovery procedures for your team
Automate checks: Script the validation queries for CI/CD integration
Schedule regular drills: Run quarterly to maintain muscle memory
Expand scenarios: Add org-specific failure modes (network partitions, cloud provider outages)
Share learnings: Present findings to broader team to build collective knowledge

Troubleshooting Guide Observability Basics Offset Management Reconciliation & Offset Surgery

Safety Reminders

🔒 Never run drills in production without formal chaos engineering approval
📋 Document everything: Screenshots, metrics, timestamps, error messages
👥 Pair up: Run drills with a colleague to compare observations
⏱️ Set time limits: Don't let exploration derail planned work
🔄 Clean up thoroughly: Restore lab to known-good state after each drill