Failure Scenario Drills
Build troubleshooting fluency by intentionally breaking your CDC pipeline in controlled ways. These drills teach you what to look for when things go wrong in production.
Why Run Failure Drills?
- Build muscle memory: Know what metrics spike before stakeholders notice
- Validate monitoring: Confirm your dashboards catch real problems
- Test recovery procedures: Practice rewinds and replays without pressure
- Document patterns: Create runbooks based on actual behavior, not guesswork
Run these in a dedicated lab environment. Never execute failure drills in production unless part of a controlled chaos engineering exercise with stakeholder approval.
Prerequisites
- Working CDC lab setup (see Hands-On Lab)
- Docker Compose stack running (Postgres, Kafka, Connect)
- Active Debezium connector streaming changes
- Access to Kafka CLI tools (
kafka-console-consumer,kafka-consumer-groups) - Monitoring dashboard or ability to query connector metrics
Backpressure Simulation: Sink Shutdown
Duration: 15 minLearning Goals
- Observe consumer lag metrics when downstream systems are unavailable
- Understand how Kafka buffers changes during sink outages
- Identify at what point lag becomes critical
- Practice monitoring and alerting thresholds
Setup
-
Baseline metrics: Record current consumer lag and throughput
kafka-consumer-groups --bootstrap-server localhost:9092 \ --describe --group sink-group-inventory - Enable DLQ (optional but recommended): Configure your sink connector with DLQ settings
Execute
-
Stop the sink connector:
# If using JDBC Sink curl -X DELETE http://localhost:8083/connectors/jdbc-sink-connector # Or pause it curl -X PUT http://localhost:8083/connectors/jdbc-sink-connector/pause -
Generate source changes: Insert, update, or delete rows in source database
docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "INSERT INTO public.app_customer (name, email) VALUES ('Test User', 'test@example.com');" -
Watch lag grow: Monitor consumer group lag every 30 seconds
# Watch lag in real-time (Ctrl+C to stop) watch -n 5 "kafka-consumer-groups --bootstrap-server localhost:9092 \ --describe --group sink-group-inventory | grep -v 'Consumer group'"
Observe & Document
| Metric | Expected Behavior | Your Observation |
|---|---|---|
| Consumer Lag | Increases with each source change | |
| Source Connector Status | Remains RUNNING (still producing to Kafka) | |
| Topic Size | Messages accumulate in topic |
Recovery
# Resume or recreate the sink connector
curl -X PUT http://localhost:8083/connectors/jdbc-sink-connector/resume
# Or re-register it
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d @sink-config.json
Validation Checklist
- Lag returns to zero within expected time
- All accumulated changes applied to sink
- No duplicate rows in sink (if using idempotent writes)
- Monitoring alerts fired when lag exceeded threshold
Production Implications
In production, backpressure from a slow or unavailable sink is one of the most common failure modes. Key considerations:
- Retention limits: Kafka topic retention must exceed maximum expected sink downtime
- Disk pressure: Monitor broker disk usage during extended sink outages
- Alert tuning: Set lag alerts based on acceptable data freshness SLAs
- Circuit breakers: Consider auto-pausing connectors if lag exceeds critical threshold
Dead Letter Queue: Serialization Errors
Duration: 20 minLearning Goals
- Trigger and identify serialization errors in the pipeline
- Configure and monitor Dead Letter Queue (DLQ) topics
- Practice DLQ triage and message recovery
- Understand error tolerance policies
Setup
-
Enable DLQ on sink connector: Update connector config
{ "name": "jdbc-sink-connector", "config": { "errors.tolerance": "all", "errors.deadletterqueue.topic.name": "dlq.sink.inventory", "errors.deadletterqueue.context.headers.enable": "true", "errors.log.enable": "true", "errors.log.include.messages": "true" } } -
Create a consumer for DLQ monitoring:
kafka-console-consumer --bootstrap-server localhost:9092 \ --topic dlq.sink.inventory --from-beginning \ --property print.headers=true \ --property print.timestamp=true
Execute
-
Method 1: Inject incompatible data type
# Assuming sink expects integer but source sends string docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "INSERT INTO public.problematic_table (id, age) VALUES (999, 'not-a-number');" -
Method 2: Create missing target table scenario
# Rename target table to simulate missing destination # Note: postgres-sink is the sink database container (port 5433) docker exec -it postgres-sink psql -U sink_user -d warehouse -c \ "ALTER TABLE public.app_customer RENAME TO app_customer_backup;" -
Method 3: Insert oversized payload
# Insert a very large text field that exceeds sink column limit # (Assuming sink has VARCHAR(255) limit and this creates a 5000 char string) docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "INSERT INTO public.app_customer (name, email) VALUES (repeat('X', 5000), 'large@example.com');"
Observe & Triage
-
Check connector status:
curl -s http://localhost:8083/connectors/jdbc-sink-connector/status | jq '.tasks[].trace' -
Inspect DLQ messages:
# List messages with error context kafka-console-consumer --bootstrap-server localhost:9092 \ --topic dlq.sink.inventory --from-beginning \ --property print.headers=true \ --max-messages 5 -
Identify error pattern: Look for headers like:
__connect.errors.topicβ Original topic__connect.errors.partitionβ Partition number__connect.errors.offsetβ Original offset__connect.errors.exception.class.nameβ Error type__connect.errors.exception.messageβ Error details
Recovery Strategies
Option A: Fix and Replay
- Fix the root cause (restore table, fix schema)
- Re-drive DLQ messages to original topic
- Let connector reprocess
# Use kafka-console-producer with headers
# Or use a DLQ replay tool
Option B: Transform and Retry
- Write script to transform bad records
- Publish corrected versions to bypass DLQ
- Archive original DLQ messages
Validation Checklist
- DLQ captured poison pill messages
- Connector remained running (didn't crash-loop)
- Error headers contain actionable diagnostics
- Good messages continued processing normally
- DLQ alerts fired when messages arrived
Production Best Practices
- Retention: Set DLQ topic retention to 7+ days for triage time
- Monitoring: Alert when DLQ message count exceeds zero
- Compaction: Disable log compaction on DLQ topics to preserve all errors
- Partitioning: Match DLQ partition count to source topic for easier correlation
- Privacy: Consider PII scrubbing in DLQ error context headers
Schema Drift: Incompatible Column Changes
Duration: 25 minLearning Goals
- Understand how schema changes propagate through CDC pipelines
- Identify breaking vs. non-breaking schema changes
- Practice schema evolution strategies
- Test schema registry integration (if applicable)
Setup
-
Document current schema:
docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "\d public.app_customer" -
Baseline a known-good record:
docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "SELECT * FROM public.app_customer LIMIT 1;"
Execute: Breaking Changes
-
Scenario A: Drop a column with active data
docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "ALTER TABLE public.app_customer DROP COLUMN email;"Impact: Sink may fail if it expects
emailcolumn; Debezium emits event without field -
Scenario B: Change column type (incompatible)
# Change integer to text (potentially lossy) docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "ALTER TABLE public.app_customer ALTER COLUMN id TYPE TEXT USING id::TEXT;"Impact: Type mismatch may cause sink errors if target schema is strict
-
Scenario C: Add NOT NULL constraint on existing nullable column
docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "ALTER TABLE public.app_customer ADD COLUMN phone VARCHAR(20) NOT NULL DEFAULT '';"Impact: Existing rows get default; new inserts must provide value
Execute: Compatible Changes
-
Scenario D: Add nullable column (forward-compatible)
docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "ALTER TABLE public.app_customer ADD COLUMN loyalty_points INTEGER DEFAULT 0;"β Safe: Sink ignores unknown columns or auto-adds if using schema sync
-
Scenario E: Increase column width
docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "ALTER TABLE public.app_customer ALTER COLUMN name TYPE VARCHAR(500);"β Safe: Widening columns rarely breaks downstream systems
Observe Behavior
| Change Type | Connector Status | Message Schema | Sink Behavior |
|---|---|---|---|
| Drop column | |||
| Type change | |||
| Add NOT NULL | |||
| Add nullable |
Recovery Patterns
When Schema Drift Breaks the Pipeline
- Stop ingestion: Pause connector to prevent cascading failures
- Assess impact: Check how many messages are affected
- Choose strategy:
- Rollback: Revert schema change if caught early
- Forward fix: Align sink schema, restart connector
- Transform: Add SMT (Single Message Transform) to handle migration
- Validate: Confirm historical data integrity after fix
Validation Checklist
- Documented which changes are safe vs. breaking
- Tested sink resilience to unknown fields
- Verified DLQ captures schema mismatches
- Confirmed schema registry versioning (if using)
- Created runbook for schema change approval process
Schema Evolution Guidelines
Safe changes (forward-compatible):
- Add nullable columns with defaults
- Widen column types (VARCHAR(50) β VARCHAR(200))
- Add indexes (no CDC impact)
- Add optional columns to sink (if sink auto-adapts)
Risky changes (require coordination):
- Drop columns (data loss; sink may error)
- Rename columns (appears as drop + add)
- Change types (requires transformation logic)
- Add NOT NULL without defaults on high-volume tables
Best practice: Use a schema change approval process with CDC impact assessment before DDL execution.
Offset Wipe & Replay: Snapshot + Stream Behavior
Duration: 30 minLearning Goals
- Understand how CDC connectors manage offsets and state
- Observe snapshot vs. streaming mode behavior
- Practice safe offset resets and replays
- Validate idempotency of sink writes during replays
Setup
-
Ensure idempotent sink: Verify sink uses UPSERT/MERGE, not INSERT
# Check sink connector config curl -s http://localhost:8083/connectors/jdbc-sink-connector/config | \ jq '.["insert.mode"]' # should be "upsert" -
Backup current state:
# Export connector offsets (Kafka Connect stores in internal topic) kafka-console-consumer --bootstrap-server localhost:9092 \ --topic my_connect_offsets --from-beginning --timeout-ms 5000 > offsets-backup.json -
Record sink row count:
# Note: postgres-sink is the sink database container (port 5433) docker exec -it postgres-sink psql -U sink_user -d warehouse -c \ "SELECT COUNT(*) FROM public.app_customer;"
Execute: Offset Reset
-
Stop connector gracefully:
curl -X DELETE http://localhost:8083/connectors/inventory-connector curl -X DELETE http://localhost:8083/connectors/jdbc-sink-connector -
Wipe connector offsets:
# Delete Postgres replication slot (forces new snapshot) docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "SELECT pg_drop_replication_slot('debezium');" # Delete Connect offsets for this connector # (In production, use offset management APIs; for lab, recreate with new name) -
Re-register source connector with fresh snapshot:
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{ "name": "inventory-connector-v2", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "snapshot.mode": "initial", "database.hostname": "postgres", "database.port": "5432", "database.user": "start_data_engineer", "database.password": "password", "database.dbname": "inventory", "topic.prefix": "server2", "table.include.list": "public.app_customer", "slot.name": "debezium_v2" } }' -
Re-register sink connector to consume from new topic:
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{ "name": "jdbc-sink-connector-v2", "config": { "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector", "tasks.max": "1", "topics": "server2.public.app_customer", "connection.url": "jdbc:postgresql://postgres-sink:5432/warehouse", "connection.user": "sink_user", "connection.password": "password", "auto.create": "false", "auto.evolve": "true", "insert.mode": "upsert", "pk.mode": "record_key", "pk.fields": "id", "table.name.format": "app_customer", "transforms": "unwrap", "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState", "transforms.unwrap.drop.tombstones": "false", "transforms.unwrap.delete.handling.mode": "rewrite" } }'Note: The sink connector now consumes from
server2.public.app_customer(matching the new source connector's topic). Theinsert.mode=upsertensures idempotent writes during replay validation.
Observe Snapshot Phase
-
Watch connector task state:
watch -n 2 "curl -s http://localhost:8083/connectors/inventory-connector-v2/status | jq '.tasks[0].state'" -
Monitor snapshot progress:
# Check connector logs for snapshot messages docker logs -f connect 2>&1 | grep -i snapshot -
Count messages during snapshot:
kafka-console-consumer --bootstrap-server localhost:9092 \ --topic server2.public.app_customer --from-beginning \ --timeout-ms 10000 | wc -l
Observe Streaming Phase
-
Make a live change:
docker exec -it postgres psql -U start_data_engineer -d inventory -c \ "UPDATE public.app_customer SET name = 'Updated After Snapshot' WHERE id = 1;" -
Verify streaming event arrives:
kafka-console-consumer --bootstrap-server localhost:9092 \ --topic server2.public.app_customer \ --offset latest --partition 0 --timeout-ms 5000Look for
"op":"u"in the JSON payload for update operation
Validate Idempotency
| Check | Expected Result | Actual Result |
|---|---|---|
| Sink row count after replay | Same as before (no duplicates) | |
| Duplicate PK query | SELECT COUNT(*), COUNT(DISTINCT id) β equal |
|
| Latest values preserved | Most recent updates not overwritten by snapshot |
Duplicate Detection Query
-- Run on sink database
SELECT id, COUNT(*) AS occurrences
FROM public.app_customer
GROUP BY id
HAVING COUNT(*) > 1;
Expected: Zero rows (no duplicates)
Validation Checklist
- Connector completed snapshot phase (check logs)
- Streaming mode activated after snapshot
- Sink has no duplicate primary keys
- Latest updates preserved (not overwritten by snapshot data)
- Offset reset procedure documented in runbook
Production Offset Management
When to reset offsets:
- Sink data corruption requires full reload
- Connector offset corruption (rare)
- Major schema migration requiring resnapshot
- Testing disaster recovery procedures
Safeguards:
- Backup: Always export offsets before deletion
- Downtime window: Coordinate with stakeholders
- Idempotency: Ensure sink writes are idempotent (UPSERT/MERGE)
- Validation: Run duplicate-PK queries post-replay
- Monitoring: Watch lag and error rates during replay
Alternatives to full reset:
- Partial offset rewind (set offset to specific timestamp/LSN)
- Incremental resync (backfill only affected partitions)
- Shadow sync (parallel connector for validation before cutover)
Drill Progression & Mastery
Track your troubleshooting readiness across all four drills:
| Drill | First Run | Independent Repeat | Production Ready |
|---|---|---|---|
| Backpressure (Sink Shutdown) | |||
| DLQ (Serialization Errors) | |||
| Schema Drift | |||
| Offset Wipe & Replay |
Mastery criteria: Execute each drill independently, document observations, and create runbooks without referring to this guide.
Next Steps
- Create runbooks: Document your observations and recovery procedures for your team
- Automate checks: Script the validation queries for CI/CD integration
- Schedule regular drills: Run quarterly to maintain muscle memory
- Expand scenarios: Add org-specific failure modes (network partitions, cloud provider outages)
- Share learnings: Present findings to broader team to build collective knowledge
Safety Reminders
- π Never run drills in production without formal chaos engineering approval
- π Document everything: Screenshots, metrics, timestamps, error messages
- π₯ Pair up: Run drills with a colleague to compare observations
- β±οΈ Set time limits: Don't let exploration derail planned work
- π Clean up thoroughly: Restore lab to known-good state after each drill