Failure Scenario Drills

Build troubleshooting fluency by intentionally breaking your CDC pipeline in controlled ways. These drills teach you what to look for when things go wrong in production.

πŸ“¦ Download Drill Bundle

Get all four failure drills as self-contained shell scripts. No more copy-pasting commandsβ€”run complete scenarios with a single script, perfect for workshops and training.

πŸ“‹ Package Details

Version: 1.0.0

SHA-256: 1106a755ccd513cea429aaf5d24e8432437e69e310e10f1d783bb1167be25de5

Contents: 4 drill scripts + README + config examples

Last Updated: 2025-11-07

Self-Contained
Shell Scripts

Why Run Failure Drills?

  • Build muscle memory: Know what metrics spike before stakeholders notice
  • Validate monitoring: Confirm your dashboards catch real problems
  • Test recovery procedures: Practice rewinds and replays without pressure
  • Document patterns: Create runbooks based on actual behavior, not guesswork

Run these in a dedicated lab environment. Never execute failure drills in production unless part of a controlled chaos engineering exercise with stakeholder approval.

Prerequisites

  • Working CDC lab setup (see Hands-On Lab)
  • Docker Compose stack running (Postgres, Kafka, Connect)
  • Active Debezium connector streaming changes
  • Access to Kafka CLI tools (kafka-console-consumer, kafka-consumer-groups)
  • Monitoring dashboard or ability to query connector metrics
Drill #1

Backpressure Simulation: Sink Shutdown

Duration: 15 min

Learning Goals

  • Observe consumer lag metrics when downstream systems are unavailable
  • Understand how Kafka buffers changes during sink outages
  • Identify at what point lag becomes critical
  • Practice monitoring and alerting thresholds

Setup

  1. Baseline metrics: Record current consumer lag and throughput
    kafka-consumer-groups --bootstrap-server localhost:9092 \
      --describe --group sink-group-inventory
  2. Enable DLQ (optional but recommended): Configure your sink connector with DLQ settings

Execute

  1. Stop the sink connector:
    # If using JDBC Sink
    curl -X DELETE http://localhost:8083/connectors/jdbc-sink-connector
    
    # Or pause it
    curl -X PUT http://localhost:8083/connectors/jdbc-sink-connector/pause
  2. Generate source changes: Insert, update, or delete rows in source database
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "INSERT INTO public.app_customer (name, email) VALUES ('Test User', 'test@example.com');"
  3. Watch lag grow: Monitor consumer group lag every 30 seconds
    # Watch lag in real-time (Ctrl+C to stop)
    watch -n 5 "kafka-consumer-groups --bootstrap-server localhost:9092 \
      --describe --group sink-group-inventory | grep -v 'Consumer group'"

Observe & Document

Metric Expected Behavior Your Observation
Consumer Lag Increases with each source change
Source Connector Status Remains RUNNING (still producing to Kafka)
Topic Size Messages accumulate in topic

Recovery

# Resume or recreate the sink connector
curl -X PUT http://localhost:8083/connectors/jdbc-sink-connector/resume

# Or re-register it
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d @sink-config.json

Validation Checklist

  • Lag returns to zero within expected time
  • All accumulated changes applied to sink
  • No duplicate rows in sink (if using idempotent writes)
  • Monitoring alerts fired when lag exceeded threshold
Production Implications

In production, backpressure from a slow or unavailable sink is one of the most common failure modes. Key considerations:

  • Retention limits: Kafka topic retention must exceed maximum expected sink downtime
  • Disk pressure: Monitor broker disk usage during extended sink outages
  • Alert tuning: Set lag alerts based on acceptable data freshness SLAs
  • Circuit breakers: Consider auto-pausing connectors if lag exceeds critical threshold
Drill #2

Dead Letter Queue: Serialization Errors

Duration: 20 min

Learning Goals

  • Trigger and identify serialization errors in the pipeline
  • Configure and monitor Dead Letter Queue (DLQ) topics
  • Practice DLQ triage and message recovery
  • Understand error tolerance policies

Setup

  1. Enable DLQ on sink connector: Update connector config
    {
      "name": "jdbc-sink-connector",
      "config": {
        "errors.tolerance": "all",
        "errors.deadletterqueue.topic.name": "dlq.sink.inventory",
        "errors.deadletterqueue.context.headers.enable": "true",
        "errors.log.enable": "true",
        "errors.log.include.messages": "true"
      }
    }
  2. Create a consumer for DLQ monitoring:
    kafka-console-consumer --bootstrap-server localhost:9092 \
      --topic dlq.sink.inventory --from-beginning \
      --property print.headers=true \
      --property print.timestamp=true

Execute

  1. Method 1: Inject incompatible data type
    # Assuming sink expects integer but source sends string
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "INSERT INTO public.problematic_table (id, age) VALUES (999, 'not-a-number');"
  2. Method 2: Create missing target table scenario
    # Rename target table to simulate missing destination
    # Note: postgres-sink is the sink database container (port 5433)
    docker exec -it postgres-sink psql -U sink_user -d warehouse -c \
      "ALTER TABLE public.app_customer RENAME TO app_customer_backup;"
  3. Method 3: Insert oversized payload
    # Insert a very large text field that exceeds sink column limit
    # (Assuming sink has VARCHAR(255) limit and this creates a 5000 char string)
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "INSERT INTO public.app_customer (name, email) VALUES (repeat('X', 5000), 'large@example.com');"

Observe & Triage

  1. Check connector status:
    curl -s http://localhost:8083/connectors/jdbc-sink-connector/status | jq '.tasks[].trace'
  2. Inspect DLQ messages:
    # List messages with error context
    kafka-console-consumer --bootstrap-server localhost:9092 \
      --topic dlq.sink.inventory --from-beginning \
      --property print.headers=true \
      --max-messages 5
  3. Identify error pattern: Look for headers like:
    • __connect.errors.topic β€” Original topic
    • __connect.errors.partition β€” Partition number
    • __connect.errors.offset β€” Original offset
    • __connect.errors.exception.class.name β€” Error type
    • __connect.errors.exception.message β€” Error details

Recovery Strategies

Option A: Fix and Replay

  1. Fix the root cause (restore table, fix schema)
  2. Re-drive DLQ messages to original topic
  3. Let connector reprocess
# Use kafka-console-producer with headers
# Or use a DLQ replay tool

Option B: Transform and Retry

  1. Write script to transform bad records
  2. Publish corrected versions to bypass DLQ
  3. Archive original DLQ messages

Validation Checklist

  • DLQ captured poison pill messages
  • Connector remained running (didn't crash-loop)
  • Error headers contain actionable diagnostics
  • Good messages continued processing normally
  • DLQ alerts fired when messages arrived
Production Best Practices
  • Retention: Set DLQ topic retention to 7+ days for triage time
  • Monitoring: Alert when DLQ message count exceeds zero
  • Compaction: Disable log compaction on DLQ topics to preserve all errors
  • Partitioning: Match DLQ partition count to source topic for easier correlation
  • Privacy: Consider PII scrubbing in DLQ error context headers
Drill #3

Schema Drift: Incompatible Column Changes

Duration: 25 min

Learning Goals

  • Understand how schema changes propagate through CDC pipelines
  • Identify breaking vs. non-breaking schema changes
  • Practice schema evolution strategies
  • Test schema registry integration (if applicable)

Setup

  1. Document current schema:
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "\d public.app_customer"
  2. Baseline a known-good record:
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "SELECT * FROM public.app_customer LIMIT 1;"

Execute: Breaking Changes

  1. Scenario A: Drop a column with active data
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "ALTER TABLE public.app_customer DROP COLUMN email;"

    Impact: Sink may fail if it expects email column; Debezium emits event without field

  2. Scenario B: Change column type (incompatible)
    # Change integer to text (potentially lossy)
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "ALTER TABLE public.app_customer ALTER COLUMN id TYPE TEXT USING id::TEXT;"

    Impact: Type mismatch may cause sink errors if target schema is strict

  3. Scenario C: Add NOT NULL constraint on existing nullable column
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "ALTER TABLE public.app_customer ADD COLUMN phone VARCHAR(20) NOT NULL DEFAULT '';"

    Impact: Existing rows get default; new inserts must provide value

Execute: Compatible Changes

  1. Scenario D: Add nullable column (forward-compatible)
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "ALTER TABLE public.app_customer ADD COLUMN loyalty_points INTEGER DEFAULT 0;"

    βœ“ Safe: Sink ignores unknown columns or auto-adds if using schema sync

  2. Scenario E: Increase column width
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "ALTER TABLE public.app_customer ALTER COLUMN name TYPE VARCHAR(500);"

    βœ“ Safe: Widening columns rarely breaks downstream systems

Observe Behavior

Change Type Connector Status Message Schema Sink Behavior
Drop column
Type change
Add NOT NULL
Add nullable

Recovery Patterns

When Schema Drift Breaks the Pipeline

  1. Stop ingestion: Pause connector to prevent cascading failures
  2. Assess impact: Check how many messages are affected
  3. Choose strategy:
    • Rollback: Revert schema change if caught early
    • Forward fix: Align sink schema, restart connector
    • Transform: Add SMT (Single Message Transform) to handle migration
  4. Validate: Confirm historical data integrity after fix

Validation Checklist

  • Documented which changes are safe vs. breaking
  • Tested sink resilience to unknown fields
  • Verified DLQ captures schema mismatches
  • Confirmed schema registry versioning (if using)
  • Created runbook for schema change approval process
Schema Evolution Guidelines

Safe changes (forward-compatible):

  • Add nullable columns with defaults
  • Widen column types (VARCHAR(50) β†’ VARCHAR(200))
  • Add indexes (no CDC impact)
  • Add optional columns to sink (if sink auto-adapts)

Risky changes (require coordination):

  • Drop columns (data loss; sink may error)
  • Rename columns (appears as drop + add)
  • Change types (requires transformation logic)
  • Add NOT NULL without defaults on high-volume tables

Best practice: Use a schema change approval process with CDC impact assessment before DDL execution.

Drill #4

Offset Wipe & Replay: Snapshot + Stream Behavior

Duration: 30 min

Learning Goals

  • Understand how CDC connectors manage offsets and state
  • Observe snapshot vs. streaming mode behavior
  • Practice safe offset resets and replays
  • Validate idempotency of sink writes during replays

Setup

  1. Ensure idempotent sink: Verify sink uses UPSERT/MERGE, not INSERT
    # Check sink connector config
    curl -s http://localhost:8083/connectors/jdbc-sink-connector/config | \
      jq '.["insert.mode"]'  # should be "upsert"
  2. Backup current state:
    # Export connector offsets (Kafka Connect stores in internal topic)
    kafka-console-consumer --bootstrap-server localhost:9092 \
      --topic my_connect_offsets --from-beginning --timeout-ms 5000 > offsets-backup.json
  3. Record sink row count:
    # Note: postgres-sink is the sink database container (port 5433)
    docker exec -it postgres-sink psql -U sink_user -d warehouse -c \
      "SELECT COUNT(*) FROM public.app_customer;"

Execute: Offset Reset

  1. Stop connector gracefully:
    curl -X DELETE http://localhost:8083/connectors/inventory-connector
    curl -X DELETE http://localhost:8083/connectors/jdbc-sink-connector
  2. Wipe connector offsets:
    # Delete Postgres replication slot (forces new snapshot)
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "SELECT pg_drop_replication_slot('debezium');"
    
    # Delete Connect offsets for this connector
    # (In production, use offset management APIs; for lab, recreate with new name)
  3. Re-register source connector with fresh snapshot:
    curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
      "name": "inventory-connector-v2",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "snapshot.mode": "initial",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "start_data_engineer",
        "database.password": "password",
        "database.dbname": "inventory",
        "topic.prefix": "server2",
        "table.include.list": "public.app_customer",
        "slot.name": "debezium_v2"
      }
    }'
  4. Re-register sink connector to consume from new topic:
    curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
      "name": "jdbc-sink-connector-v2",
      "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "tasks.max": "1",
        "topics": "server2.public.app_customer",
        "connection.url": "jdbc:postgresql://postgres-sink:5432/warehouse",
        "connection.user": "sink_user",
        "connection.password": "password",
        "auto.create": "false",
        "auto.evolve": "true",
        "insert.mode": "upsert",
        "pk.mode": "record_key",
        "pk.fields": "id",
        "table.name.format": "app_customer",
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "transforms.unwrap.drop.tombstones": "false",
        "transforms.unwrap.delete.handling.mode": "rewrite"
      }
    }'

    Note: The sink connector now consumes from server2.public.app_customer (matching the new source connector's topic). The insert.mode=upsert ensures idempotent writes during replay validation.

Observe Snapshot Phase

  1. Watch connector task state:
    watch -n 2 "curl -s http://localhost:8083/connectors/inventory-connector-v2/status | jq '.tasks[0].state'"
  2. Monitor snapshot progress:
    # Check connector logs for snapshot messages
    docker logs -f connect 2>&1 | grep -i snapshot
  3. Count messages during snapshot:
    kafka-console-consumer --bootstrap-server localhost:9092 \
      --topic server2.public.app_customer --from-beginning \
      --timeout-ms 10000 | wc -l

Observe Streaming Phase

  1. Make a live change:
    docker exec -it postgres psql -U start_data_engineer -d inventory -c \
      "UPDATE public.app_customer SET name = 'Updated After Snapshot' WHERE id = 1;"
  2. Verify streaming event arrives:
    kafka-console-consumer --bootstrap-server localhost:9092 \
      --topic server2.public.app_customer \
      --offset latest --partition 0 --timeout-ms 5000

    Look for "op":"u" in the JSON payload for update operation

Validate Idempotency

Check Expected Result Actual Result
Sink row count after replay Same as before (no duplicates)
Duplicate PK query SELECT COUNT(*), COUNT(DISTINCT id) β†’ equal
Latest values preserved Most recent updates not overwritten by snapshot

Duplicate Detection Query

-- Run on sink database
SELECT id, COUNT(*) AS occurrences
FROM public.app_customer
GROUP BY id
HAVING COUNT(*) > 1;

Expected: Zero rows (no duplicates)

Validation Checklist

  • Connector completed snapshot phase (check logs)
  • Streaming mode activated after snapshot
  • Sink has no duplicate primary keys
  • Latest updates preserved (not overwritten by snapshot data)
  • Offset reset procedure documented in runbook
Production Offset Management

When to reset offsets:

  • Sink data corruption requires full reload
  • Connector offset corruption (rare)
  • Major schema migration requiring resnapshot
  • Testing disaster recovery procedures

Safeguards:

  • Backup: Always export offsets before deletion
  • Downtime window: Coordinate with stakeholders
  • Idempotency: Ensure sink writes are idempotent (UPSERT/MERGE)
  • Validation: Run duplicate-PK queries post-replay
  • Monitoring: Watch lag and error rates during replay

Alternatives to full reset:

  • Partial offset rewind (set offset to specific timestamp/LSN)
  • Incremental resync (backfill only affected partitions)
  • Shadow sync (parallel connector for validation before cutover)

Drill Progression & Mastery

Track your troubleshooting readiness across all four drills:

Drill First Run Independent Repeat Production Ready
Backpressure (Sink Shutdown)
DLQ (Serialization Errors)
Schema Drift
Offset Wipe & Replay

Mastery criteria: Execute each drill independently, document observations, and create runbooks without referring to this guide.

Next Steps

  • Create runbooks: Document your observations and recovery procedures for your team
  • Automate checks: Script the validation queries for CI/CD integration
  • Schedule regular drills: Run quarterly to maintain muscle memory
  • Expand scenarios: Add org-specific failure modes (network partitions, cloud provider outages)
  • Share learnings: Present findings to broader team to build collective knowledge

Safety Reminders

  • πŸ”’ Never run drills in production without formal chaos engineering approval
  • πŸ“‹ Document everything: Screenshots, metrics, timestamps, error messages
  • πŸ‘₯ Pair up: Run drills with a colleague to compare observations
  • ⏱️ Set time limits: Don't let exploration derail planned work
  • πŸ”„ Clean up thoroughly: Restore lab to known-good state after each drill