Intermediate

Ops Playbook: Offsets & Replays

Keep change streams healthy by protecting offset stores, practicing safe rewinds, and rehearsing replay drills.

Offset operations readiness tracker

Watch overall rewind preparedness climb as you complete each operational checklist.

0 of 0 readiness checks complete (0%)

    Guardrail checklist

    • Offsets persist to durable storage separate from the connector host.
    • Backups exist for the last seven days of offset snapshots.
    • Runbooks document how to pause, rewind, and resume each connector.
    • DLQ triage flow is rehearsed quarterly.
    • Access to offset stores is restricted to on-call engineers (MFA enforced).

    Offset storage patterns

    Treat offsets as crown jewels. They decide whether a replay is a clean rewind or a week-long dedupe exercise.

    Choose durable storage intentionally
    Pipeline Primary store Secondary copy Notes
    Kafka Connect __consumer_offsets (internal topic) MirrorMaker stream to DR cluster Enable offset.flush.interval.ms < 5000 for tighter RPO.
    Debezium Server File-based or JDBC offset storage Nightly object storage snapshot Encrypt backups; restore drill should be documented.
    Custom sink Table in target warehouse Application DB with strict RBAC Persist last processed ts_ms per partition with checksum.

    Store offsets in a system that survives connector hosts. Local disk is a single point of failure.

    Safe rewind procedure

    1. Pause ingestion and confirm downstream consumers are idle.
    2. Restore the offset snapshot you want to resume from.
    3. Validate that the target system can handle replays idempotently.
    4. Resume the connector and watch lag + error metrics.
    5. Document the incident: why rewind happened, what data was affected.

    Never delete offsets ad hoc. Instead, copy them to a safe location and prove your recovery path in staging first.

    Replay drills

    Offset backup commands

    Before attempting any offset manipulation, back up the current state. The offset storage topic name is configured via OFFSET_STORAGE_TOPIC in Kafka Connect (defaults to connect-offsets in the Debezium image).

    # Discover your offset topic name from the Connect worker config
    # Check the container environment variable
    docker compose exec connect env | grep OFFSET_STORAGE_TOPIC
    
    # Or list all Kafka Connect internal topics
    kafka-topics --bootstrap-server localhost:9092 --list | grep connect
    
    # Dump all offsets to a JSON file
    # IMPORTANT: Replace 'connect-offsets' with your actual topic name if different
    # IMPORTANT: Replace 'localhost:9092' with your Kafka broker address
    kafka-console-consumer --bootstrap-server localhost:9092 \
      --topic connect-offsets \
      --from-beginning --timeout-ms 5000 \
      --property print.key=true \
      --property print.offset=true \
      --property print.timestamp=true \
      > offsets-backup-$(date +%Y%m%d-%H%M%S).json
    
    # Verify backup contains data
    wc -l offsets-backup-*.json
    tail -5 offsets-backup-*.json

    Topic name matters: The standard topic name is connect-offsets, which is used by both the main compose.yaml and the lab tutorial. Always verify the topic name in your deployment's Connect configuration before dumping offsets to avoid an empty backup.

    Quarterly dry run
    • Pick a connector and simulate a failure that requires rewind.
    • Time each step from detection to recovery.
    • Capture deltas between the replayed data set and expected outcome.
    Game day metrics
    • Record lag peak, throughput dip, and recovery duration.
    • Update runbooks with observed bottlenecks or manual fixes.

    Monitoring & alerting

    • Lag & throughput: Alert when lag growth exceeds the agreed budget or throughput drops suddenly.
    • Offset staleness: Detect connectors that have not committed offsets in the last five minutes.
    • DLQ spikes: Track dead-letter events per partition to spot poison pills early.

    Automate the boring, dangerous steps

    Codify replay drills so humans do not improvise under pressure.

    #!/usr/bin/env bash
    set -euo pipefail
    
    CONNECTOR_NAME="$1"
    TARGET_OFFSET_SNAPSHOT="$2"
    
    echo "Pausing $CONNECTOR_NAME"
    curl -X PUT "${CONNECT_URL}/connectors/${CONNECTOR_NAME}/pause"
    
    aws s3 cp "s3://cdc-offsets/${CONNECTOR_NAME}/${TARGET_OFFSET_SNAPSHOT}" ./offsets.json
    
    psql $OFFSET_STORE -c "\copy connect_offsets FROM './offsets.json' WITH (FORMAT json)"
    
    curl -X POST "${CONNECT_URL}/connectors/${CONNECTOR_NAME}/resume"
    

    Keep scripts idempotent—rerunning should yield the same offset state. Track revisions in version control.

    Quarterly offset audits

    1. Verify snapshots exist for every critical connector and restore one into staging.
    2. Reconcile stored offsets against the target system watermark to catch drift.
    3. Document gaps, update runbooks, and assign owners for remediation tasks.

    Pair the audit with a tabletop exercise covering simultaneous connector failure and downstream replay to keep teams confident under stress.

    Offset operations readiness scorecard

    Confirm the operations team can rewind confidently at 2 a.m. by checking each line item before you certify a connector as production-ready.

    0 of 4 ready (0%)

    Operational maturity checklist
    Area Ready when… If not, next step
    Primary and secondary offset stores are replicated and restore drills succeed within RTO. Set up replication or snapshots, automate restore verification, and document expected recovery times.
    Pause/rewind/resume steps are scripted with parameterized inputs and owner contact info. Convert manual docs into scripts, add sanity checks, and capture owner rotation details.
    Least-privilege access exists for offset stores with MFA and break-glass procedures. Implement RBAC, rotate credentials, and rehearse break-glass approval flow with security.
    Latest quarterly drill has a postmortem with metrics (lag delta, replay duration, data diffs) and closed actions. Schedule a rehearsal, capture metrics template, and assign follow-up owners before certifying the pipeline.

    Offsets & Replays Knowledge Check

    Test your understanding of offset management and safe replay procedures.

    Q1

    What is a consumer offset in Kafka-based CDC?

    Q2

    Why is it dangerous to reset offsets to an arbitrary position without proper safeguards?

    Q3

    What is the purpose of log retention in Kafka?

    Q4

    When should you consider replaying CDC events?

    Q5

    What is idempotency, and why is it critical for safe replays?

    0/5 correct

    Further resources

    Progress 0% No progress yet
    Progress is stored locally in this browser.