Intermediate

Ops Playbook: Offsets & Replays

Keep change streams healthy by protecting offset stores, practicing safe rewinds, and rehearsing replay drills.

View Ops Checklist Automate Replays

Offset operations readiness tracker

Watch overall rewind preparedness climb as you complete each operational checklist.

0 of 0 readiness checks complete (0%)

Guardrail checklist

Offsets persist to durable storage separate from the connector host.
Backups exist for the last seven days of offset snapshots.
Runbooks document how to pause, rewind, and resume each connector.
DLQ triage flow is rehearsed quarterly.
Access to offset stores is restricted to on-call engineers (MFA enforced).

Offset storage patterns

Treat offsets as crown jewels. They decide whether a replay is a clean rewind or a week-long dedupe exercise.

Choose durable storage intentionally
Pipeline	Primary store	Secondary copy	Notes
Kafka Connect	`__consumer_offsets` (internal topic)	MirrorMaker stream to DR cluster	Enable `offset.flush.interval.ms` < 5000 for tighter RPO.
Debezium Server	File-based or JDBC offset storage	Nightly object storage snapshot	Encrypt backups; restore drill should be documented.
Custom sink	Table in target warehouse	Application DB with strict RBAC	Persist last processed `ts_ms` per partition with checksum.

Store offsets in a system that survives connector hosts. Local disk is a single point of failure.

Safe rewind procedure

Pause ingestion and confirm downstream consumers are idle.
Restore the offset snapshot you want to resume from.
Validate that the target system can handle replays idempotently.
Resume the connector and watch lag + error metrics.
Document the incident: why rewind happened, what data was affected.

Never delete offsets ad hoc. Instead, copy them to a safe location and prove your recovery path in staging first.

Replay drills

Offset backup commands

Before attempting any offset manipulation, back up the current state. The offset storage topic name is configured via OFFSET_STORAGE_TOPIC in Kafka Connect (defaults to connect-offsets in the Debezium image).

# Discover your offset topic name from the Connect worker config
# Check the container environment variable
docker compose exec connect env | grep OFFSET_STORAGE_TOPIC

# Or list all Kafka Connect internal topics
kafka-topics --bootstrap-server localhost:9092 --list | grep connect

# Dump all offsets to a JSON file
# IMPORTANT: Replace 'connect-offsets' with your actual topic name if different
# IMPORTANT: Replace 'localhost:9092' with your Kafka broker address
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic connect-offsets \
  --from-beginning --timeout-ms 5000 \
  --property print.key=true \
  --property print.offset=true \
  --property print.timestamp=true \
  > offsets-backup-$(date +%Y%m%d-%H%M%S).json

# Verify backup contains data
wc -l offsets-backup-*.json
tail -5 offsets-backup-*.json

Topic name matters: The standard topic name is connect-offsets, which is used by both the main compose.yaml and the lab tutorial. Always verify the topic name in your deployment's Connect configuration before dumping offsets to avoid an empty backup.

Quarterly dry run

Pick a connector and simulate a failure that requires rewind.
Time each step from detection to recovery.
Capture deltas between the replayed data set and expected outcome.

Game day metrics

Record lag peak, throughput dip, and recovery duration.
Update runbooks with observed bottlenecks or manual fixes.

Monitoring & alerting

Lag & throughput: Alert when lag growth exceeds the agreed budget or throughput drops suddenly.
Offset staleness: Detect connectors that have not committed offsets in the last five minutes.
DLQ spikes: Track dead-letter events per partition to spot poison pills early.

Automate the boring, dangerous steps

Codify replay drills so humans do not improvise under pressure.

#!/usr/bin/env bash
set -euo pipefail

CONNECTOR_NAME="$1"
TARGET_OFFSET_SNAPSHOT="$2"

echo "Pausing $CONNECTOR_NAME"
curl -X PUT "${CONNECT_URL}/connectors/${CONNECTOR_NAME}/pause"

aws s3 cp "s3://cdc-offsets/${CONNECTOR_NAME}/${TARGET_OFFSET_SNAPSHOT}" ./offsets.json

psql $OFFSET_STORE -c "\copy connect_offsets FROM './offsets.json' WITH (FORMAT json)"

curl -X POST "${CONNECT_URL}/connectors/${CONNECTOR_NAME}/resume"

Keep scripts idempotent—rerunning should yield the same offset state. Track revisions in version control.

Quarterly offset audits

Verify snapshots exist for every critical connector and restore one into staging.
Reconcile stored offsets against the target system watermark to catch drift.
Document gaps, update runbooks, and assign owners for remediation tasks.

Pair the audit with a tabletop exercise covering simultaneous connector failure and downstream replay to keep teams confident under stress.

Offset operations readiness scorecard

Confirm the operations team can rewind confidently at 2 a.m. by checking each line item before you certify a connector as production-ready.

0 of 4 ready (0%)

Operational maturity checklist
Area	Ready when…	If not, next step
Offset durability	Primary and secondary offset stores are replicated and restore drills succeed within RTO.	Set up replication or snapshots, automate restore verification, and document expected recovery times.
Runbook quality	Pause/rewind/resume steps are scripted with parameterized inputs and owner contact info.	Convert manual docs into scripts, add sanity checks, and capture owner rotation details.
Access control	Least-privilege access exists for offset stores with MFA and break-glass procedures.	Implement RBAC, rotate credentials, and rehearse break-glass approval flow with security.
Replay rehearsal	Latest quarterly drill has a postmortem with metrics (lag delta, replay duration, data diffs) and closed actions.	Schedule a rehearsal, capture metrics template, and assign follow-up owners before certifying the pipeline.

Further resources

Reconciliation & Offset Surgery for advanced offset manipulation and sink repair techniques.
Event envelope for understanding replay impacts on payload contracts.
Observability basics to plug audit metrics into dashboards.
Materialization 101 for coordinating offset rewinds with downstream table recovery.

Progress 0% No progress yet

Progress is stored locally in this browser.

Ops Playbook: Offsets & Replays

Offset operations readiness tracker

Guardrail checklist

Offset storage patterns

Safe rewind procedure

Replay drills

Monitoring & alerting

Automate the boring, dangerous steps

Quarterly offset audits

Offset operations readiness scorecard

Offsets & Replays Knowledge Check

What is a consumer offset in Kafka-based CDC?

Why is it dangerous to reset offsets to an arbitrary position without proper safeguards?

What is the purpose of log retention in Kafka?

When should you consider replaying CDC events?

What is idempotency, and why is it critical for safe replays?

Further resources

Let’s Talk CDC Interactive Dashboard