Ops Playbook: Offsets & Replays
Keep change streams healthy by protecting offset stores, practicing safe rewinds, and rehearsing replay drills.
Offset operations readiness tracker
Watch overall rewind preparedness climb as you complete each operational checklist.
0 of 0 readiness checks complete (0%)
No readiness checklists available yet.
Guardrail checklist
- Offsets persist to durable storage separate from the connector host.
- Backups exist for the last seven days of offset snapshots.
- Runbooks document how to pause, rewind, and resume each connector.
- DLQ triage flow is rehearsed quarterly.
- Access to offset stores is restricted to on-call engineers (MFA enforced).
Offset storage patterns
Treat offsets as crown jewels. They decide whether a replay is a clean rewind or a week-long dedupe exercise.
| Pipeline | Primary store | Secondary copy | Notes |
|---|---|---|---|
| Kafka Connect | __consumer_offsets (internal topic) |
MirrorMaker stream to DR cluster | Enable offset.flush.interval.ms < 5000 for tighter RPO. |
| Debezium Server | File-based or JDBC offset storage | Nightly object storage snapshot | Encrypt backups; restore drill should be documented. |
| Custom sink | Table in target warehouse | Application DB with strict RBAC | Persist last processed ts_ms per partition with checksum. |
Store offsets in a system that survives connector hosts. Local disk is a single point of failure.
Safe rewind procedure
- Pause ingestion and confirm downstream consumers are idle.
- Restore the offset snapshot you want to resume from.
- Validate that the target system can handle replays idempotently.
- Resume the connector and watch lag + error metrics.
- Document the incident: why rewind happened, what data was affected.
Never delete offsets ad hoc. Instead, copy them to a safe location and prove your recovery path in staging first.
Replay drills
Offset backup commands
Before attempting any offset manipulation, back up the current state.
The offset storage topic name is configured via OFFSET_STORAGE_TOPIC
in Kafka Connect (defaults to connect-offsets in the Debezium image).
# Discover your offset topic name from the Connect worker config
# Check the container environment variable
docker compose exec connect env | grep OFFSET_STORAGE_TOPIC
# Or list all Kafka Connect internal topics
kafka-topics --bootstrap-server localhost:9092 --list | grep connect
# Dump all offsets to a JSON file
# IMPORTANT: Replace 'connect-offsets' with your actual topic name if different
# IMPORTANT: Replace 'localhost:9092' with your Kafka broker address
kafka-console-consumer --bootstrap-server localhost:9092 \
--topic connect-offsets \
--from-beginning --timeout-ms 5000 \
--property print.key=true \
--property print.offset=true \
--property print.timestamp=true \
> offsets-backup-$(date +%Y%m%d-%H%M%S).json
# Verify backup contains data
wc -l offsets-backup-*.json
tail -5 offsets-backup-*.json
Topic name matters: The standard topic name is
connect-offsets, which is used by both the main
compose.yaml
and the lab tutorial.
Always verify the topic name in your deployment's Connect configuration
before dumping offsets to avoid an empty backup.
Quarterly dry run
- Pick a connector and simulate a failure that requires rewind.
- Time each step from detection to recovery.
- Capture deltas between the replayed data set and expected outcome.
Game day metrics
- Record lag peak, throughput dip, and recovery duration.
- Update runbooks with observed bottlenecks or manual fixes.
Monitoring & alerting
- Lag & throughput: Alert when lag growth exceeds the agreed budget or throughput drops suddenly.
- Offset staleness: Detect connectors that have not committed offsets in the last five minutes.
- DLQ spikes: Track dead-letter events per partition to spot poison pills early.
Automate the boring, dangerous steps
Codify replay drills so humans do not improvise under pressure.
#!/usr/bin/env bash
set -euo pipefail
CONNECTOR_NAME="$1"
TARGET_OFFSET_SNAPSHOT="$2"
echo "Pausing $CONNECTOR_NAME"
curl -X PUT "${CONNECT_URL}/connectors/${CONNECTOR_NAME}/pause"
aws s3 cp "s3://cdc-offsets/${CONNECTOR_NAME}/${TARGET_OFFSET_SNAPSHOT}" ./offsets.json
psql $OFFSET_STORE -c "\copy connect_offsets FROM './offsets.json' WITH (FORMAT json)"
curl -X POST "${CONNECT_URL}/connectors/${CONNECTOR_NAME}/resume"
Keep scripts idempotent—rerunning should yield the same offset state. Track revisions in version control.
Quarterly offset audits
- Verify snapshots exist for every critical connector and restore one into staging.
- Reconcile stored offsets against the target system watermark to catch drift.
- Document gaps, update runbooks, and assign owners for remediation tasks.
Pair the audit with a tabletop exercise covering simultaneous connector failure and downstream replay to keep teams confident under stress.
Offset operations readiness scorecard
Confirm the operations team can rewind confidently at 2 a.m. by checking each line item before you certify a connector as production-ready.
| Area | Ready when… | If not, next step |
|---|---|---|
| Primary and secondary offset stores are replicated and restore drills succeed within RTO. | Set up replication or snapshots, automate restore verification, and document expected recovery times. | |
| Pause/rewind/resume steps are scripted with parameterized inputs and owner contact info. | Convert manual docs into scripts, add sanity checks, and capture owner rotation details. | |
| Least-privilege access exists for offset stores with MFA and break-glass procedures. | Implement RBAC, rotate credentials, and rehearse break-glass approval flow with security. | |
| Latest quarterly drill has a postmortem with metrics (lag delta, replay duration, data diffs) and closed actions. | Schedule a rehearsal, capture metrics template, and assign follow-up owners before certifying the pipeline. |
All capabilities are ready. Toggle to see everything or reset to start over.
Offsets & Replays Knowledge Check
Test your understanding of offset management and safe replay procedures.
What is a consumer offset in Kafka-based CDC?
A consumer offset is a numerical pointer to the last message position a consumer successfully processed in a partition. By storing offsets, Kafka knows where each consumer is in the log, enabling resume after failures, preventing duplicate processing (in most cases), and supporting replays from specific points.
Review the correct answer and explanation.
Why is it dangerous to reset offsets to an arbitrary position without proper safeguards?
Resetting offsets to an earlier position causes reprocessing (potentially creating duplicates if consumers aren't idempotent). Resetting forward skips data, creating gaps. Without careful planning—knowing the log retention, sink state, and idempotency design—offset surgery can corrupt downstream systems.
Review the correct answer and explanation.
What is the purpose of log retention in Kafka?
Log retention defines how long Kafka retains messages before they're deleted (time-based) or compacted (size/key-based). This determines the maximum replay window: if retention is 7 days, you can only replay events from the past 7 days. Proper retention settings are critical for disaster recovery and replay scenarios.
Review the correct answer and explanation.
When should you consider replaying CDC events?
Replays are used to recover from failures (e.g., sink crashed and lost data), fix bugs (reprocess with corrected logic), test changes (validate a new schema or transformation), or rebuild a new downstream system from historical changes. Replays must be coordinated carefully to avoid inconsistencies.
Review the correct answer and explanation.
What is idempotency, and why is it critical for safe replays?
Idempotent operations produce the same outcome when applied multiple times. In CDC, if a replay causes duplicate processing (which is common), idempotent sinks (using UPSERT, natural keys, or deduplication logic) ensure replays don't corrupt data. Without idempotency, replays can create duplicate rows or incorrect state.
Review the correct answer and explanation.
Further resources
- Reconciliation & Offset Surgery for advanced offset manipulation and sink repair techniques.
- Event envelope for understanding replay impacts on payload contracts.
- Observability basics to plug audit metrics into dashboards.
- Materialization 101 for coordinating offset rewinds with downstream table recovery.