Observability Basics

Golden signals for CDC

Borrow the SRE playbook—latency, traffic, errors, and saturation—but map each to a CDC-specific metric.

Lag is your north star
Signal	CDC metric	Why it matters
Latency	End-to-end lag (source commit to sink ingest).	Shows whether consumers see fresh data.
Traffic	Events per second per connector.	Highlights surges or drops in change volume.
Errors	Failed batches / DLQ counts.	Detects poison pills or downstream outages.
Saturation	Connector CPU, thread pool usage, or sink credits consumed.	Warns you before throttling kicks in.

Instrument producers and consumers with a shared tagging scheme (connector, source, target) so dashboards can drill from fleet-wide to single-table views without manual filtering.
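As a rough illustration of the shared tagging scheme, the Python sketch below attaches the same three tags (connector, source, target) to every lag sample and rolls the samples up by any one of them. The `LagSample` type, the helper names, and the example connector and table names are hypothetical; in a real pipeline these tags would typically be emitted as metric labels by whatever metrics client you use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class LagSample:
    """One lag measurement carrying the shared tags (hypothetical type)."""
    connector: str
    source: str
    target: str
    lag_seconds: float


def end_to_end_lag(source_commit: datetime, sink_ingest: datetime) -> float:
    """Seconds from source commit to sink ingest, clamped at 0 to absorb clock skew."""
    return max(0.0, (sink_ingest - source_commit).total_seconds())


def max_lag_by(samples, tag):
    """Worst lag per value of one tag: 'connector' for a fleet view, 'source' for a table view."""
    worst = {}
    for s in samples:
        key = getattr(s, tag)
        worst[key] = max(worst.get(key, 0.0), s.lag_seconds)
    return worst


# Example data: names and timestamps are made up for illustration.
commit = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
samples = [
    LagSample("pg-orders", "public.orders", "warehouse",
              end_to_end_lag(commit, commit + timedelta(seconds=42))),
    LagSample("pg-orders", "public.customers", "warehouse",
              end_to_end_lag(commit, commit + timedelta(seconds=7))),
]
```

Because every sample carries the same tag set, `max_lag_by(samples, "connector")` and `max_lag_by(samples, "source")` answer the fleet-wide and single-table questions from the same data without extra filtering.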

Dashboards that reduce pager noise

Connector overview. Lag histogram, throughput trend, restart count, and DLQ volume in one view.
Source health. Log retention, replication slot usage, and database write rates, so you can gauge backpressure risk.
Sink success. Merge duration, rows touched, and error codes broken down by target system.

Link each graph to the precise runbook step. On-call analysts should not have to guess which knob to turn.

Sample Grafana Dashboard

We've created a production-ready Grafana dashboard that includes all the golden signals for CDC monitoring. The dashboard features 8 panels covering lag, throughput, errors, connector status, DLQ volume, task health, restarts, and batch processing time.

What's included in the sample dashboard

Replication lag gauge — Visual indicator with green/yellow/red thresholds
Throughput chart — Source and sink records per second over time
Error rate graph — Track failed batches and error logs
Connector status panel — At-a-glance health check for all connectors
DLQ volume tracker — Monitor dead letter queue growth
Task running ratio — Detect saturation before it impacts throughput
Restart counter — Identify unstable connectors
Batch timing — Track processing performance trends

Alert policy that respects sleep

Primary alerts

Lag exceeds threshold for N minutes.
No offset commits in the last 5 minutes while events continue arriving.
DLQ growth rate breaches baseline.
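The three primary conditions can be expressed as small predicates. The Python sketch below is one possible reading, not the rule set shipped with the alert pack: the function names are hypothetical, lag is assumed to be sampled once per minute, and the 3x DLQ multiplier is an assumed default borrowed from the sample alert list.

```python
def lag_sustained(lag_per_minute, threshold_s, for_minutes):
    """True when each of the last `for_minutes` one-minute samples exceeds the threshold."""
    if for_minutes <= 0 or len(lag_per_minute) < for_minutes:
        return False
    return all(lag > threshold_s for lag in lag_per_minute[-for_minutes:])


def commits_stalled(seconds_since_last_commit, events_arrived_in_window, window_s=300):
    """True when no offset commit landed in the window although events kept arriving."""
    return seconds_since_last_commit >= window_s and events_arrived_in_window > 0


def dlq_growth_breached(current_per_min, baseline_per_min, multiplier=3.0):
    """True when DLQ growth exceeds a multiple of baseline (3x is an assumed default)."""
    return current_per_min > baseline_per_min * multiplier
```

Requiring the breach to hold for N consecutive samples is what keeps a single slow batch from paging anyone, and pairing the missing-commit check with "events still arriving" avoids alerting on a source that is simply idle.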

Warning alerts

Source log retention below replay budget.
Connector restart count above expected range.

Route primary alerts to the on-call rotation and copy product stakeholders only for customer-visible impact.

Sample Alert Configurations

We provide production-ready Prometheus alert rules that implement all the recommendations above. The alert pack contains 15 alerts across three categories, including:

Primary alerts (critical severity) — Wake up on-call for immediate action:
- High replication lag (> 5 minutes)
- No offset commits with active events
- DLQ volume spike (3x baseline)
- Connector not running
- High error rate
Warning alerts — Follow up during business hours:
- Source log retention low
- Excessive connector restarts
- Low throughput (90% below baseline)
- Task saturation (running ratio < 70%)
- Batch processing slowing
SLO-based alerts — Track service-level objectives:
- Freshness SLO breach (P99 lag > 5 min)
- Availability SLO breach (< 99% uptime)
- Data completeness at risk

Each alert includes detailed annotations with impact assessment, runbook links, and remediation steps.

Routing Alerts to External Systems

While Prometheus alerts appear in the Prometheus UI, you'll want to route them to external systems for real-world operations. We provide an Alertmanager configuration that adds support for:

Slack — Send warning alerts to dedicated channels
PagerDuty — Wake up on-call for critical issues
Email — Notify stakeholders via SMTP
Custom webhooks — Integrate with any system via HTTP

The Alertmanager configuration includes production-ready features:

Alert grouping by connector and severity to reduce noise
Intelligent inhibition rules (e.g., connector down suppresses lag alerts)
Rate limiting with smart repeat intervals (1h for critical, 12h for warnings)
Maintenance windows for planned downtime
Deduplication to prevent alert storms
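To make grouping and inhibition concrete, here is a minimal Python sketch of the logic, assuming alerts are plain dictionaries with `name`, `connector`, and `severity` keys. The alert names `ConnectorDown` and `HighReplicationLag` are hypothetical, and this models the behaviour only; in the actual stack Alertmanager applies these rules from its own configuration.

```python
from collections import defaultdict


def inhibit(alerts, source_name="ConnectorDown", target_name="HighReplicationLag"):
    """Drop target alerts whose connector also has a firing source alert."""
    down = {a["connector"] for a in alerts if a["name"] == source_name}
    return [a for a in alerts
            if not (a["name"] == target_name and a["connector"] in down)]


def group(alerts):
    """Bucket alert names by (connector, severity) so each bucket yields one notification."""
    buckets = defaultdict(list)
    for a in alerts:
        buckets[(a["connector"], a["severity"])].append(a["name"])
    return dict(buckets)


# Example input: connector names are made up for illustration.
firing = [
    {"name": "ConnectorDown", "connector": "pg-orders", "severity": "critical"},
    {"name": "HighReplicationLag", "connector": "pg-orders", "severity": "critical"},
    {"name": "HighReplicationLag", "connector": "pg-users", "severity": "critical"},
]
notifications = group(inhibit(firing))
```

Here the lag alert for `pg-orders` is suppressed because the connector is already reported as down, while the lag alert for `pg-users` still goes out.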

Cloud-managed alerting alternatives

For production deployments, consider using managed services instead of self-hosting Alertmanager:

Grafana Cloud Alerting — Built-in integrations, mobile app, escalation chains
AWS CloudWatch — Native AWS integration via Prometheus remote_write
Datadog — APM tracing, log correlation, incident management
PagerDuty Event Intelligence — ML-based noise reduction, on-call management
Opsgenie — Advanced routing, incident response, status pages

Set service-level objectives

Freshness SLO: 99% of events land in the warehouse within 5 minutes of source commit.
Availability SLO: Change stream downtime < 15 minutes per 30-day window.
Data completeness: 99.99% of source transactions represented in the target within 1 hour.
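As a small worked example of the freshness objective, the Python sketch below computes the fraction of events that landed within the 5-minute limit and compares it to the 99% target. The function names are hypothetical, and treating an empty window as compliant is an assumption you may want to change.

```python
def freshness_compliance(lags_seconds, limit_s=300.0):
    """Fraction of events whose end-to-end lag is within the limit."""
    if not lags_seconds:
        return 1.0  # assumption: no events means nothing was late
    on_time = sum(1 for lag in lags_seconds if lag <= limit_s)
    return on_time / len(lags_seconds)


def meets_slo(compliance, target=0.99):
    """True when the measured compliance reaches the SLO target."""
    return compliance >= target
```

For instance, 99 events at 10 seconds of lag plus one at 600 seconds gives a compliance of 0.99, which just meets the target; a second late event would breach it.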

Track error budgets alongside incident counts. When the budget is nearly spent, freeze risky changes and prioritize stability work.
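Error budgets follow directly from the SLO targets. The Python sketch below shows one way to compute the remaining budget, assuming the 15-minute downtime allowance and the 99% freshness target stated above. The helper names and the 10% freeze threshold are assumptions for illustration, not a prescribed policy.

```python
def availability_budget_left(downtime_min, budget_min=15.0):
    """Fraction of the downtime budget still unspent in the current 30-day window."""
    return max(0.0, 1.0 - downtime_min / budget_min)


def freshness_budget_left(total_events, late_events, target=0.99):
    """Fraction of the allowed late events still unspent."""
    allowed_late = total_events * (1.0 - target)
    if allowed_late <= 0:
        return 0.0 if late_events else 1.0
    return max(0.0, 1.0 - late_events / allowed_late)


def should_freeze(budget_left, floor=0.10):
    """Assumed policy: freeze risky changes once less than 10% of the budget remains."""
    return budget_left < floor
```

With 6 minutes of downtime, 60% of the availability budget remains; at 14 minutes less than 7% is left, which under the assumed threshold would trigger a change freeze.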

Tracing & logs

Enrich logs with correlation identifiers: connector name, partition, offset, and source transaction id. Forward them to a central log store that keeps at least the last 30 days searchable.

For custom processors, propagate trace headers from ingestion through to sinks so you can follow a change end-to-end.

Emit structured logs (JSON) so search queries do not rely on fragile parsing.
Tag spans with op (`c`, `u`, `d`) to correlate deletes with downstream compaction.
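A minimal sketch of structured logging with the correlation identifiers listed above, using only the Python standard library. The field names (`connector`, `partition`, `offset`, `source_txid`, `op`) and the logger name are illustrative choices, not a fixed schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with CDC correlation fields."""

    # Field names are illustrative; 'op' carries the c/u/d operation code.
    CORRELATION_FIELDS = ("connector", "partition", "offset", "source_txid", "op")

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in self.CORRELATION_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload, sort_keys=True)


def configure_logger(name="cdc.processor"):
    """Attach a stream handler that emits JSON lines (logger name is hypothetical)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A processor would then log with `logger.info("applied change", extra={"connector": "pg-orders", "partition": 3, "offset": 1042, "source_txid": 778, "op": "d"})`, and the log store can filter on those keys directly instead of parsing free text.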

Complete monitoring stack

Ready to see it all working together? We provide a complete Docker Compose stack that includes Kafka Connect with JMX metrics enabled, Prometheus for collection, and Grafana with pre-configured dashboards.

What's in the monitoring stack

Kafka Connect with JMX metrics exposed
JMX Exporter to bridge JMX metrics to Prometheus format
Prometheus with our alert rules pre-loaded
Grafana with the CDC dashboard automatically provisioned
Alertmanager (optional) for routing alerts to Slack, PagerDuty, email
PostgreSQL source database (Debezium example)
kcat for Kafka CLI operations

Getting started in 5 minutes

Download the docker-compose file and supporting configs
Run docker-compose up -d (add -f docker-compose-observability.yml if the compose file keeps its downloaded name)
Create a test connector via the REST API
Open Grafana at http://localhost:3000 (admin/admin)
Watch metrics flow in real time
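For the connector-creation step, the Python sketch below builds the Kafka Connect REST request for a Debezium PostgreSQL source. Everything environment-specific here is an assumption: the Connect URL (8083 is Kafka Connect's default REST port, and the stack may map a different one), the connector name, the host name, the credentials, and the database name. Check the setup guide for the values the stack actually uses.

```python
import json
import urllib.request


def build_create_connector_request(connect_url, name, config):
    """Build (but do not send) a POST /connectors request for Kafka Connect."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url.rstrip('/')}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: host name, credentials, and database name are assumptions.
test_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "postgres",
    "topic.prefix": "demo",  # Debezium 2.x; 1.x releases use database.server.name
    "plugin.name": "pgoutput",
}

request = build_create_connector_request(
    "http://localhost:8083", "test-connector", test_config
)
```

Passing `request` to `urllib.request.urlopen` sends it. Building the request separately from sending it keeps the payload easy to inspect before it reaches the Connect worker.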

Want external alerting? Add the Alertmanager extension to route alerts to Slack, PagerDuty, or email. See the Alert Policy section above for details.

The stack comes with everything wired up: datasources, dashboards, alert rules, and JMX metric mappings. No manual configuration required.

Files included in the monitoring stack

docker-compose-observability.yml — Main orchestration file
docker-compose.alerts.yml — Optional Alertmanager extension
prometheus.yml — Scrape configuration and targets
prometheus-with-alertmanager.yml — Prometheus config with Alertmanager enabled
prometheus-alerts.yml — Alert rule definitions
alertmanager.yml — Alert routing and receiver configuration
jmx-exporter-config.yml — JMX to Prometheus metric mappings
grafana-datasources.yml — Prometheus datasource config
grafana-dashboards.yml — Dashboard provisioning
grafana-kafka-connect-dashboard.json — Pre-built dashboard
OBSERVABILITY-SETUP.md — Complete setup guide

The monitoring stack is perfect for local development, testing alert rules, and demonstrating CDC observability to stakeholders. For production deployment, see the setup guide for scaling and security considerations.

Runbook hygiene

Keep runbooks in version control; update them after every incident.
Link dashboards, alert names, and remediation scripts directly in the doc.
Review and rehearse quarterly with the on-call crew.

Post-incident reviews that stick

Capture customer impact, timeline, and contributing factors within 24 hours.
Assign action items that tie back to SLO breaches (instrumentation gaps, runbook fixes).
Close the loop by demoing fixes during the next reliability review.

Sharing the review with data consumers builds trust and keeps shadow monitoring from proliferating across teams.

Observability readiness scorecard

Use this scorecard during production readiness reviews to ensure monitoring covers the entire CDC surface area—from connectors to downstream consumers.

Monitoring maturity checkpoints
Dimension	Ready when…	If not, shore it up by…
Golden signals	Lag, throughput, error rate, and saturation dashboards exist with ownership and SLO targets.	Add missing charts, wire alerts to on-call, and document runbook links beside each metric.
Alert routing	Pager policies differentiate wake-you-up incidents from follow-up tasks with clear escalation paths.	Restructure notification channels, prune noisy alerts, and note escalation contacts in the runbook.
Log & trace depth	Structured logs include correlation ids, connector, and table metadata; traces capture slow sinks.	Instrument connectors to emit correlation ids, enrich logs with metadata, and sample traces during load tests.
Review cadence	Post-incident reviews close on time and backlog items have explicit owners.	Schedule review retrospectives, assign owners in the incident tracker, and follow up during weekly ops syncs.

Further resources

Offsets & replays for the operational drills your alerts should trigger.
Reconciliation & Offset Surgery to detect and repair out-of-sync sinks using drift detection and checksum validation.
Event envelope to align telemetry with payload contracts.
Materialization 101 to ensure observed freshness ties to downstream SLAs.
Changelog & Version Matrix to verify compatible versions for your monitoring stack.
