Observability Basics
Instrument your CDC platform with lag, throughput, error, and saturation signals so incidents are caught before consumers notice.
Golden signals for CDC
Borrow the SRE playbook—latency, traffic, errors, and saturation—but map each to a CDC-specific metric.
| Signal | CDC metric | Why it matters |
|---|---|---|
| Latency | End-to-end lag (source commit to sink ingest). | Shows whether consumers see fresh data. |
| Traffic | Events per second per connector. | Highlights surges or drops in change volume. |
| Errors | Failed batches / DLQ counts. | Detects poison pills or downstream outages. |
| Saturation | Connector CPU, thread pool usage, or sink credits consumed. | Warns you before throttling kicks in. |
Instrument producers and consumers with a shared tagging scheme (connector, source, target) so dashboards can drill from fleet-wide to single-table views without manual filtering.
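As a rough illustration of why the shared tags matter, the sketch below computes end-to-end lag (sink ingest time minus source commit time) and aggregates it at two levels of detail using the same three tags. The `LagSample` type, the connector and table names, and the timestamps are all hypothetical; a real deployment would read these values from its metrics backend rather than from an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class LagSample:
    connector: str    # hypothetical connector name
    source: str       # hypothetical source table
    target: str       # hypothetical sink table
    commit_ts: float  # source commit time, epoch seconds
    ingest_ts: float  # sink ingest time, epoch seconds

    @property
    def lag_seconds(self) -> float:
        # End-to-end lag: source commit to sink ingest.
        return self.ingest_ts - self.commit_ts


def mean_lag_by(samples, *tag_names):
    """Group samples by the given tag names and return the mean lag per group."""
    groups = defaultdict(list)
    for s in samples:
        key = tuple(getattr(s, name) for name in tag_names)
        groups[key].append(s.lag_seconds)
    return {key: mean(values) for key, values in groups.items()}


samples = [
    LagSample("pg-orders", "orders_db.orders", "warehouse.orders", 100.0, 104.0),
    LagSample("pg-orders", "orders_db.orders", "warehouse.orders", 110.0, 116.0),
    LagSample("pg-orders", "orders_db.refunds", "warehouse.refunds", 120.0, 150.0),
    LagSample("pg-users", "users_db.users", "warehouse.users", 130.0, 132.0),
]

# Fleet-wide view: one number per connector.
fleet_view = mean_lag_by(samples, "connector")
# Drill-down view: one number per connector/source/target combination.
table_view = mean_lag_by(samples, "connector", "source", "target")
```

Because both views are derived from the same tag set, the drill-down needs no extra filtering logic: it is the same grouping with more tag names.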
Dashboards that reduce pager noise
- Connector overview. Lag histogram, throughput trend, restart count, and DLQ volume in one view.
- Source health. Log retention, replication slot usage, and database write rates to understand back pressure risk.
- Sink success. Merge duration, rows touched, and error codes broken down by target system.
Link each graph to the precise runbook step. On-call analysts should not have to guess which knob to turn.
Sample Grafana Dashboard
We've created a production-ready Grafana dashboard that includes all the golden signals for CDC monitoring. The dashboard features 8 panels covering lag, throughput, errors, connector status, DLQ volume, task health, restarts, and batch processing time.
What's included in the sample dashboard
- Replication lag gauge — Visual indicator with green/yellow/red thresholds
- Throughput chart — Source and sink records per second over time
- Error rate graph — Track failed batches and error logs
- Connector status panel — At-a-glance health check for all connectors
- DLQ volume tracker — Monitor dead letter queue growth
- Task running ratio — Detect saturation before it impacts throughput
- Restart counter — Identify unstable connectors
- Batch timing — Track processing performance trends
Alert policy that respects sleep
Primary alerts
- Lag exceeds threshold for N minutes.
- No offset commits in last 5 minutes while events continue arriving.
- DLQ growth rate breaches baseline.
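The first two conditions above can be sketched as plain predicates. This is a minimal illustration, assuming one lag sample per minute and hypothetical function names; in practice the same logic would normally live in the alerting system's rule language rather than in application code.

```python
def lag_breached(lag_samples_s, threshold_s, for_minutes):
    """True when each of the last `for_minutes` per-minute lag samples exceeds the threshold.

    Requiring a sustained breach avoids paging on a single slow batch.
    """
    recent = lag_samples_s[-for_minutes:]
    return len(recent) == for_minutes and all(v > threshold_s for v in recent)


def commits_stalled(now_s, last_commit_s, events_in_window, stall_seconds=300):
    """True when events keep arriving but no offset commit happened within stall_seconds.

    The events_in_window check keeps a quiet source from looking like a stuck connector.
    """
    return events_in_window > 0 and (now_s - last_commit_s) > stall_seconds


# Sustained breach: the last five samples are all above 300 seconds.
sustained = lag_breached([100, 320, 340, 360, 400, 410], threshold_s=300, for_minutes=5)
# Brief spike: one sample in the window recovered, so no page.
spike_only = lag_breached([320, 340, 100, 400, 410], threshold_s=300, for_minutes=5)
# Stalled: 400 seconds since the last commit while 50 events arrived.
stalled = commits_stalled(now_s=1000, last_commit_s=600, events_in_window=50)
# Idle: no events arrived, so a missing commit is expected.
idle = commits_stalled(now_s=1000, last_commit_s=600, events_in_window=0)
```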
Warning alerts
- Source log retention below replay budget.
- Connector restart count above expected range.
Route primary alerts to the on-call rotation and copy product stakeholders only for customer-visible impact.
Sample Alert Configurations
We provide production-ready Prometheus alert rules that implement all the recommendations above. The alert pack includes 15 alerts across three categories:
- Primary alerts (critical severity), which wake up on-call for immediate action:
  - High replication lag (> 5 minutes)
  - No offset commits with active events
  - DLQ volume spike (3x baseline)
  - Connector not running
  - High error rate
- Warning alerts, which are followed up during business hours:
  - Source log retention low
  - Excessive connector restarts
  - Low throughput (90% below baseline)
  - Task saturation (running ratio < 70%)
  - Batch processing slowing
- SLO-based alerts, which track service-level objectives:
  - Freshness SLO breach (P99 lag > 5 min)
  - Availability SLO breach (< 99% uptime)
  - Data completeness at risk
Each alert includes detailed annotations with impact assessment, runbook links, and remediation steps.
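Two of the alerts listed above are defined relative to a baseline (DLQ volume at 3x baseline, throughput 90% below baseline). The sketch below shows one way such comparisons could work, assuming the baseline is the median of a trailing window of samples. That choice of baseline and the function names are assumptions for illustration, not a description of how the provided alert rules compute it.

```python
from statistics import median


def baseline(history):
    """Median of a trailing window; less sensitive to one-off spikes than the mean."""
    return median(history)


def dlq_spike(current, history, factor=3.0):
    """True when the current DLQ volume exceeds `factor` times the baseline."""
    return current > factor * baseline(history)


def low_throughput(current, history, floor_ratio=0.10):
    """True when throughput falls below floor_ratio of baseline (0.10 means 90% below)."""
    return current < floor_ratio * baseline(history)


dlq_history = [4, 5, 6]                    # hypothetical DLQ messages per interval
throughput_history = [900, 1000, 1100]     # hypothetical events per second

dlq_alert = dlq_spike(20, dlq_history)                  # 20 > 3 * 5
dlq_quiet = dlq_spike(12, dlq_history)                  # 12 <= 3 * 5
throughput_alert = low_throughput(50, throughput_history)   # 50 < 0.10 * 1000
throughput_ok = low_throughput(500, throughput_history)     # 500 >= 0.10 * 1000
```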
Routing Alerts to External Systems
While Prometheus alerts appear in the Prometheus UI, you'll want to route them to external systems for real-world operations. We provide an Alertmanager configuration that adds support for:
- Slack — Send warning alerts to dedicated channels
- PagerDuty — Wake up on-call for critical issues
- Email — Notify stakeholders via SMTP
- Custom webhooks — Integrate with any system via HTTP
The Alertmanager configuration includes production-ready features:
- Alert grouping by connector and severity to reduce noise
- Intelligent inhibition rules (e.g., connector down suppresses lag alerts)
- Rate limiting with smart repeat intervals (1h for critical, 12h for warnings)
- Maintenance windows for planned downtime
- Deduplication to prevent alert storms
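To make grouping and inhibition concrete, here is a small sketch of the two ideas in plain Python. It does not use or reproduce Alertmanager's configuration format; the alert names (`ConnectorDown`, `HighReplicationLag`, `ExcessiveRestarts`) and connector names are hypothetical.

```python
from collections import defaultdict


def apply_inhibition(alerts):
    """Drop lag alerts for any connector that also has a firing ConnectorDown alert.

    A down connector necessarily lags, so the lag alert adds noise, not information.
    """
    down = {a["connector"] for a in alerts if a["name"] == "ConnectorDown"}
    return [
        a for a in alerts
        if not (a["name"] == "HighReplicationLag" and a["connector"] in down)
    ]


def group_alerts(alerts):
    """Bundle alert names by (connector, severity) so each bundle is one notification."""
    groups = defaultdict(list)
    for a in alerts:
        groups[(a["connector"], a["severity"])].append(a["name"])
    return dict(groups)


firing = [
    {"name": "ConnectorDown", "connector": "pg-orders", "severity": "critical"},
    {"name": "HighReplicationLag", "connector": "pg-orders", "severity": "critical"},
    {"name": "HighReplicationLag", "connector": "pg-users", "severity": "critical"},
    {"name": "ExcessiveRestarts", "connector": "pg-users", "severity": "warning"},
]

remaining = apply_inhibition(firing)
notifications = group_alerts(remaining)
```

In this example the lag alert for `pg-orders` is suppressed because the connector is already reported as down, while the lag alert for `pg-users` still fires.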
Cloud-managed alerting alternatives
For production deployments, consider using managed services instead of self-hosting Alertmanager:
- Grafana Cloud Alerting — Built-in integrations, mobile app, escalation chains
- AWS CloudWatch — Native AWS integration via Prometheus remote_write
- Datadog — APM tracing, log correlation, incident management
- PagerDuty Event Intelligence — ML-based noise reduction, on-call management
- Opsgenie — Advanced routing, incident response, status pages
Set service-level objectives
- Freshness SLO: 99% of events land in the warehouse within 5 minutes of source commit.
- Availability SLO: Change stream downtime < 15 minutes per 30-day window.
- Data completeness: 99.99% of source transactions represented in the target within 1 hour.
Track error budgets alongside incident counts. When the budget is nearly spent, freeze risky changes and prioritize stability work.
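The arithmetic behind these objectives is simple enough to sketch. The example below computes freshness compliance against the 5-minute target and the share of the 15-minute downtime budget already spent. The sample values and the 80% freeze threshold are assumptions chosen for illustration.

```python
WINDOW_DAYS = 30


def freshness_compliance(lags_s, target_s=300.0):
    """Fraction of events whose end-to-end lag is within the target."""
    if not lags_s:
        return 1.0
    return sum(1 for lag in lags_s if lag <= target_s) / len(lags_s)


def availability_budget_spent(downtime_min, budget_min=15.0):
    """Fraction of the downtime budget already consumed (can exceed 1.0)."""
    return downtime_min / budget_min


def implied_availability(budget_min=15.0, window_days=WINDOW_DAYS):
    """Availability ratio implied by a downtime budget over the window."""
    window_min = window_days * 24 * 60  # 43,200 minutes in 30 days
    return 1.0 - budget_min / window_min


lags = [12.0, 45.0, 280.0, 310.0]            # hypothetical per-event lags in seconds
compliance = freshness_compliance(lags)       # 3 of 4 events within 5 minutes
spent = availability_budget_spent(13.0)       # 13 of 15 minutes used
freeze_risky_changes = spent >= 0.8           # hypothetical freeze threshold
```

For reference, 15 minutes of downtime in a 30-day window corresponds to roughly 99.965% availability, which is what `implied_availability()` returns.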
Tracing & logs
Enrich logs with correlation identifiers: connector name, partition, offset, and source transaction ID. Forward them to a central log store that keeps at least the last 30 days searchable.
For custom processors, propagate trace headers from ingestion through to sinks so you can follow a change end-to-end.
- Emit structured logs (JSON) so search queries do not rely on fragile parsing.
- Tag spans with `op` (`c`, `u`, `d`) to correlate deletes with downstream compaction.
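A minimal sketch of structured JSON logging with the correlation fields mentioned above, using only the Python standard library. The field names (`connector`, `partition`, `offset`, `source_tx_id`, `op`), the logger name, and the sample values are assumptions; adapt them to whatever your processors actually emit.

```python
import json
import logging
import sys

# Correlation fields copied from the log call's `extra` dict when present.
CORRELATION_FIELDS = ("connector", "partition", "offset", "source_tx_id", "op")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so log search needs no regex parsing."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in CORRELATION_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload, sort_keys=True)


logger = logging.getLogger("cdc.sink")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Values passed via `extra` become attributes on the record.
logger.info(
    "delete applied to sink",
    extra={
        "connector": "pg-orders",
        "partition": 3,
        "offset": 18244,
        "source_tx_id": "771203",
        "op": "d",
    },
)
```

With every line carrying the same keys, a query such as "all `op` = `d` events for one connector and partition" becomes a field filter instead of a text search.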
Complete monitoring stack
Ready to see it all working together? We provide a complete Docker Compose stack that includes Kafka Connect with JMX metrics enabled, Prometheus for collection, and Grafana with pre-configured dashboards.
What's in the monitoring stack
- Kafka Connect with JMX metrics exposed
- JMX Exporter to bridge JMX metrics to Prometheus format
- Prometheus with our alert rules pre-loaded
- Grafana with the CDC dashboard automatically provisioned
- Alertmanager (optional) for routing alerts to Slack, PagerDuty, email
- PostgreSQL source database (Debezium example)
- kcat for Kafka CLI operations
Getting started in 5 minutes
- Download the docker-compose file and supporting configs
- Run `docker-compose up -d`
- Create a test connector via the REST API
- Open Grafana at http://localhost:3000 (admin/admin)
- Watch metrics flow in real-time
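The "create a test connector" step might look something like the sketch below, which builds a `POST /connectors` request for the Kafka Connect REST API. Everything specific here is an assumption rather than something taken from the provided stack: the REST port (8083 is the Kafka Connect default), the connector name, the database hostname and credentials, and the use of `topic.prefix` (the property name in Debezium 2.x). Check the setup guide for the actual values.

```python
import json
import urllib.request

# Assumption: Kafka Connect REST API is published on its default port.
CONNECT_URL = "http://localhost:8083/connectors"

# Hypothetical Debezium PostgreSQL connector definition; hostnames and
# credentials below are placeholders, not values from the provided stack.
connector = {
    "name": "pg-test-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "postgres",
        "topic.prefix": "cdc",
    },
}


def build_request(url, payload):
    """Build the POST request without sending it, so it can be inspected first."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_connector(request, timeout_s=10):
    """Send the request and return (HTTP status, parsed JSON body)."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status, json.load(response)


request = build_request(CONNECT_URL, connector)
# Once the stack is running, call: create_connector(request)
```

The request is built separately from the call that sends it so the payload can be reviewed, or reused across environments, before anything touches the Connect cluster.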
Want external alerting? Add the Alertmanager extension to route alerts to Slack, PagerDuty, or email. See the Alert Policy section above for details.
The stack comes with everything wired up: datasources, dashboards, alert rules, and JMX metric mappings. No manual configuration required.
Files included in the monitoring stack
- `docker-compose-observability.yml`: Main orchestration file
- `docker-compose.alerts.yml`: Optional Alertmanager extension
- `prometheus.yml`: Scrape configuration and targets
- `prometheus-with-alertmanager.yml`: Prometheus config with Alertmanager enabled
- `prometheus-alerts.yml`: Alert rule definitions
- `alertmanager.yml`: Alert routing and receiver configuration
- `jmx-exporter-config.yml`: JMX to Prometheus metric mappings
- `grafana-datasources.yml`: Prometheus datasource config
- `grafana-dashboards.yml`: Dashboard provisioning
- `grafana-kafka-connect-dashboard.json`: Pre-built dashboard
- `OBSERVABILITY-SETUP.md`: Complete setup guide
The monitoring stack is perfect for local development, testing alert rules, and demonstrating CDC observability to stakeholders. For production deployment, see the setup guide for scaling and security considerations.
Runbook hygiene
- Keep runbooks in version control; update them after every incident.
- Link dashboards, alert names, and remediation scripts directly in the doc.
- Review and rehearse quarterly with the on-call crew.
Post-incident reviews that stick
- Capture customer impact, timeline, and contributing factors within 24 hours.
- Assign action items that tie back to SLO breaches (instrumentation gaps, runbook fixes).
- Close the loop by demoing fixes during the next reliability review.
Sharing the review with data consumers builds trust and keeps shadow monitoring from proliferating across teams.
Observability readiness scorecard
Use this scorecard during production readiness reviews to ensure monitoring covers the entire CDC surface area—from connectors to downstream consumers.
| Dimension | Ready when… | If not, shore it up by… |
|---|---|---|
| | Lag, throughput, error rate, and saturation dashboards exist with ownership and SLO targets. | Add missing charts, wire alerts to on-call, and document runbook links beside each metric. |
| | Pager policies differentiate wake-you-up incidents from follow-up tasks with clear escalation paths. | Restructure notification channels, prune noisy alerts, and note escalation contacts in the runbook. |
| | Structured logs include correlation ids, connector, and table metadata; traces capture slow sinks. | Instrument connectors to emit correlation ids, enrich logs with metadata, and sample traces during load tests. |
| | Post-incident reviews close on time and backlog items have explicit owners. | Schedule review retrospectives, assign owners in the incident tracker, and follow up during weekly ops syncs. |
Observability Knowledge Check
Test your understanding of CDC monitoring, alerting, and operational metrics.
What are the 'golden signals' of observability for a CDC pipeline?
The golden signals for CDC are: Lag (how far behind consumers are), Throughput (events per second), Error Rate (failures, retries), and Saturation (resource utilization). These metrics provide a holistic view of pipeline health and help detect issues before they impact downstream systems.
Why is consumer lag a critical metric in CDC?
Consumer lag measures the delay between when an event is produced and when it's consumed. High lag indicates consumers can't keep up with production rate, leading to stale data in downstream systems. Monitoring lag helps detect performance issues, resource constraints, or failures early.
What should you alert on to catch CDC pipeline failures quickly?
Effective alerting focuses on actionable signals: lag exceeding SLO thresholds, error rate spikes, connector state transitions (running → failed), and database or sink unavailability. Avoid alert fatigue by setting meaningful thresholds and grouping related alerts, so that real issues surface without overwhelming operators.
What is an SLO (Service Level Objective) in CDC?
An SLO is a measurable target for service quality, such as 'p99 consumer lag must be < 5 seconds' or '99.9% of events processed without errors.' SLOs guide monitoring, alerting, and capacity planning. They help teams balance reliability investments against business needs and communicate expectations to stakeholders.
Why should you monitor database replication lag (source-side) separately from CDC consumer lag?
Source replication lag (e.g., read replica lag) affects when changes become visible in the transaction log that CDC reads. CDC consumer lag measures the delay in processing those changes. Both contribute to end-to-end latency. Monitoring both helps isolate whether delays originate at the source database or in the CDC pipeline itself.
Further resources
- Offsets & replays for the operational drills your alerts should trigger.
- Reconciliation & Offset Surgery to detect and repair out-of-sync sinks using drift detection and checksum validation.
- Event envelope to align telemetry with payload contracts.
- Materialization 101 to ensure observed freshness ties to downstream SLAs.
- Changelog & Version Matrix to verify compatible versions for your monitoring stack.