Intermediate

Observability Basics

Instrument your CDC platform with lag, throughput, error, and saturation signals so incidents are caught before consumers notice.

    Golden signals for CDC

    Borrow the SRE playbook—latency, traffic, errors, and saturation—but map each to a CDC-specific metric.

    Lag is your north star.

    Signal | CDC metric | Why it matters
    Latency | End-to-end lag (source commit to sink ingest) | Shows whether consumers see fresh data.
    Traffic | Events per second per connector | Highlights surges or drops in change volume.
    Errors | Failed batches / DLQ counts | Detects poison pills or downstream outages.
    Saturation | Connector CPU, thread pool usage, or sink credits consumed | Warns you before throttling kicks in.

    Instrument producers and consumers with a shared tagging scheme (connector, source, target) so dashboards can drill from fleet-wide to single-table views without manual filtering.
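
    A minimal sketch of the shared tagging scheme, using only the Python standard library. The `Tags` fields mirror the (connector, source, target) scheme described above; `LagRecorder`, its methods, and the connector and table names are hypothetical and stand in for whatever metrics client you actually use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Tags:
    """Shared label set applied to every metric (names are illustrative)."""
    connector: str
    source: str
    target: str


class LagRecorder:
    """Keeps end-to-end lag samples in memory, keyed by the shared tags."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, tags, source_commit_ts, sink_ingest_ts):
        # End-to-end lag = sink ingest time minus source commit time (seconds).
        self._samples[tags].append(sink_ingest_ts - source_commit_ts)

    def max_lag(self, **filters):
        # Filter by any subset of tags, e.g. connector="orders-cdc".
        values = [
            lag
            for tags, lags in self._samples.items()
            if all(getattr(tags, key) == value for key, value in filters.items())
            for lag in lags
        ]
        return max(values) if values else None


recorder = LagRecorder()
orders = Tags("orders-cdc", "pg.public.orders", "warehouse.orders")
users = Tags("users-cdc", "pg.public.users", "warehouse.users")
recorder.record(orders, source_commit_ts=100.0, sink_ingest_ts=112.5)
recorder.record(users, source_commit_ts=100.0, sink_ingest_ts=103.0)

fleet_max = recorder.max_lag()                       # fleet-wide view
users_max = recorder.max_lag(connector="users-cdc")  # single-connector view
```

    Because every sample carries the same three tags, the fleet-wide and single-connector views come from the same data and differ only in the filter.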

    Dashboards that reduce pager noise

    • Connector overview. Lag histogram, throughput trend, restart count, and DLQ volume in one view.
    • Source health. Log retention, replication slot usage, and database write rates to understand back pressure risk.
    • Sink success. Merge duration, rows touched, and error codes broken down by target system.

    Link each graph to the precise runbook step. On-call analysts should not have to guess which knob to turn.

    Sample Grafana Dashboard

    We've created a production-ready Grafana dashboard that includes all the golden signals for CDC monitoring. The dashboard features 8 panels covering lag, throughput, errors, connector status, DLQ volume, task health, restarts, and batch processing time.

    What's included in the sample dashboard
    • Replication lag gauge — Visual indicator with green/yellow/red thresholds
    • Throughput chart — Source and sink records per second over time
    • Error rate graph — Track failed batches and error logs
    • Connector status panel — At-a-glance health check for all connectors
    • DLQ volume tracker — Monitor dead letter queue growth
    • Task running ratio — Detect saturation before it impacts throughput
    • Restart counter — Identify unstable connectors
    • Batch timing — Track processing performance trends

    Alert policy that respects sleep

    Primary alerts
    • Lag exceeds threshold for N minutes.
    • No offset commits in the last 5 minutes while events continue arriving.
    • DLQ growth rate breaches baseline.
    Warning alerts
    • Source log retention below replay budget.
    • Connector restart count above expected range.

    Route primary alerts to the on-call rotation and copy product stakeholders only for customer-visible impact.
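
    The first two primary alerts can be sketched as plain functions over a list of scrapes. This is an illustration of the conditions only, not the Prometheus rules themselves; the `Sample` fields, function names, and the 300-second lag threshold are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One scrape of a connector (field names are illustrative)."""
    ts: float               # scrape time in seconds
    lag_seconds: float      # end-to-end lag at scrape time
    committed_offset: int   # last offset the connector committed
    source_offset: int      # latest offset available at the source


def lag_sustained(samples, threshold_s, for_s):
    """True if lag stayed above threshold_s for at least the last for_s seconds.

    Expects samples ordered oldest first.
    """
    if not samples:
        return False
    latest = samples[-1].ts
    window = [s for s in samples if s.ts >= latest - for_s]
    # Require enough history to span for_s, so a single spike cannot fire.
    covers = latest - samples[0].ts >= for_s
    return covers and all(s.lag_seconds > threshold_s for s in window)


def commits_stalled(samples, window_s=300):
    """True if no offset was committed in the window while the source advanced."""
    if not samples:
        return False
    latest = samples[-1].ts
    window = [s for s in samples if s.ts >= latest - window_s]
    if len(window) < 2:
        return False
    no_commits = window[-1].committed_offset == window[0].committed_offset
    events_arriving = window[-1].source_offset > window[0].source_offset
    return no_commits and events_arriving


# A stuck connector: lag stays high, commits frozen, source keeps moving.
stuck = [Sample(t, 400.0, 1000, 1000 + t) for t in range(0, 301, 60)]
# A healthy connector: low lag, commits keep pace with the source.
healthy = [Sample(t, 10.0, 1000 + t, 1000 + t) for t in range(0, 301, 60)]

stuck_alerts = (lag_sustained(stuck, 300, 300), commits_stalled(stuck))
healthy_alerts = (lag_sustained(healthy, 300, 300), commits_stalled(healthy))
```

    Pairing the commit check with source progress is what keeps an idle table from paging anyone: no commits and no new events is quiet, not broken.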

    Sample Alert Configurations

    We provide production-ready Prometheus alert rules that implement all the recommendations above. The alert pack includes 15 alerts across three categories, including:

    • Primary alerts (critical severity) — Wake up on-call for immediate action:
      • High replication lag (> 5 minutes)
      • No offset commits with active events
      • DLQ volume spike (3x baseline)
      • Connector not running
      • High error rate
    • Warning alerts — Follow up during business hours:
      • Source log retention low
      • Excessive connector restarts
      • Low throughput (90% below baseline)
      • Task saturation (running ratio < 70%)
      • Batch processing slowing
    • SLO-based alerts — Track service-level objectives:
      • Freshness SLO breach (P99 lag > 5 min)
      • Availability SLO breach (< 99% uptime)
      • Data completeness at risk

    Each alert includes detailed annotations with impact assessment, runbook links, and remediation steps.
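
    The baseline-relative thresholds above (3x for DLQ volume, 90% below for throughput) could be evaluated roughly as follows. Using the median of recent samples as the baseline is an assumption made for this sketch, and the function names are hypothetical.

```python
from statistics import median


def dlq_spike(current_rate, baseline_rates, factor=3.0):
    """True when the current DLQ rate is at least `factor` times the baseline median."""
    baseline = median(baseline_rates)
    # With a zero baseline, any DLQ traffic at all counts as a spike.
    if baseline == 0:
        return current_rate > 0
    return current_rate >= factor * baseline


def low_throughput(current_eps, baseline_eps, drop=0.90):
    """True when events per second fell at least `drop` below the baseline median."""
    baseline = median(baseline_eps)
    return baseline > 0 and current_eps <= (1.0 - drop) * baseline


recent_dlq = [2, 3, 2, 4, 3]           # DLQ messages per minute, median 3
recent_eps = [900, 1000, 1100, 1000]   # events per second, median 1000

dlq_fires = dlq_spike(12, recent_dlq)            # 12 >= 3 * 3
throughput_fires = low_throughput(50, recent_eps)  # 50 is 95% below 1000
```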

    Routing Alerts to External Systems

    While Prometheus alerts appear in the Prometheus UI, you'll want to route them to external systems for real-world operations. We provide an Alertmanager configuration that adds support for:

    • Slack — Send warning alerts to dedicated channels
    • PagerDuty — Wake up on-call for critical issues
    • Email — Notify stakeholders via SMTP
    • Custom webhooks — Integrate with any system via HTTP

    The Alertmanager configuration includes production-ready features:

    • Alert grouping by connector and severity to reduce noise
    • Intelligent inhibition rules (e.g., connector down suppresses lag alerts)
    • Rate limiting with smart repeat intervals (1h for critical, 12h for warnings)
    • Maintenance windows for planned downtime
    • Deduplication to prevent alert storms
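
    To make inhibition and grouping concrete, here is a small Python model of the behavior. It is a conceptual sketch, not Alertmanager's configuration format or implementation, and the alert names (`ConnectorDown`, `HighReplicationLag`, `ExcessiveRestarts`) are hypothetical.

```python
from collections import defaultdict


def apply_inhibition(alerts):
    """Drop lag alerts for connectors that also have a ConnectorDown alert firing."""
    down = {a["connector"] for a in alerts if a["name"] == "ConnectorDown"}
    return [
        a for a in alerts
        if not (a["name"] == "HighReplicationLag" and a["connector"] in down)
    ]


def group_alerts(alerts):
    """Group by (connector, severity) so each group yields one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["connector"], alert["severity"])].append(alert["name"])
    return dict(groups)


firing = [
    {"name": "ConnectorDown", "connector": "orders-cdc", "severity": "critical"},
    {"name": "HighReplicationLag", "connector": "orders-cdc", "severity": "critical"},
    {"name": "HighReplicationLag", "connector": "users-cdc", "severity": "critical"},
    {"name": "ExcessiveRestarts", "connector": "users-cdc", "severity": "warning"},
]

notifications = group_alerts(apply_inhibition(firing))
```

    Four firing alerts collapse into three notifications, and the lag alert for the connector that is already known to be down is suppressed.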

    Cloud-managed alerting alternatives

    For production deployments, consider using managed services instead of self-hosting Alertmanager:

    • Grafana Cloud Alerting — Built-in integrations, mobile app, escalation chains
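    • AWS CloudWatch — Native AWS integration; note that Prometheus remote_write targets Amazon Managed Service for Prometheus, while CloudWatch typically ingests Prometheus metrics through the CloudWatch agent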
    • Datadog — APM tracing, log correlation, incident management
    • PagerDuty Event Intelligence — ML-based noise reduction, on-call management
    • Opsgenie — Advanced routing, incident response, status pages

    Set service-level objectives

    • Freshness SLO: 99% of events land in the warehouse within 5 minutes of source commit.
    • Availability SLO: Change stream downtime < 15 minutes per 30-day window.
    • Data completeness: 99.99% of source transactions represented in the target within 1 hour.

    Track error budgets alongside incident counts. When the budget is nearly spent, freeze risky changes and prioritize stability work.
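
    A rough sketch of the error-budget arithmetic for the availability and freshness SLOs above. The 10% freeze cutoff is an assumption chosen for illustration, as are the function names.

```python
def availability_budget_remaining(downtime_minutes, budget_minutes=15.0):
    """Fraction of the 30-day downtime budget still unspent (negative if overspent)."""
    return 1.0 - downtime_minutes / budget_minutes


def freshness_budget_remaining(total_events, late_events, target=0.99):
    """Fraction of the allowed late events still unspent.

    With a 99% target, 1% of all events may land later than 5 minutes.
    """
    allowed_late = (1.0 - target) * total_events
    if allowed_late == 0:
        return 0.0 if late_events else 1.0
    return 1.0 - late_events / allowed_late


def should_freeze(remaining, floor=0.10):
    """Freeze risky changes when less than `floor` of the budget is left."""
    return remaining < floor


# 14 of 15 allowed downtime minutes used: about 7% of the budget is left.
availability_left = availability_budget_remaining(14.0)
# 2,500 late events out of 1,000,000: 10,000 were allowed, so 75% is left.
freshness_left = freshness_budget_remaining(1_000_000, 2_500)

freeze_for_availability = should_freeze(availability_left)
freeze_for_freshness = should_freeze(freshness_left)
```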

    Tracing & logs

    Enrich logs with correlation identifiers: connector name, partition, offset, and source transaction id. Forward them to a central log store that keeps at least the last 30 days searchable.

    For custom processors, propagate trace headers from ingestion through to sinks so you can follow a change end-to-end.

    • Emit structured logs (JSON) so search queries do not rely on fragile parsing.
    • Tag spans with op (`c`, `u`, `d`) to correlate deletes with downstream compaction.
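
    One way to emit such logs with Python's standard `logging` module is sketched below. The JSON key names (`source_txid` and so on), the logger name, and the sample values are assumptions; adapt them to your own schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including correlation fields."""

    # Correlation identifiers named in the text; key spellings are assumptions.
    FIELDS = ("connector", "partition", "offset", "source_txid", "op")

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("cdc.sink")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output in `stream` only

logger.info(
    "delete applied",
    extra={"connector": "orders-cdc", "partition": 3, "offset": 4211,
           "source_txid": "771203", "op": "d"},
)

entry = json.loads(stream.getvalue())
```

    Because each line is a self-contained JSON object, the log store can index `connector`, `offset`, and `op` as fields instead of relying on regular expressions.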

    Complete monitoring stack

    Ready to see it all working together? We provide a complete Docker Compose stack that includes Kafka Connect with JMX metrics enabled, Prometheus for collection, and Grafana with pre-configured dashboards.

    What's in the monitoring stack

    • Kafka Connect with JMX metrics exposed
    • JMX Exporter to bridge JMX metrics to Prometheus format
    • Prometheus with our alert rules pre-loaded
    • Grafana with the CDC dashboard automatically provisioned
    • Alertmanager (optional) for routing alerts to Slack, PagerDuty, email
    • PostgreSQL source database (Debezium example)
    • kcat for Kafka CLI operations

    Getting started in 5 minutes

    1. Download the docker-compose file and supporting configs
    2. Run `docker-compose -f docker-compose-observability.yml up -d`
    3. Create a test connector via the REST API
    4. Open Grafana at http://localhost:3000 (admin/admin)
    5. Watch metrics flow in real time

    Want external alerting? Add the Alertmanager extension to route alerts to Slack, PagerDuty, or email. See the Alert Policy section above for details.

    The stack comes with everything wired up: datasources, dashboards, alert rules, and JMX metric mappings. No manual configuration required.
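
    Step 3 above (creating a test connector via the REST API) might look like the following sketch. Kafka Connect accepts a POST to `/connectors` with a `name` and a `config` object; the port 8083 URL, the connector name, and every hostname, credential, and database name below are placeholders that may not match the provided stack. `topic.prefix` is the Debezium 2.x setting; older releases use `database.server.name` instead.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083/connectors"  # assumed Kafka Connect REST address


def build_connector_request(name, config, url=CONNECT_URL):
    """Build the POST request Kafka Connect expects: {"name": ..., "config": {...}}."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Debezium PostgreSQL settings; host, credentials, and names are placeholders.
config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "inventory",
    "topic.prefix": "demo",
    "plugin.name": "pgoutput",
}
request = build_connector_request("demo-postgres-cdc", config)

# Uncomment once the stack is running; urlopen raises on an error response.
# with urllib.request.urlopen(request) as response:
#     print(response.status, json.load(response)["name"])
```

    The network call is left commented out so the snippet can be checked without a running stack.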

    Files included in the monitoring stack
    • docker-compose-observability.yml — Main orchestration file
    • docker-compose.alerts.yml — Optional Alertmanager extension
    • prometheus.yml — Scrape configuration and targets
    • prometheus-with-alertmanager.yml — Prometheus config with Alertmanager enabled
    • prometheus-alerts.yml — Alert rule definitions
    • alertmanager.yml — Alert routing and receiver configuration
    • jmx-exporter-config.yml — JMX to Prometheus metric mappings
    • grafana-datasources.yml — Prometheus datasource config
    • grafana-dashboards.yml — Dashboard provisioning
    • grafana-kafka-connect-dashboard.json — Pre-built dashboard
    • OBSERVABILITY-SETUP.md — Complete setup guide

    The monitoring stack is well suited to local development, testing alert rules, and demonstrating CDC observability to stakeholders. For production deployment, see the setup guide for scaling and security considerations.

    Runbook hygiene

    1. Keep runbooks in version control; update them after every incident.
    2. Link dashboards, alert names, and remediation scripts directly in the doc.
    3. Review and rehearse quarterly with the on-call crew.

    Post-incident reviews that stick

    1. Capture customer impact, timeline, and contributing factors within 24 hours.
    2. Assign action items that tie back to SLO breaches (instrumentation gaps, runbook fixes).
    3. Close the loop by demoing fixes during the next reliability review.

    Sharing the review with data consumers builds trust and keeps shadow monitoring from proliferating across teams.

    Observability readiness scorecard

    Use this scorecard during production readiness reviews to ensure monitoring covers the entire CDC surface area—from connectors to downstream consumers.

    Monitoring maturity checkpoints

    Ready when… | If not, shore it up by…
    Lag, throughput, error rate, and saturation dashboards exist with ownership and SLO targets. | Add missing charts, wire alerts to on-call, and document runbook links beside each metric.
    Pager policies differentiate wake-you-up incidents from follow-up tasks with clear escalation paths. | Restructure notification channels, prune noisy alerts, and note escalation contacts in the runbook.
    Structured logs include correlation ids, connector, and table metadata; traces capture slow sinks. | Instrument connectors to emit correlation ids, enrich logs with metadata, and sample traces during load tests.
    Post-incident reviews close on time and backlog items have explicit owners. | Schedule review retrospectives, assign owners in the incident tracker, and follow up during weekly ops syncs.

    Observability Knowledge Check

    Test your understanding of CDC monitoring, alerting, and operational metrics.

    Q1. What are the 'golden signals' of observability for a CDC pipeline?

    Q2. Why is consumer lag a critical metric in CDC?

    Q3. What should you alert on to catch CDC pipeline failures quickly?

    Q4. What is an SLO (Service Level Objective) in CDC?

    Q5. Why should you monitor database replication lag (source-side) separately from CDC consumer lag?
