Level: Beginner

Interactive Introduction to CDC

Start here if you're moving from batch ETL to real-time pipelines. You'll get the foundations, architectural trade-offs, and tooling map you need before diving into the advanced playbooks.

Change Data Capture (CDC) is the discipline of replicating data changes from a source database to downstream systems in near real time—without heavy full refreshes.

Outcome

You leave with a mental model for CDC, the vocabulary to compare approaches, and a roadmap into the deeper modules.

Who it’s for

  • Data engineers modernizing nightly batch ETL into streaming pipelines.
  • Platform and infra teams tasked with selecting CDC tooling.
  • Architects who need common language across producers, brokers, and sinks.

For decades, moving data between systems meant relying on slow, resource-intensive batch jobs that would run overnight.

That approach falls short when downstream systems need fresh data within seconds or minutes rather than the next morning.

How do you keep disparate systems synchronized in near real time without overwhelming your databases? This is the core problem that Change Data Capture (CDC) solves.

How to use this series

  1. Start with the overview: Align on terminology and the CDC value proposition in Series Overview.
  2. Use this intro as the foundation: Anchor on methods, architecture, and tooling landscape before exploring edge cases.
  3. Move into advanced patterns: Visit Exactly-Once, Snapshotting, and Schema Evolution for deeper dives.
  4. Practice and operationalize: Follow the labs and troubleshooting guides to validate your setup in real environments.

CDC methods

Choose the capture approach that matches your constraints. Each method below opens with its key trade-offs, followed by a full breakdown.

Log-based CDC (The Gold Standard)

  • ✅ Lowest performance impact by tailing committed transaction logs.
  • ✅ Ordered change events support replay and downstream idempotence.
  • ⚠️ Requires elevated privileges and careful log-retention tuning.

Log-based CDC (The Gold Standard): Reads committed changes from the database’s transaction log (PostgreSQL WAL, MySQL binlog, SQL Server transaction log, Oracle redo log). It is low-impact on OLTP but not free: log retention, I/O, and replication slot/archival settings must be tuned. It reliably captures inserts, updates, and deletes in order. DDL capture depends on the database and connector configuration.

PostgreSQL's Write-Ahead Log (WAL) records every change before it is applied to the data files, which ensures data integrity and durability. Log-based CDC reuses this existing, ordered stream of changes for efficient replication.

MySQL's binary log (binlog) records the events that change the database state. With row-based logging (binlog_format=ROW), it provides the granular, complete row-level change events that CDC needs.

  • Advantage: Low Performance Impact. Tails existing logs; no source-table scans or triggers.
  • Advantage: High Fidelity. Captures inserts, updates, and deletes, preserving order.
  • Disadvantage: Complexity. Requires DB-level privileges and correct log retention (wal_keep_size or logical slots in Postgres; binlog purge in MySQL; T-log backups in SQL Server).
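
To make the retention concern concrete, here is a toy model of an append-only log with per-consumer confirmed positions, loosely analogous to a replication slot. It is a sketch, not a real database API: `ToyLog` and all of its methods are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLog:
    """Append-only change log with per-consumer confirmed positions (toy model)."""
    entries: dict = field(default_factory=dict)    # position -> change record
    next_pos: int = 0                              # position assigned to the next append
    confirmed: dict = field(default_factory=dict)  # consumer -> next position it still needs

    def append(self, change):
        pos = self.next_pos
        self.entries[pos] = change
        self.next_pos += 1
        return pos

    def register(self, consumer):
        # A new consumer starts at the current end of the log.
        self.confirmed[consumer] = self.next_pos

    def read_from(self, consumer):
        # Return (position, change) pairs the consumer has not confirmed yet, in order.
        start = self.confirmed[consumer]
        return [(p, self.entries[p]) for p in range(start, self.next_pos)]

    def confirm(self, consumer, pos):
        # The consumer has durably processed everything up to and including pos.
        self.confirmed[consumer] = pos + 1

    def truncate(self):
        # Only entries that every registered consumer has confirmed can be dropped.
        safe = min(self.confirmed.values(), default=self.next_pos)
        removed = [p for p in self.entries if p < safe]
        for p in removed:
            del self.entries[p]
        return len(removed)


log = ToyLog()
log.register("cdc_connector")
for i in range(3):
    log.append({"op": "c", "id": i})

# The connector is paused: nothing is confirmed, so nothing can be truncated.
stalled_removed = log.truncate()

# The connector catches up and confirms the last position it processed.
batch = log.read_from("cdc_connector")
log.confirm("cdc_connector", batch[-1][0])
caught_up_removed = log.truncate()
```

In this model a stalled consumer pins the whole log, which is the same shape of problem as a paused connector holding back log truncation on a real source.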

Trigger-Based CDC

  • ✅ Works when log access is impossible or restricted.
  • ⚠️ Adds synchronous overhead to every OLTP transaction.
  • ⚠️ Triggers and shadow tables drift with schema changes.

Trigger-Based CDC: Uses database triggers (on INSERT/UPDATE/DELETE) to write change rows into shadow/audit tables. It is sometimes paired with an “outbox”-style table, although the Transactional Outbox pattern can also be implemented without triggers by writing the outbox row in the same application transaction. Triggers are explicit and available in most relational databases, but they add overhead to every transaction and so directly slow your primary application's OLTP workload.

  • Disadvantage: High Performance Impact. Synchronous trigger writes increase latency/lock contention.
  • Disadvantage: Management Complexity. Must maintain triggers per table; brittle with schema drift.
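
A minimal sketch of the trigger approach, using SQLite from the Python standard library as a stand-in for the source database. The table and trigger names (`orders`, `orders_changes`, `orders_ai`, and so on) are hypothetical, and trigger syntax differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);

-- Shadow table: one row per captured change, in insertion order.
CREATE TABLE orders_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op        TEXT NOT NULL,      -- 'c' = insert, 'u' = update, 'd' = delete
    order_id  INTEGER NOT NULL,
    status    TEXT                -- new status, NULL for deletes
);

CREATE TRIGGER orders_ai AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('c', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_au AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('u', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_ad AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('d', OLD.id, NULL);
END;
""")

# Every statement below also performs a synchronous write to the shadow table.
conn.execute("INSERT INTO orders (id, status) VALUES (1, 'new')")
conn.execute("UPDATE orders SET status = 'paid' WHERE id = 1")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")
conn.commit()

changes = conn.execute(
    "SELECT op, order_id, status FROM orders_changes ORDER BY change_id"
).fetchall()
```

Here `changes` holds all four operations in order, including the intermediate update and the delete. The cost is the extra write inside each source transaction and three triggers to maintain per table.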

Query-Based CDC (Polling)

  • ✅ Simplest to bootstrap with schedulers and existing SQL.
  • ⚠️ Misses intermediate updates and hard deletes without tombstones.
  • ⚠️ Source database load grows with polling frequency.

Query-Based CDC (Polling): The simplest method to implement, but also the most fragile. This approach repeatedly queries source tables for rows that have changed, typically identified by a last_updated timestamp. However, it puts a significant load on the source database, can miss updates if multiple changes occur between polls, and cannot observe hard deletes (unless you keep tombstone/soft-delete markers).

  • Disadvantage: Cannot Capture Deletes. Deleted rows aren’t selectable.
  • Disadvantage: High Read Load and Latency. Freshness tied to polling interval.
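
The gaps described above can be reproduced with a small polling sketch, again using SQLite as a stand-in. The `poll_changes` helper and the integer logical clock in `last_updated` are assumptions made for illustration; a real poller would typically use timestamps or a version column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, last_updated INTEGER)"
)


def poll_changes(conn, watermark):
    """Return (rows changed after the watermark, new watermark)."""
    rows = conn.execute(
        "SELECT id, status, last_updated FROM orders "
        "WHERE last_updated > ? ORDER BY last_updated, id",
        (watermark,),
    ).fetchall()
    new_watermark = max((r[2] for r in rows), default=watermark)
    return rows, new_watermark


# Logical clock ticks stand in for timestamps.
conn.execute("INSERT INTO orders VALUES (1, 'new', 1)")
conn.execute("INSERT INTO orders VALUES (2, 'new', 2)")
first_poll, watermark = poll_changes(conn, 0)

# Between polls: order 1 changes twice, order 2 is hard-deleted.
conn.execute("UPDATE orders SET status = 'paid', last_updated = 3 WHERE id = 1")
conn.execute("UPDATE orders SET status = 'shipped', last_updated = 4 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")
second_poll, watermark = poll_changes(conn, watermark)
```

The second poll returns only the final state of order 1. The intermediate 'paid' state and the deletion of order 2 never reach the consumer.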

Methods at a Glance

Use the scorecard to choose the default (log-based) and spot when alternatives make sense.

| Feature | Log-Based CDC | Trigger-Based CDC | Query-Based CDC (Polling) |
| --- | --- | --- | --- |
| Performance Impact | Very Low. Minimal overhead as it reads existing transaction logs. | High. Adds a synchronous write to every transaction. | Medium to High. Heavy repetitive reads on the source DB. |
| Reliability | Very High. Captures all committed changes, including hard deletes, in order. | High. Captures all DML via shadow table. | Low. Misses deletes and intermediate updates. |
| Data Latency | Very Low. Near real-time. | Low. Near real-time as changes occur. | High. Determined by polling interval. |
| Implementation Complexity | High. Specialized tools, DB config, elevated perms. | Medium. Write/maintain triggers and shadow tables. | Low. Straightforward SQL + scheduler. |
| Source Schema Impact | None. No schema changes. | Medium. Shadow/audit tables required. | High. Extra columns (timestamp/version/status). |

The comparison above covers the three ways to implement CDC and shows why log-based CDC is the gold standard.

Match priorities to methods

Log-Based CDC

Choose log-based CDC when you must minimize impact on the source database and need ordered, replayable change events.

Core components

The most common and robust architecture for a modern CDC pipeline combines a log-based capture mechanism with a scalable streaming platform.

  • Source Database: Operational DB whose WAL/redo log records committed changes.
  • CDC Connector: Debezium-like agent tails the log and emits structured change events.
  • Streaming Platform / Message Bus: Kafka (or similar) decouples source from consumers.
  • Schema Registry: Manages event schemas (Avro/JSON/Protobuf), enforces compatibility, and protects consumers during evolution.
  • Downstream Consumers: Warehouse loaders, search indexers, caches, apps, etc.

Figure: CDC pipeline architecture showing the complete data flow from the source database through the WAL/binlog, the connector (Debezium/Airbyte), and the streaming platform (Kafka/Kinesis) to sink destinations (warehouse, search, cache).

The CDC pipeline architecture: Source DB → Connector → Stream → Sink. Map components to deeper dives: Source DB → Snapshotting, Connector → Connector Builder, Stream → Strategy, and Sink → Merge Cookbook.

How log-based CDC keeps order and consistency

Log tailers capture changes in commit order (per transaction). By streaming these events to a durable bus, you keep a consistent, replayable history that downstream systems can apply idempotently.

  1. Capture: The connector reads WAL/binlog entries in commit order, preserving per-key ordering.
  2. Transform: Enrich each event with metadata (op, LSN/SCN, before/after images) and publish to topics.
  3. Deliver: Topics are partitioned by key for ordering; consumers use MERGE (upsert) semantics to apply changes idempotently.
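
The steps above can be sketched with an in-memory sink. This is an illustration under stated assumptions rather than any connector's actual behavior: `partition_for`, `apply_event`, and `live_rows` are hypothetical helpers, the event shape is simplified, and `lsn` is modeled as a plain increasing integer even though real log positions are database-specific.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical topic partition count


def partition_for(key: str) -> int:
    # Stable hash so every event for the same key lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def apply_event(sink: dict, event: dict) -> bool:
    """Apply one change event; return False if it was a stale or duplicate event."""
    key, lsn = event["key"], event["lsn"]
    current = sink.get(key)
    if current is not None and lsn <= current["lsn"]:
        return False  # already applied (replay) or older than what the sink holds
    if event["op"] == "d":
        # Keep a tombstone so a replayed older upsert cannot resurrect the row.
        sink[key] = {"lsn": lsn, "deleted": True, "row": None}
    else:
        sink[key] = {"lsn": lsn, "deleted": False, "row": event["after"]}
    return True


def live_rows(sink: dict) -> dict:
    # Current state of the sink, ignoring tombstones.
    return {k: v["row"] for k, v in sink.items() if not v["deleted"]}


events = [
    {"key": "order-1", "op": "c", "lsn": 100, "after": {"status": "new"}},
    {"key": "order-2", "op": "c", "lsn": 101, "after": {"status": "new"}},
    {"key": "order-1", "op": "u", "lsn": 102, "after": {"status": "shipped"}},
    {"key": "order-2", "op": "d", "lsn": 103, "after": None},
]

sink = {}
for e in events:
    apply_event(sink, e)
state_after_first_pass = live_rows(sink)

# Replay the whole stream (at-least-once delivery): nothing changes.
replay_applied = [apply_event(sink, e) for e in events]
state_after_replay = live_rows(sink)
```

Comparing the stored position before writing is what makes the apply step safe to repeat. In a warehouse the same idea is usually expressed as a MERGE keyed on the primary key with a position or version guard.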

Key takeaways

  • Log-based CDC minimizes source load and captures all operation types (inserts, updates, deletes).
  • Downstream systems decide retention (history vs latest state).
  • Replay is built-in: rewind offsets to rebuild a sink.

Transactional outbox in practice

Pair an application table write with an outbox insert in the same transaction. A relay tails the outbox table and publishes events to a broker topic. Consumers upsert idempotently.

  1. Write: The app writes the business row and the outbox row within the same database transaction.
  2. Publish: A relay agent emits each outbox row as an event to the broker/topic with at-least-once delivery.
  3. Consume idempotently: Consumers upsert/dedupe so replays are safe, producing an “exactly-once” effect at the sink. The scope of exactly-once semantics (EOS) is per sink/system, not global.

Sample transaction and emitted event

Transaction (SQL):

    BEGIN;
    -- Business row
    INSERT INTO orders (id, status, total, updated_at)
    VALUES (:order_id, :status, :total, NOW());

    -- Outbox row, committed atomically with the business row.
    -- gen_random_uuid() is built in from PostgreSQL 13; earlier versions need pgcrypto.
    INSERT INTO order_outbox (event_id, aggregate_id, aggregate_type, payload, created_at)
    VALUES (gen_random_uuid(), :order_id, 'order', jsonb_build_object('op', 'c', 'status', :status, 'total', :total), NOW());
    COMMIT;

Emitted event (JSON):

    {
      "event_id": "b2e03f62-3d0f-4b0e-9a6a-a52b8cfe9c51",
      "op": "c",
      "entity": {
        "id": "order-458",
        "status": "shipped",
        "total": 125.99
      },
      "lsn": "898799:4",
      "ts_ms": 1700000123456
    }

Deep dive: Exactly-Once & the Outbox pattern.

App + DB → Outbox table → Broker/Topic → Consumer (UPSERT)
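
The write, publish, and consume steps can be sketched end to end with SQLite standing in for the application database and a Python callback standing in for the broker. Everything named here is an assumption for illustration: the `seq` and `published` columns, `place_order`, `relay_once`, and the `Consumer` class are hypothetical. A production relay would more likely tail the outbox table through log-based CDC than poll it.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, total REAL);
CREATE TABLE order_outbox (
    seq          INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id     TEXT NOT NULL UNIQUE,
    aggregate_id TEXT NOT NULL,
    payload      TEXT NOT NULL,
    published    INTEGER NOT NULL DEFAULT 0
);
""")


def place_order(conn, order_id, status, total):
    # Business row and outbox row commit (or roll back) together.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, status, total))
        conn.execute(
            "INSERT INTO order_outbox (event_id, aggregate_id, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), order_id,
             json.dumps({"op": "c", "status": status, "total": total})),
        )


def relay_once(conn, publish):
    """Publish unpublished outbox rows in order, then mark them published."""
    rows = conn.execute(
        "SELECT seq, event_id, aggregate_id, payload FROM order_outbox "
        "WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_id, aggregate_id, payload in rows:
        publish({"event_id": event_id, "id": aggregate_id, **json.loads(payload)})
        # A crash between publish and this update causes a re-publish: at-least-once.
        with conn:
            conn.execute("UPDATE order_outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


class Consumer:
    def __init__(self):
        self.seen = set()   # processed event_ids, used for deduplication
        self.orders = {}    # sink state keyed by order id

    def handle(self, event):
        if event["event_id"] in self.seen:
            return False    # duplicate delivery, ignore
        self.seen.add(event["event_id"])
        self.orders[event["id"]] = {"status": event["status"], "total": event["total"]}
        return True


consumer = Consumer()
delivered = []


def publish(event):
    # Stand-in for a broker: record the event and hand it to the consumer.
    delivered.append(event)
    consumer.handle(event)


place_order(conn, "order-458", "shipped", 125.99)
published_count = relay_once(conn, publish)

# Simulate a duplicate delivery of the same event.
duplicate_applied = consumer.handle(delivered[0])
```

Deduplicating on `event_id` at the consumer is what turns at-least-once publishing into an exactly-once effect for this one sink.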

Operational gotchas (day-one checks)

  • [Must] Log retention & bloat: Logical slots (PG) or paused connectors can block log truncation; size WAL/binlog/T-log appropriately and alert on backlog. See: Troubleshooting playbook.
  • [Must] Privileges & network: Connectors need replication permissions and stable network paths (no NAT timeouts) to avoid stalls. See: Deployment checklist.
  • [Should] DDL changes: Schema changes may need registry rules and phased rollouts to avoid breaking consumers. See: Schema change guide.
  • [Must] Sinks are the hard part: Upsert/delete semantics and idempotency in the warehouse/search system determine data correctness under replay. See: Merge & upsert patterns.
  • [Nice] Instrument CDC-specific metrics (lag, slot depth, consumer replay time) so you can rehearse incident response without touching production. See: Observability tooling.
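
As a starting point for the lag and backlog checks above, here is a small sketch. It assumes events carry a `ts_ms` source timestamp as in the sample event earlier in this section. The function names and the default thresholds (60 seconds, 5 GiB) are hypothetical and should be tuned to your own pipeline.

```python
import time
from typing import Optional


def event_lag_ms(event: dict, now_ms: Optional[int] = None) -> int:
    """Milliseconds between the event's source timestamp and now (never negative)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return max(0, now_ms - event["ts_ms"])


def check_pipeline(lag_ms: int, backlog_bytes: int,
                   max_lag_ms: int = 60_000,
                   max_backlog_bytes: int = 5 * 1024**3) -> list:
    """Return a list of alert strings; an empty list means healthy."""
    alerts = []
    if lag_ms > max_lag_ms:
        alerts.append(f"end-to-end lag {lag_ms} ms exceeds {max_lag_ms} ms")
    if backlog_bytes > max_backlog_bytes:
        alerts.append(f"log backlog {backlog_bytes} bytes exceeds {max_backlog_bytes} bytes")
    return alerts


sample_event = {"op": "c", "ts_ms": 1700000123456}
lag = event_lag_ms(sample_event, now_ms=1700000125456)
healthy = check_pipeline(lag, backlog_bytes=0)
unhealthy = check_pipeline(120_000, backlog_bytes=6 * 1024**3)
```

How `backlog_bytes` is obtained depends on the source database, so it is left as an input here.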

The CDC Tooling Ecosystem

Picking CDC tooling is a trade-off across capture method (prefer log-based), ops model (managed vs self-managed), source/target coverage & versions, and exactly-once behavior at the sink (idempotent upserts + replay).

CDC platforms

Debezium + Kafka Connect

Open-source log-based CDC connectors. Flexible; you operate Kafka/Connect or use managed Kafka.

Tags: open source, log-based, DIY ops

Matillion (Data Loader / Designer)

Wizard-driven CDC into cloud warehouses plus orchestration/transform in one place.

Tags: managed, log-based, ELT + orchestration

Fivetran

Managed connectors with log-based CDC options; emphasizes quick setup and hands-off operations.

Tags: managed, log-based

Precisely (Connect)

Enterprise replication including mainframe/legacy into modern targets.

Tags: enterprise, log-based, heterogeneous

Qlik Replicate (formerly Attunity Replicate)

UI-driven enterprise CDC across a wide range of sources/targets.

Tags: enterprise, log-based

Oracle GoldenGate

Oracle’s CDC/replication stack (supports several non-Oracle endpoints) for mission-critical use.

Tags: enterprise, log-based

Test Your Understanding

Check your knowledge of CDC fundamentals with these questions.

Q1

What is the primary advantage of Change Data Capture (CDC) over traditional batch ETL?

Q2

Which CDC method directly reads the database's transaction log (WAL)?

Q3

What is the Transactional Outbox pattern?

Q4

Which component in a CDC pipeline is responsible for converting database-specific log formats into a standard event format?

Q5

What is a common challenge when implementing CDC that relates to the initial state of data?

Choose your next step

Pick the path that matches the momentum you want to keep.

Real-world case study

20 min — See how an e-commerce company implemented CDC, from problem to production.

Strategy blueprint

15 min — Translate these fundamentals into a reference architecture and operating model.

Hands-on lab

45 min — Capture, stream, and replay changes with Kafka + Debezium in a guided exercise.

Tooling directory

10 min — Compare platforms and shortlist vendors aligned to your sources and targets.
