Change Data Capture (CDC) is the discipline of replicating data changes from a source database to downstream systems in near real time—without heavy full refreshes.
Outcome
You leave with a mental model for CDC, the vocabulary to compare approaches, and a roadmap into the deeper modules.
Who it’s for
- Data engineers modernizing nightly batch ETL into streaming pipelines.
- Platform and infra teams tasked with selecting CDC tooling.
- Architects who need common language across producers, brokers, and sinks.
For decades, moving data between systems meant relying on slow, resource-intensive batch jobs that would run overnight.
That approach breaks down when downstream systems need fresh data within seconds or minutes rather than the next morning.
How do you keep disparate systems synchronized in near real time without overwhelming your databases? This is the core problem that Change Data Capture (CDC) solves.
How to use this series
- Start with the overview: Align on terminology and the CDC value proposition in Series Overview.
- Use this intro as the foundation: Anchor on methods, architecture, and tooling landscape before exploring edge cases.
- Move into advanced patterns: Visit Exactly-Once, Snapshotting, and Schema Evolution for deeper dives.
- Practice and operationalize: Follow the labs and troubleshooting guides to validate your setup in real environments.
CDC methods
Choose the capture approach that matches your constraints. Each method starts with a summary of its trade-offs, followed by the full breakdown.
Log-based CDC (The Gold Standard)
- ✅ Lowest performance impact by tailing committed transaction logs.
- ✅ Ordered change events support replay and downstream idempotence.
- ⚠️ Requires elevated privileges and careful log-retention tuning.
Go deeper with the Debezium lab
Log-based CDC (The Gold Standard): Reads committed changes from the database’s transaction log (PostgreSQL WAL, MySQL binlog, SQL Server transaction log, Oracle redo). It’s low-impact on OLTP, but not free: log retention, I/O, and replication slot/archival settings must be tuned. It reliably captures inserts/updates/deletes in order. DDL capture depends on the database and connector configuration.
PostgreSQL's Write-Ahead Log (WAL) records every change before it is applied to the data files, which is what provides durability and crash recovery. Log-based CDC reuses this existing, ordered stream of changes for replication.
MySQL's binary log (binlog) records events that change the database state. When configured for row-based logging (binlog_format=ROW), it provides the granular, complete change events necessary for CDC.
- Advantage: Low Performance Impact. Tails existing logs; no source-table scans or triggers.
- Advantage: High Fidelity. Captures inserts, updates, and deletes, preserving order.
- Disadvantage: Complexity. Requires DB-level privileges and correct log retention (wal_keep_size or logical slots in Postgres; binlog purge in MySQL; T-log backups in SQL Server).
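To make the log-tailing idea concrete, here is a minimal sketch that models a transaction log as an in-memory, append-only list. The `ChangeLog` class, its integer `lsn` counter, and `read_from` are hypothetical names for illustration only; they are not a real database or connector API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class LogEntry:
    lsn: int                         # monotonically increasing log position
    op: str                          # "c" (create), "u" (update), "d" (delete)
    key: str
    after: Optional[Dict[str, Any]]  # row image after the change; None for deletes


@dataclass
class ChangeLog:
    """Toy stand-in for a WAL/binlog: an append-only list of committed changes."""
    entries: List[LogEntry] = field(default_factory=list)

    def append(self, op: str, key: str, after: Optional[Dict[str, Any]] = None) -> int:
        lsn = len(self.entries) + 1
        self.entries.append(LogEntry(lsn, op, key, after))
        return lsn

    def read_from(self, last_seen_lsn: int) -> List[LogEntry]:
        # Return only entries committed after the reader's saved position.
        return [e for e in self.entries if e.lsn > last_seen_lsn]


log = ChangeLog()
log.append("c", "order-1", {"status": "new"})
log.append("u", "order-1", {"status": "paid"})

offset = 0
first_batch = log.read_from(offset)
offset = first_batch[-1].lsn          # checkpoint after processing the batch

log.append("d", "order-1")
second_batch = log.read_from(offset)  # resumes from the checkpoint, no table scan
```

The reader never queries the source tables: it only needs a saved position, which is also why a stalled reader forces the database to retain log segments.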
Trigger-Based CDC
- ✅ Works when log access is impossible or restricted.
- ⚠️ Adds synchronous overhead to every OLTP transaction.
- ⚠️ Triggers and shadow tables drift with schema changes.
Go deeper with the transactional outbox
Trigger-Based CDC: Uses database triggers (ON INSERT/UPDATE/DELETE) to write change rows into shadow/audit tables. Sometimes paired with an “outbox”-style table, but the Transactional Outbox pattern can also be implemented without triggers by writing the outbox row in the same application transaction. While it is database-agnostic and explicit, this approach adds overhead to every transaction, directly impacting the performance of your primary application's workload (OLTP).
- Disadvantage: High Performance Impact. Synchronous trigger writes increase latency/lock contention.
- Disadvantage: Management Complexity. Must maintain triggers per table; brittle with schema drift.
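The trigger approach can be sketched with Python's built-in `sqlite3` module, using SQLite as a stand-in for the source database. The `orders` and `orders_changes` tables and the trigger names are illustrative assumptions; trigger syntax differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);

-- Shadow table that accumulates one row per change.
CREATE TABLE orders_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT,
    order_id TEXT,
    status TEXT
);

CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('c', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('u', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('d', OLD.id, OLD.status);
END;
""")

# Every application write now also performs a synchronous shadow-table write.
conn.execute("INSERT INTO orders VALUES ('order-1', 'new')")
conn.execute("UPDATE orders SET status = 'paid' WHERE id = 'order-1'")
conn.execute("DELETE FROM orders WHERE id = 'order-1'")
conn.commit()

changes = conn.execute(
    "SELECT op, order_id, status FROM orders_changes ORDER BY change_id"
).fetchall()
```

Note that three triggers are needed for a single table, and each one hard-codes the column list, which is where the schema-drift maintenance cost comes from.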
Query-Based CDC (Polling)
- ✅ Simplest to bootstrap with schedulers and existing SQL.
- ⚠️ Misses intermediate updates and hard deletes without tombstones.
- ⚠️ Source database load grows with polling frequency.
Go deeper with CDC strategy trade-offs
Query-Based CDC (Polling): The simplest method to implement, but also the most fragile. This approach repeatedly queries source tables for rows that have changed, typically identified by a last_updated timestamp. However, it puts a significant load on the source database, can miss updates if multiple changes occur between polls, and cannot observe hard deletes (unless you keep tombstone/soft-delete markers).
- Disadvantage: Cannot Capture Deletes. Deleted rows aren’t selectable.
- Disadvantage: High Read Load and Latency. Freshness tied to polling interval.
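The blind spots of polling are easy to reproduce. The sketch below uses `sqlite3` as a stand-in source and a logical integer `last_updated` column instead of a wall-clock timestamp; the `poll` helper and the table layout are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, last_updated INTEGER)"
)


def poll(conn, watermark):
    """Return rows changed since the watermark, plus the advanced watermark."""
    rows = conn.execute(
        "SELECT id, status, last_updated FROM orders "
        "WHERE last_updated > ? ORDER BY last_updated",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


conn.execute("INSERT INTO orders VALUES ('order-1', 'new', 1)")
conn.execute("INSERT INTO orders VALUES ('order-2', 'new', 2)")
first_poll, watermark = poll(conn, 0)

# Between polls: order-1 changes twice and order-2 is hard-deleted.
conn.execute("UPDATE orders SET status = 'paid', last_updated = 3 WHERE id = 'order-1'")
conn.execute("UPDATE orders SET status = 'shipped', last_updated = 4 WHERE id = 'order-1'")
conn.execute("DELETE FROM orders WHERE id = 'order-2'")
second_poll, watermark = poll(conn, watermark)
```

The second poll returns only the final `shipped` state of `order-1`: the intermediate `paid` update and the deletion of `order-2` are invisible to the poller.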
Methods at a Glance
Use the comparison table to choose the default (log-based) and spot when alternatives make sense.
| Feature | Log-Based CDC | Trigger-Based CDC | Query-Based CDC (Polling) |
|---|---|---|---|
| Performance Impact | Very Low. Minimal overhead as it reads existing transaction logs. | High. Adds a synchronous write to every transaction. | Medium to High. Heavy repetitive reads on the source DB. |
| Reliability | Very High. Captures all committed changes, including hard deletes, in order. | High. Captures all DML via shadow table. | Low. Misses deletes and intermediate updates. |
| Data Latency | Very Low. Near real-time. | Low. Near real-time as changes occur. | High. Determined by polling interval. |
| Implementation Complexity | High. Specialized tools, DB config, elevated perms. | Medium. Write/maintain triggers and shadow tables. | Low. Straightforward SQL + scheduler. |
| Source Schema Impact | None. No schema changes. | Medium. Shadow/audit tables required. | High. Extra columns (timestamp/version/status). |
| Best For | High-volume, mission-critical systems needing high fidelity + low impact. | Lower-volume systems or where log access is impossible. | Small, non-critical datasets or absolute last resort. |
Log-Based CDC
Choose log-based CDC when you must minimize impact on the source database and need ordered, replayable change events.
Core components
The most common and robust architecture for a modern CDC pipeline combines a log-based capture mechanism with a scalable streaming platform.
- Source Database: Operational DB whose WAL/redo log records committed changes.
- CDC Connector: Debezium-like agent tails the log and emits structured change events.
- Streaming Platform / Message Bus: Kafka (or similar) decouples source from consumers.
- Schema Registry: Manages event schemas (Avro/JSON/Protobuf), enforces compatibility, and protects consumers during evolution.
- Downstream Consumers: Warehouse loaders, search indexers, caches, apps, etc.
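As an illustration of what "enforces compatibility" can mean for the schema registry, here is a deliberately simplified backward-compatibility check, loosely modeled on the Avro-style rule that a newly added field needs a default. The schema representation and the `is_backward_compatible` function are assumptions for this sketch; real registries apply richer rules (type promotion, unions, aliases).

```python
from typing import Any, Dict

# A schema maps field name -> spec; a spec has a "type" and an optional "default".
Schema = Dict[str, Dict[str, Any]]


def is_backward_compatible(old: Schema, new: Schema) -> bool:
    """True if a consumer using `new` can still read events written with `old`."""
    for name, spec in new.items():
        if name not in old:
            if "default" not in spec:
                return False  # old events lack this field and nothing fills it in
        elif old[name]["type"] != spec["type"]:
            return False      # type changes are rejected in this toy rule set
    return True


v1: Schema = {"id": {"type": "string"}, "status": {"type": "string"}}
v2_with_default: Schema = {**v1, "currency": {"type": "string", "default": "USD"}}
v2_without_default: Schema = {**v1, "currency": {"type": "string"}}

accepted = is_backward_compatible(v1, v2_with_default)
rejected = is_backward_compatible(v1, v2_without_default)
```

A check like this would run when a producer registers a new schema version, so an incompatible change is refused before any consumer sees it.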
Want the implementation details? Connector Builder shows how log readers stream events, Schema Evolution covers contract management, and Merge Cookbook demonstrates idempotent sink patterns.
How log-based CDC keeps order and consistency
Log tailers capture changes in commit order (per transaction). By streaming these events to a durable bus, you keep a consistent, replayable history that downstream systems can apply idempotently.
- Capture: Connector reads WAL/binlog entries in commit order, preserving per-key ordering.
- Transform: Enrich with metadata (op, LSN/SCN, before/after images) and publish to topics.
- Deliver: Topics are partitioned by key for ordering; consumers use MERGE (upsert) semantics to apply changes idempotently.
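The key-partitioning step can be sketched as follows. The `partition_for` helper is hypothetical and uses SHA-256 purely as a stable stand-in hash; a real broker client uses its own partitioner, so the partition numbers here carry no meaning beyond the demo.

```python
import hashlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (same key, same partition)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# Change events in commit order, as a connector would emit them.
events = [
    {"lsn": 1, "key": "order-1", "op": "c"},
    {"lsn": 2, "key": "order-2", "op": "c"},
    {"lsn": 3, "key": "order-1", "op": "u"},
    {"lsn": 4, "key": "order-2", "op": "d"},
]

partitions = defaultdict(list)
for event in events:
    partitions[partition_for(event["key"], 3)].append(event)
```

Because every event for a given key lands in one partition and is appended in commit order, a consumer reading that partition sees the key's changes in the order they were committed, even though there is no ordering guarantee across partitions.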
Key takeaways
- Log-based CDC minimizes source load and captures all operation types (inserts, updates, and deletes).
- Downstream systems decide retention (history vs latest state).
- Replay is built-in: rewind offsets to rebuild a sink.
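Replay only works if applying an event twice is harmless. The sketch below shows one way to get there, assuming each event carries a per-key monotonically increasing `lsn` and that the sink keeps a tombstone for deleted keys. The event shape and the `apply_event` and `live_rows` helpers are illustrative, not a specific tool's format.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]
Sink = Dict[str, Dict[str, Any]]


def apply_event(sink: Sink, event: Event) -> None:
    """MERGE-style apply: keep only the change with the highest LSN per key."""
    current = sink.get(event["key"])
    if current is not None and current["lsn"] >= event["lsn"]:
        return  # duplicate or out-of-date event, so applying it is a no-op
    row = None if event["op"] == "d" else event["after"]
    sink[event["key"]] = {"lsn": event["lsn"], "row": row}  # row=None is a tombstone


def live_rows(sink: Sink) -> Dict[str, Any]:
    return {key: entry["row"] for key, entry in sink.items() if entry["row"] is not None}


topic: List[Event] = [
    {"lsn": 1, "key": "order-1", "op": "c", "after": {"status": "new"}},
    {"lsn": 2, "key": "order-2", "op": "c", "after": {"status": "new"}},
    {"lsn": 3, "key": "order-1", "op": "u", "after": {"status": "paid"}},
    {"lsn": 4, "key": "order-2", "op": "d", "after": None},
]

sink: Sink = {}
for event in topic:
    apply_event(sink, event)
state_after_first_pass = live_rows(sink)

# Rewind to the start of the topic and replay into the same sink.
for event in topic:
    apply_event(sink, event)
state_after_replay = live_rows(sink)

# Rebuild a brand-new sink from the retained history.
rebuilt: Sink = {}
for event in topic:
    apply_event(rebuilt, event)
```

The tombstone matters: without it, replaying the earlier create for `order-2` after its delete would resurrect the row.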
Transactional outbox in practice
Pair an application table write with an outbox insert in the same transaction. A relay tails the outbox table and publishes events to a broker topic. Consumers upsert idempotently.
- Write: App writes business row + outbox row within the same database transaction.
- Publish: Relay emits each outbox event to the broker/topic reliably (at-least-once).
- Consume idempotently: Consumers upsert/dedupe so replays are safe, producing an “exactly-once” effect at the sink. Exactly-once semantics (EOS) are scoped per sink/system, not global.
Sample transaction and emitted event:

```sql
BEGIN;

-- Business write
INSERT INTO orders (id, status, total, updated_at)
VALUES (:order_id, :status, :total, NOW());

-- Outbox write in the same transaction, so both commit or neither does
INSERT INTO order_outbox (event_id, aggregate_id, aggregate_type, payload, created_at)
VALUES (gen_random_uuid(), :order_id, 'order',
        jsonb_build_object('op', 'c', 'status', :status, 'total', :total), NOW());

COMMIT;
```

```json
{
  "event_id": "b2e03f62-3d0f-4b0e-9a6a-a52b8cfe9c51",
  "op": "c",
  "entity": {
    "id": "order-458",
    "status": "shipped",
    "total": 125.99
  },
  "lsn": "898799:4",
  "ts_ms": 1700000123456
}
```
Deep dive: Exactly-Once & the Outbox pattern.
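The three outbox steps can be exercised end to end with `sqlite3` standing in for the application database. This is a minimal sketch under several assumptions: the `seq` and `published` columns, the polling `relay` function, and the in-memory `seen_event_ids` set are additions for the demo and are not part of the sample table above. A production relay would typically tail the table or its log and persist dedupe state.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, total REAL);
CREATE TABLE order_outbox (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id TEXT UNIQUE,
    aggregate_id TEXT,
    payload TEXT,
    published INTEGER DEFAULT 0
);
""")


def place_order(order_id, status, total):
    # `with conn` commits both inserts together or rolls both back.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, status, total))
        payload = json.dumps({"op": "c", "status": status, "total": total})
        conn.execute(
            "INSERT INTO order_outbox (event_id, aggregate_id, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), order_id, payload),
        )


def relay(publish):
    """Publish pending rows in order; mark each one only after publish returns."""
    rows = conn.execute(
        "SELECT seq, event_id, aggregate_id, payload FROM order_outbox "
        "WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_id, aggregate_id, payload in rows:
        publish({"event_id": event_id, "id": aggregate_id, **json.loads(payload)})
        with conn:
            conn.execute("UPDATE order_outbox SET published = 1 WHERE seq = ?", (seq,))


seen_event_ids = set()
sink = {}


def consume(event):
    if event["event_id"] in seen_event_ids:
        return  # duplicate delivery: already applied
    seen_event_ids.add(event["event_id"])
    sink[event["id"]] = {"status": event["status"], "total": event["total"]}


place_order("order-458", "shipped", 125.99)
delivered = []
relay(delivered.append)
for event in delivered + delivered:  # simulate an at-least-once redelivery
    consume(event)
```

If the relay crashes after `publish` but before the `UPDATE`, the row is published again on restart. That is the at-least-once behavior the consumer-side dedupe is there to absorb.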
Operational gotchas (day-one checks)
- [Must] Log retention & bloat: Logical slots (PG) or paused connectors can block log truncation; size WAL/binlog/T-log appropriately and alert on backlog. See: Troubleshooting playbook.
- [Must] Privileges & network: Connectors need replication permissions and stable network paths (no NAT timeouts) to avoid stalls. See: Deployment checklist.
- [Should] DDL changes: Schema changes may need registry rules and phased rollouts to avoid breaking consumers. See: Schema change guide.
- [Must] Sinks are the hard part: Upsert/delete semantics and idempotency in the warehouse/search system determine data correctness under replay. See: Merge & upsert patterns.
- [Nice to have] Instrument CDC-specific metrics (lag, slot depth, consumer replay time) so you can rehearse incident response without touching production. See: Observability tooling.
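For the PostgreSQL slot-backlog check, one metric is the byte distance between the server's current WAL position and the position a slot has confirmed. The sketch below does that arithmetic on LSN strings, which PostgreSQL prints as two hexadecimal numbers separated by a slash (high 32 bits, then low 32 bits). The function names, sample LSNs, and the 8 MiB alert threshold are assumptions; in practice the inputs would come from `pg_current_wal_lsn()` and the `confirmed_flush_lsn` column of `pg_replication_slots`.

```python
def parse_pg_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def slot_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """WAL bytes the server must retain because the slot has not confirmed them."""
    return parse_pg_lsn(current_wal_lsn) - parse_pg_lsn(confirmed_flush_lsn)


ALERT_THRESHOLD_BYTES = 8 * 1024 * 1024  # hypothetical threshold for this demo

lag = slot_lag_bytes("0/3000000", "0/2000000")
should_alert = lag > ALERT_THRESHOLD_BYTES
```

PostgreSQL can also compute the same difference server-side with `pg_wal_lsn_diff`; doing it client-side is mainly useful when the two positions are scraped separately.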
The CDC Tooling Ecosystem
Picking CDC tooling is a trade-off across capture method (prefer log-based), ops model (managed vs self-managed), source/target coverage & versions, and exactly-once behavior at the sink (idempotent upserts + replay).
CDC platforms
Debezium + Kafka Connect
Open-source log-based CDC connectors. Flexible; you operate Kafka/Connect or use managed Kafka.
Matillion (Data Loader / Designer)
Wizard-driven CDC into cloud warehouses plus orchestration/transform in one place.
Fivetran
Managed connectors with log-based CDC options; emphasizes quick setup and hands-off operations.
Precisely (Connect)
Enterprise replication including mainframe/legacy into modern targets.
Qlik Replicate (Attunity)
UI-driven enterprise CDC across a wide range of sources/targets.
Oracle GoldenGate
Oracle’s CDC/replication stack (supports several non-Oracle endpoints) for mission-critical use.
AWS DMS
Managed migrations and ongoing CDC into AWS targets (and beyond via connectors).
Google Cloud Datastream
Serverless CDC for MySQL/Postgres/Oracle streaming into BigQuery/Cloud Storage.
Confluent Commercial Connectors
Kafka Connect ecosystem with additional commercial CDC connectors like Oracle.
StreamSets
Streaming pipelines with CDC connectors into warehouses/lakes.
Striim
CDC + in-flight SQL-like processing for real-time ops analytics.
IBM Data Replication
Log-based CDC for Db2 and mixed mainframe/distributed estates.
SAP SLT
Native trigger-based replication (database triggers plus logging tables) from SAP ECC/S/4HANA to downstream stores.
Hevo Data
Managed ELT with CDC support into cloud targets.
Airbyte
Open-source ELT; CDC varies by connector (check docs for log-based maturity per source).
Test Your Understanding
Check your knowledge of CDC fundamentals with these questions.
What is the primary advantage of Change Data Capture (CDC) over traditional batch ETL?
CDC captures only the changed data and streams it in near real-time, avoiding expensive full table scans and providing much lower latency compared to nightly batch ETL. This makes it ideal for real-time analytics and event-driven architectures.
Which CDC method directly reads the database's transaction log (WAL)?
Log-based CDC reads directly from the database's write-ahead log (WAL) or transaction log. This approach has minimal impact on the source database and captures all changes without modifying application code or database schema.
What is the Transactional Outbox pattern?
The Transactional Outbox pattern is an application-level approach where changes are written to a special outbox table within the same transaction as the business data. A separate process then reads and publishes these events. This ensures atomicity between the business operation and event publication.
Which component in a CDC pipeline is responsible for converting database-specific log formats into a standard event format?
The source connector or capture agent reads the database logs and transforms them into standardized events (often with a schema like before/after images). Tools like Debezium serve this role, producing consistent event structures regardless of the source database type.
What is a common challenge when implementing CDC that relates to the initial state of data?
One of the key challenges in CDC is performing an initial consistent snapshot of existing data before streaming live changes. This snapshot must be coordinated with the log position to avoid gaps or duplicates, a process known as the 'snapshot-to-stream handoff.'
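The handoff can be sketched in a few lines, assuming the snapshot is taken at a known log position and every streamed event carries its own position. The `bootstrap` function and the event shape are hypothetical; real connectors coordinate this with database-specific mechanisms.

```python
from typing import Any, Dict, List


def bootstrap(
    snapshot_rows: Dict[str, Any], snapshot_lsn: int, stream: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Load the snapshot, then apply only events committed after it was taken."""
    sink = dict(snapshot_rows)
    for event in stream:
        if event["lsn"] <= snapshot_lsn:
            continue  # already reflected in the snapshot, so skip to avoid duplicates
        if event["op"] == "d":
            sink.pop(event["key"], None)
        else:
            sink[event["key"]] = event["after"]
    return sink


# Snapshot taken at LSN 2: it already includes the effects of events 1 and 2.
snapshot = {"order-1": {"status": "paid"}}
stream = [
    {"lsn": 1, "key": "order-1", "op": "c", "after": {"status": "new"}},
    {"lsn": 2, "key": "order-1", "op": "u", "after": {"status": "paid"}},
    {"lsn": 3, "key": "order-1", "op": "u", "after": {"status": "shipped"}},
    {"lsn": 4, "key": "order-2", "op": "c", "after": {"status": "new"}},
]

target = bootstrap(snapshot, snapshot_lsn=2, stream=stream)
```

Starting the stream before the snapshot position and filtering on `lsn` avoids gaps; the filter itself avoids reapplying changes the snapshot already contains.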