When duplicates land, the last change wins rule keeps results stable. Do not rely on arrival order; rely on source
timestamps or change versions.
Capture merge SQL in version control and tag it with the connector
version it expects. This makes rollbacks trivial when envelope changes
introduce new columns.
SQL templates you can adapt
Snowflake MERGE
MERGE INTO analytics.orders tgt
USING staging.orders_src src
ON tgt.order_id = src.order_id
WHEN MATCHED AND src.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET
amount = src.amount,
status = src.status,
updated_at = src.ts_ms
WHEN NOT MATCHED AND src.op IN ('c', 'u') THEN
INSERT (order_id, amount, status, updated_at)
VALUES (src.order_id, src.amount, src.status, src.ts_ms);
BigQuery change-applier
CREATE OR REPLACE TABLE analytics.orders AS
SELECT AS VALUE latest.*
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY ts_ms DESC) AS rnk
FROM staging.orders_src
)
WHERE rnk = 1 AND op != 'd';
dbt incremental model
{{
config(
materialized='incremental',
unique_key='order_id'
)
}}
with ranked as (
select *,
row_number() over (
partition by order_id
order by ts_ms desc
) as rnk
from {{ source('cdc', 'orders') }}
)
select * except(rnk)
from ranked
where rnk = 1 and op != 'd';
Keep history when you need it
Add valid_from and valid_to columns around each change.
Close the previous version when a new update arrives.
Expose current rows via a view that filters on valid_to IS NULL.
This pattern powers SCD Type 2 tables without losing the benefits of CDC
freshness.
Handle late-arriving and out-of-order events
Track a watermark timestamp in your staging table and ignore events older than your replay budget.
Store raw envelopes for N days so you can reprocess with updated merge logic.
Document how to reconcile when the source database is manually patched (backfills, deletes).
Late arrivals are usually replayed by ops teams—make the procedure idempotent so you can run it twice without fear.
Operational guardrails
Validate inputs
Reject events that miss primary keys or required columns.
Track schema version and fail fast on incompatible changes.
Observe merge latency
Alert when staging tables age beyond the expected SLA.
Keep a backlog gauge so you know when to scale compute.
Disaster recovery
Retain staging data long enough to replay the last full window.
Store checkpoints (offset, batch id, timestamps) beside the target table.
Combine these runbooks with automated smoke tests that upsert and delete
a synthetic record daily. The job should fail if materialization drifts.
Data quality checks before publishing
Compare row counts between source and target using time-bounded windows.
Verify key uniqueness and detect primary-key churn with a duplicate audit query.
Validate soft deletes by ensuring tombstones propagate to downstream marts within SLA.
Embed these checks in orchestration (Airflow, Dagster, dbt) so they run
on every deploy and after infrastructure maintenance.
Verification playbooks
Pre-prod regression
Run fixtures that upsert, delete, and reinsert entities to validate merge branches.
Backfill a 1-hour slice from production CDC into staging and diff row-level hashes.
Exercise late-arrival and out-of-order scenarios before promoting SQL changes.
Production spot checks
Schedule canary merges on synthetic ids with deterministic payloads.
Export change event samples weekly and replay them against a shadow environment.
Rotate on-call owners through a monthly verification drill to keep docs fresh.
Treat verification as part of the deployment pipeline—pair SQL reviews with a matching verification checklist so no
merge ships without evidence that the materialized view still behaves.
Materialization readiness scorecard
Align the data platform, analytics, and ops teams on what “ready for production” means before turning on the CDC
materialization job.
0 of 4 ready (0%)
Checklist before promoting to production
Workstream
Ready when…
If not, take this step
Warehouse tables exist with primary keys, clustering, and partitioning that match the envelope design.
Provision tables with the final schema and run dry-run merges against staging data.
Incremental jobs have retries, alerting, and offset checkpointing wired up.
Add idempotent retries, persist offsets with the job metadata, and document restart steps.
Snapshot-to-stream cutover plan is rehearsed with clear checkpoints and rollback triggers.
Run a timed rehearsal in staging, log each checkpoint, and confirm revert instructions are executable.
Analytics owners sign off on data quality thresholds and downstream dashboards include freshness indicators.
Review metrics with stakeholders, add freshness badges to dashboards, and capture acceptance criteria in a doc.
All capabilities are ready. Toggle to see everything or reset to start over.
Materialization Knowledge Check
Test your understanding of upsert patterns and materialization strategies.
0
/
5answered
Q1
What is the purpose of a MERGE statement in CDC materialization?
✓
Correct!
The MERGE statement (or UPSERT) allows you to apply CDC events atomically: if a row with the key exists, update it; if not, insert it. Many platforms also support DELETE operations within MERGE. This single operation ensures consistency and simplifies sink logic.
✗
Not quite.
Review the correct answer and explanation.
Q2
What is the difference between a 'current state' table and a 'history' table in CDC materialization?
✓
Correct!
A current state (or snapshot) table holds only the most recent version of each row, updated via upserts and deletes. A history table appends every change as a new row, often with timestamps or sequence numbers, enabling time-travel queries and full audit trails.
✗
Not quite.
Review the correct answer and explanation.
Q3
Why is idempotency important when materializing CDC events?
✓
Correct!
With at-least-once delivery, events can be processed multiple times due to retries or replays. Idempotent operations (like MERGE or INSERT ... ON CONFLICT) ensure that reprocessing the same event doesn't corrupt data or cause duplicates, maintaining data integrity.
✗
Not quite.
Review the correct answer and explanation.
Q4
What is log compaction in Kafka, and how does it relate to CDC materialization?
✓
Correct!
Log compaction in Kafka retains only the most recent message for each key, discarding older versions. This creates a changelog table that mirrors the current state of the source. Consumers can rebuild the full current state by replaying the compacted log, making it ideal for CDC materialization.
✗
Not quite.
Review the correct answer and explanation.
Q5
What challenge does handling 'late-arriving' events present in CDC materialization?
✓
Correct!
Late-arriving events occur when network delays, retries, or multi-partition ordering cause older events to arrive after newer ones. Materialization logic must handle this by checking timestamps or sequence numbers, potentially discarding stale updates or triggering reconciliation to maintain correctness.
✗
Not quite.
Review the correct answer and explanation.
0/5correct
Further resources
Event envelope for envelope field semantics feeding your merges.