Intermediate

Materialization 101

Master the merge logic that turns change events into durable, queryable tables across warehouses, lakes, and caches.

Compare Patterns Ship Quality Gates

Materialization readiness tracker

Track how close each readiness checklist is to production-ready materialization in a single view.

0 of 0 readiness checks complete (0%)

Choose the right pattern for the job

Materialization strategies compared
Pattern	How it works	Best for
Mutable table (upsert/delete)	Apply change events directly to a target table using merge logic.	Serving layer caches, feature stores, operational analytics.
Streaming merge	Continuous ingestion job (dbt incremental, Snowpipe, Flink) performs row-level MERGE as events arrive.	Low-latency marts where freshness is minutes, not hours.
Append-only with views	Store each event as a row, surface latest state via window functions.	Auditable fact tables, debugging, ad-hoc comparisons.
Incremental batch	Land CDC events in stages, then run scheduled merge statements.	Warehouses that favor batch execution or have strict cost controls.

Build idempotent merge logic

Stage events in a temporary table keyed by primary key and log position.
Deduplicate by picking the highest ts_ms (or change version) per key.
Merge updates and inserts using the latest event per key.
Apply deletes as tombstones that remove or flag the row.
Record the processed offset so restarts can resume safely.
Publish metrics: rows inserted, updated, deleted, skipped duplicates.

When duplicates land, the last change wins rule keeps results stable. Do not rely on arrival order; rely on source timestamps or change versions.

Capture merge SQL in version control and tag it with the connector version it expects. This makes rollbacks trivial when envelope changes introduce new columns.

SQL templates you can adapt

Snowflake MERGE

MERGE INTO analytics.orders tgt
USING staging.orders_src src
  ON tgt.order_id = src.order_id
WHEN MATCHED AND src.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET
  amount = src.amount,
  status = src.status,
  updated_at = src.ts_ms
WHEN NOT MATCHED AND src.op IN ('c', 'u') THEN
  INSERT (order_id, amount, status, updated_at)
  VALUES (src.order_id, src.amount, src.status, src.ts_ms);

BigQuery change-applier

CREATE OR REPLACE TABLE analytics.orders AS
SELECT AS VALUE latest.*
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY ts_ms DESC) AS rnk
  FROM staging.orders_src
)
WHERE rnk = 1 AND op != 'd';

dbt incremental model

{{
  config(
    materialized='incremental',
    unique_key='order_id'
  )
}}

with ranked as (
  select *,
         row_number() over (
           partition by order_id
           order by ts_ms desc
         ) as rnk
  from {{ source('cdc', 'orders') }}
)

select * except(rnk)
from ranked
where rnk = 1 and op != 'd';

Keep history when you need it

Add valid_from and valid_to columns around each change.
Close the previous version when a new update arrives.
Expose current rows via a view that filters on valid_to IS NULL.

This pattern powers SCD Type 2 tables without losing the benefits of CDC freshness.

Handle late-arriving and out-of-order events

Track a watermark timestamp in your staging table and ignore events older than your replay budget.
Store raw envelopes for N days so you can reprocess with updated merge logic.
Document how to reconcile when the source database is manually patched (backfills, deletes).

Late arrivals are usually replayed by ops teams—make the procedure idempotent so you can run it twice without fear.

Operational guardrails

Validate inputs

Reject events that miss primary keys or required columns.
Track schema version and fail fast on incompatible changes.

Observe merge latency

Alert when staging tables age beyond the expected SLA.
Keep a backlog gauge so you know when to scale compute.

Disaster recovery

Retain staging data long enough to replay the last full window.
Store checkpoints (offset, batch id, timestamps) beside the target table.

Combine these runbooks with automated smoke tests that upsert and delete a synthetic record daily. The job should fail if materialization drifts.

Data quality checks before publishing

Compare row counts between source and target using time-bounded windows.
Verify key uniqueness and detect primary-key churn with a duplicate audit query.
Validate soft deletes by ensuring tombstones propagate to downstream marts within SLA.

Embed these checks in orchestration (Airflow, Dagster, dbt) so they run on every deploy and after infrastructure maintenance.

Verification playbooks

Pre-prod regression

Run fixtures that upsert, delete, and reinsert entities to validate merge branches.
Backfill a 1-hour slice from production CDC into staging and diff row-level hashes.
Exercise late-arrival and out-of-order scenarios before promoting SQL changes.

Production spot checks

Schedule canary merges on synthetic ids with deterministic payloads.
Export change event samples weekly and replay them against a shadow environment.
Rotate on-call owners through a monthly verification drill to keep docs fresh.

Treat verification as part of the deployment pipeline—pair SQL reviews with a matching verification checklist so no merge ships without evidence that the materialized view still behaves.

Materialization readiness scorecard

Align the data platform, analytics, and ops teams on what “ready for production” means before turning on the CDC materialization job.

0 of 4 ready (0%)

Checklist before promoting to production
Workstream	Ready when…	If not, take this step
Target schema	Warehouse tables exist with primary keys, clustering, and partitioning that match the envelope design.	Provision tables with the final schema and run dry-run merges against staging data.
Merge orchestration	Incremental jobs have retries, alerting, and offset checkpointing wired up.	Add idempotent retries, persist offsets with the job metadata, and document restart steps.
Backfill strategy	Snapshot-to-stream cutover plan is rehearsed with clear checkpoints and rollback triggers.	Run a timed rehearsal in staging, log each checkpoint, and confirm revert instructions are executable.
Stakeholder buy-in	Analytics owners sign off on data quality thresholds and downstream dashboards include freshness indicators.	Review metrics with stakeholders, add freshness badges to dashboards, and capture acceptance criteria in a doc.

Further resources

Event envelope for envelope field semantics feeding your merges.
Offsets & replays to keep staging tables consistent during rewinds.
Observability basics to monitor merge throughput, freshness, and failure rates.

Progress 0% No progress yet

Progress is stored locally in this browser.

Materialization 101

Materialization readiness tracker

Choose the right pattern for the job

Build idempotent merge logic

SQL templates you can adapt

Keep history when you need it

Handle late-arriving and out-of-order events

Operational guardrails

Data quality checks before publishing

Verification playbooks

Pre-prod regression

Production spot checks

Materialization readiness scorecard

Materialization Knowledge Check

What is the purpose of a MERGE statement in CDC materialization?

What is the difference between a 'current state' table and a 'history' table in CDC materialization?

Why is idempotency important when materializing CDC events?

What is log compaction in Kafka, and how does it relate to CDC materialization?

What challenge does handling 'late-arriving' events present in CDC materialization?

Further resources

Let’s Talk CDC Interactive Dashboard