Intermediate

Materialization 101

Master the merge logic that turns change events into durable, queryable tables across warehouses, lakes, and caches.

Materialization readiness tracker

Track how close each readiness checklist is to production-ready materialization in a single view.

0 of 0 readiness checks complete (0%)

    Choose the right pattern for the job

    Materialization strategies compared
    Pattern How it works Best for
    Mutable table (upsert/delete) Apply change events directly to a target table using merge logic. Serving layer caches, feature stores, operational analytics.
    Streaming merge Continuous ingestion job (dbt incremental, Snowpipe, Flink) performs row-level MERGE as events arrive. Low-latency marts where freshness is minutes, not hours.
    Append-only with views Store each event as a row, surface latest state via window functions. Auditable fact tables, debugging, ad-hoc comparisons.
    Incremental batch Land CDC events in stages, then run scheduled merge statements. Warehouses that favor batch execution or have strict cost controls.

    Build idempotent merge logic

    1. Stage events in a temporary table keyed by primary key and log position.
    2. Deduplicate by picking the highest ts_ms (or change version) per key.
    3. Merge updates and inserts using the latest event per key.
    4. Apply deletes as tombstones that remove or flag the row.
    5. Record the processed offset so restarts can resume safely.
    6. Publish metrics: rows inserted, updated, deleted, skipped duplicates.

    When duplicates land, the last change wins rule keeps results stable. Do not rely on arrival order; rely on source timestamps or change versions.

    Capture merge SQL in version control and tag it with the connector version it expects. This makes rollbacks trivial when envelope changes introduce new columns.

    SQL templates you can adapt

    Snowflake MERGE
    MERGE INTO analytics.orders tgt
    USING staging.orders_src src
      ON tgt.order_id = src.order_id
    WHEN MATCHED AND src.op = 'd' THEN DELETE
    WHEN MATCHED THEN UPDATE SET
      amount = src.amount,
      status = src.status,
      updated_at = src.ts_ms
    WHEN NOT MATCHED AND src.op IN ('c', 'u') THEN
      INSERT (order_id, amount, status, updated_at)
      VALUES (src.order_id, src.amount, src.status, src.ts_ms);
    
    BigQuery change-applier
    CREATE OR REPLACE TABLE analytics.orders AS
    SELECT AS VALUE latest.*
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY ts_ms DESC) AS rnk
      FROM staging.orders_src
    )
    WHERE rnk = 1 AND op != 'd';
    
    dbt incremental model
    {{
      config(
        materialized='incremental',
        unique_key='order_id'
      )
    }}
    
    with ranked as (
      select *,
             row_number() over (
               partition by order_id
               order by ts_ms desc
             ) as rnk
      from {{ source('cdc', 'orders') }}
    )
    
    select * except(rnk)
    from ranked
    where rnk = 1 and op != 'd';
    

    Keep history when you need it

    • Add valid_from and valid_to columns around each change.
    • Close the previous version when a new update arrives.
    • Expose current rows via a view that filters on valid_to IS NULL.

    This pattern powers SCD Type 2 tables without losing the benefits of CDC freshness.

    Handle late-arriving and out-of-order events

    • Track a watermark timestamp in your staging table and ignore events older than your replay budget.
    • Store raw envelopes for N days so you can reprocess with updated merge logic.
    • Document how to reconcile when the source database is manually patched (backfills, deletes).

    Late arrivals are usually replayed by ops teams—make the procedure idempotent so you can run it twice without fear.

    Operational guardrails

    Validate inputs
    • Reject events that miss primary keys or required columns.
    • Track schema version and fail fast on incompatible changes.
    Observe merge latency
    • Alert when staging tables age beyond the expected SLA.
    • Keep a backlog gauge so you know when to scale compute.
    Disaster recovery
    • Retain staging data long enough to replay the last full window.
    • Store checkpoints (offset, batch id, timestamps) beside the target table.

    Combine these runbooks with automated smoke tests that upsert and delete a synthetic record daily. The job should fail if materialization drifts.

    Data quality checks before publishing

    1. Compare row counts between source and target using time-bounded windows.
    2. Verify key uniqueness and detect primary-key churn with a duplicate audit query.
    3. Validate soft deletes by ensuring tombstones propagate to downstream marts within SLA.

    Embed these checks in orchestration (Airflow, Dagster, dbt) so they run on every deploy and after infrastructure maintenance.

    Verification playbooks

    Pre-prod regression

    • Run fixtures that upsert, delete, and reinsert entities to validate merge branches.
    • Backfill a 1-hour slice from production CDC into staging and diff row-level hashes.
    • Exercise late-arrival and out-of-order scenarios before promoting SQL changes.

    Production spot checks

    • Schedule canary merges on synthetic ids with deterministic payloads.
    • Export change event samples weekly and replay them against a shadow environment.
    • Rotate on-call owners through a monthly verification drill to keep docs fresh.

    Treat verification as part of the deployment pipeline—pair SQL reviews with a matching verification checklist so no merge ships without evidence that the materialized view still behaves.

    Materialization readiness scorecard

    Align the data platform, analytics, and ops teams on what “ready for production” means before turning on the CDC materialization job.

    0 of 4 ready (0%)

    Checklist before promoting to production
    Workstream Ready when… If not, take this step
    Warehouse tables exist with primary keys, clustering, and partitioning that match the envelope design. Provision tables with the final schema and run dry-run merges against staging data.
    Incremental jobs have retries, alerting, and offset checkpointing wired up. Add idempotent retries, persist offsets with the job metadata, and document restart steps.
    Snapshot-to-stream cutover plan is rehearsed with clear checkpoints and rollback triggers. Run a timed rehearsal in staging, log each checkpoint, and confirm revert instructions are executable.
    Analytics owners sign off on data quality thresholds and downstream dashboards include freshness indicators. Review metrics with stakeholders, add freshness badges to dashboards, and capture acceptance criteria in a doc.

    Materialization Knowledge Check

    Test your understanding of upsert patterns and materialization strategies.

    Q1

    What is the purpose of a MERGE statement in CDC materialization?

    Q2

    What is the difference between a 'current state' table and a 'history' table in CDC materialization?

    Q3

    Why is idempotency important when materializing CDC events?

    Q4

    What is log compaction in Kafka, and how does it relate to CDC materialization?

    Q5

    What challenge does handling 'late-arriving' events present in CDC materialization?

    0/5 correct

    Further resources

    Progress 0% No progress yet
    Progress is stored locally in this browser.