Real-World CDC Case Study | CDC: The Missing Manual

Part 1: The Business Context

Company Profile

Industry: E-commerce (fashion & accessories)
Scale: 5M orders/year, 2M active customers, 50K SKUs
Tech Stack: PostgreSQL (OLTP), Snowflake (warehouse), Tableau (BI)
Team: 8 data engineers, 20 analysts, 200 total engineering

The Pain Points

ShopStream's analytics ran on a nightly batch ETL pipeline. Every night at 2 AM, a series of complex SQL queries would extract data from the production PostgreSQL database, transform it, and load it into Snowflake. This architecture had served them well during early growth, but was now causing significant problems:

❌ Decision Latency (24+ Hours)

The marketing team couldn't respond to real-time campaign performance. A flash sale that underperformed wouldn't be detected until the next day, missing the critical window for adjustments. Customer support couldn't see real-time order status, leading to frustrated customers and increased call volume.

❌ Production Database Load

The ETL queries scanned entire tables every night, causing noticeable performance degradation. During peak shopping seasons (Black Friday, holidays), the ETL window would extend into business hours, causing customer-facing slowdowns. The team had to carefully schedule maintenance windows around the ETL.

❌ Operational Brittleness

Failed ETL jobs meant analysts started their day with stale data and no clear ETA for fixes. Schema changes required coordinated updates across extraction scripts, transformation logic, and target tables. The on-call rotation dreaded weekends, when failures might not be noticed until Monday.

❌ Growing Complexity

As the business added features (subscriptions, gift cards, international shipping), the ETL became increasingly complex. The team spent more time maintaining and debugging ETL code than building new analytics capabilities. Technical debt was accumulating faster than they could pay it down.

The Breaking Point

The decision to investigate CDC came after a Black Friday incident. A pricing error on a high-margin product went undetected for 8 hours because the dashboard showing "revenue per category" wouldn't refresh until the next morning's ETL run. By the time the team noticed, the company had lost over $150K in margin.

"We realized we were flying blind. Our data was always yesterday's story, and in e-commerce, yesterday might as well be last year." — Sarah Chen, VP of Data Engineering, ShopStream

Business Requirements

The leadership team set clear goals for a new data architecture:

Sub-5-minute latency: Key metrics must reflect reality within 5 minutes
Zero impact to OLTP: No additional load on the production database
Incremental rollout: Migrate table-by-table, not a "big bang" cutover
Cost-neutral: New infrastructure costs offset by retiring legacy ETL infrastructure
6-month timeline: Full migration complete within two quarters

Part 2: Solution Evaluation

The Requirements

The data engineering team evaluated solutions against a clear set of technical requirements:

Log-based capture: Must read from PostgreSQL WAL, not polling
Exactly-once delivery: No duplicate rows in Snowflake from replays
Schema evolution support: Handle ADD COLUMN, data type changes gracefully
Snapshot capability: Backfill historical data without impacting production
Operational maturity: Monitoring, alerting, and runbooks available

Tool Comparison

The team evaluated three major approaches, considering both technical fit and total cost of ownership:

Option 1: Debezium (Open Source)

Architecture: Self-hosted Kafka + Kafka Connect + Debezium PostgreSQL connector + Snowflake Sink

Pros:

No licensing costs (Apache 2.0)
Full control over configuration and scaling
Battle-tested for PostgreSQL CDC
Large community and extensive documentation
Can handle complex transformations in Kafka Streams

Cons:

Requires Kafka expertise on the team
Operational burden: patching, upgrades, capacity planning
Need to build monitoring and alerting from scratch
Estimated 2 FTEs for ongoing operations

Estimated Cost: $15K/month (infrastructure) + 2 FTEs (~$35K/month fully loaded) = $50K/month

Option 2: Fivetran (Managed SaaS)

Architecture: Fivetran PostgreSQL connector → Fivetran cloud → Snowflake

Pros:

Zero infrastructure management
Fast time-to-value (setup in hours)
Built-in monitoring and alerting
Automatic schema drift handling
24/7 support from vendor

Cons:

Pricing based on Monthly Active Rows (MAR)
Less control over data transformations
Vendor lock-in concerns
Limited customization options
Data flows through third-party infrastructure

Estimated Cost: $25K/month (based on projected MAR) + 0.5 FTE (~$9K/month) = $34K/month

Option 3: AWS DMS (Cloud-Native)

Architecture: DMS PostgreSQL source → DMS replication instance → Snowflake via S3 staging

Pros:

Fully managed by AWS
Native integration with existing AWS infrastructure
Granular cost control (pay per instance hour)
CloudWatch integration for monitoring

Cons:

Less mature CDC support than Debezium
Limited transformation capabilities
Occasional consistency issues reported in community
Requires S3 staging bucket for Snowflake (added complexity)
Snowflake connector requires custom integration via S3

Estimated Cost: $12K/month (DMS + S3 + data transfer) + 1 FTE (~$18K/month) = $30K/month

The Decision: Debezium + Kafka

After careful evaluation, the team chose Debezium on Kafka, despite the higher operational complexity. The deciding factors were:

Strategic Investment: Building Kafka expertise would pay dividends beyond CDC. The team had already identified multiple use cases for event streaming (inventory updates, order notifications, fraud detection).
Flexibility: Open-source meant no limits on customization. They could handle complex scenarios like multi-tenancy and custom transformations.
Cost Trajectory: While initially more expensive, costs would remain predictable as data volumes grew, unlike MAR-based pricing.
Team Skills: Two senior engineers had prior Kafka experience, reducing the ramp-up risk.

"We knew Debezium would be harder upfront, but we were building a capability, not just solving a point problem. Kafka would become our event backbone." — Marcus Rodriguez, Principal Engineer, ShopStream

Part 3: Solution Architecture

The End-to-End Pipeline

The final architecture consists of five major components, each playing a specific role in the data flow:

graph LR
    A[PostgreSQL OLTP
Orders, Customers, Products] -->|WAL Stream| B[Debezium Connector
Kafka Connect]
    B -->|Change Events
JSON/Avro| C[Kafka Topics
orders.public.orders
orders.public.customers]
    C -->|Stream| D[Snowflake Sink
Kafka Connect]
    C -->|Real-time| E[Kafka Streams
Aggregations]
    D -->|MERGE/UPSERT| F[Snowflake
Analytics Tables]
    E -->|Live Metrics| G[Redis Cache
Dashboard API]
    F -->|Historical Analysis| H[Tableau
BI Dashboards]
    G -->|Current State| H

    style A fill:#e1f5ff
    style C fill:#fff3cd
    style F fill:#d4edda
    style G fill:#f8d7da

Component Details

1. Source: PostgreSQL (RDS)

The production database remained unchanged, except for enabling logical replication:

-- Enable logical replication (requires restart)
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;

-- Create replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- Grant necessary permissions
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium_user;
GRANT USAGE ON SCHEMA public TO debezium_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO debezium_user;

Key Decision: They used RDS PostgreSQL (not Aurora) because Aurora's logical replication had known limitations with high-throughput workloads at the time.

2. Capture: Debezium PostgreSQL Connector

The Debezium connector reads the PostgreSQL WAL and publishes change events to Kafka:

{
  "name": "orders-postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "shopstream-prod.abcdef.us-east-1.rds.amazonaws.com",
    "database.port": "5432",
    "database.user": "debezium_user",
    "database.password": "${file:/secrets/db-password.txt:password}",
    "database.dbname": "orders",
    "topic.prefix": "orders",
    "table.include.list": "public.orders,public.customers,public.line_items,public.products",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_publication",
    "plugin.name": "pgoutput",
    "snapshot.mode": "initial",
    "snapshot.locking.mode": "none",
    "heartbeat.interval.ms": "30000",
    "decimal.handling.mode": "precise",
    "time.precision.mode": "adaptive",
    "tombstones.on.delete": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.unwrap.delete.handling.mode": "rewrite"
  }
}

Key Decisions:

snapshot.locking.mode: none prevents holding locks during initial snapshot
plugin.name: pgoutput uses PostgreSQL's native logical decoding (no external plugin needed)
tombstones.on.delete: true ensures deletes are properly propagated
Avro with Schema Registry for type safety and schema evolution

3. Stream: Kafka (MSK)

They chose AWS Managed Streaming for Kafka (MSK) to reduce operational burden:

Cluster Size: 3 brokers (m5.large) across 3 AZs
Storage: 1TB EBS per broker (gp3)
Topic Config:
- 3 partitions per topic (orders volume: ~2K events/sec peak)
- Replication factor: 3 (for durability)
- Retention: 7 days (compliance requirement)
- Compression: lz4 (balance of speed and ratio)

Topic Naming Convention: {database}.{schema}.{table}
Example: orders.public.orders, orders.public.customers

4. Sink: Snowflake Connector

The Snowflake Sink connector consumes from Kafka and applies changes to Snowflake:

{
  "name": "snowflake-sink",
  "config": {
    "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "topics": "orders.public.orders,orders.public.customers,orders.public.line_items,orders.public.products",
    "snowflake.url.name": "shopstream.snowflakecomputing.com",
    "snowflake.user.name": "kafka_loader",
    "snowflake.private.key": "${file:/secrets/snowflake-key.pem:key}",
    "snowflake.database.name": "ANALYTICS",
    "snowflake.schema.name": "ORDERS_CDC",
    "snowflake.role.name": "KAFKA_LOADER_ROLE",
    "buffer.count.records": "10000",
    "buffer.size.bytes": "5000000",
    "buffer.flush.time": "60",
    "snowflake.metadata.createtime": "false",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "snowflake.enable.schematization": "true",
    "snowflake.topic2table.map": "orders.public.orders:ORDERS,orders.public.customers:CUSTOMERS"
  }
}

How it works: The connector stages micro-batches to S3, then loads them into Snowflake using COPY INTO. Snowflake then uses a stored procedure to MERGE staged records into the target tables.

5. Transformation: Kafka Streams (Optional)

For real-time aggregations, they built a Kafka Streams application:

// Real-time revenue by category (5-minute tumbling window)
KStream<String, Order> orders = builder.stream("orders.public.orders");
KStream<String, LineItem> lineItems = builder.stream("orders.public.line_items");

// Join orders with line items, aggregate by product category
KTable<Windowed<String>, RevenueMetric> revenueByCategory = orders
    .join(lineItems, 
        (order, lineItem) -> new OrderLineItem(order, lineItem),
        JoinWindows.of(Duration.ofMinutes(5)))
    .groupBy((key, orderLineItem) -> orderLineItem.getCategory())
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .aggregate(
        RevenueMetric::new,
        (key, orderLineItem, agg) -> agg.add(orderLineItem.getTotal()),
        Materialized.with(Serdes.String(), new RevenueMetricSerde())
    );

// Expose via interactive queries for dashboard API
revenueByCategory.toStream()
    .to("revenue-by-category-windowed", Produced.with(WindowedSerdes.timeWindowedSerdeFrom(String.class), new RevenueMetricSerde()));

Part 4: Implementation Journey

Phase 1: Proof of Concept (2 Weeks)

The team started with a minimal viable pipeline to validate the approach:

Goal: Replicate a single table (orders) to Snowflake
Environment: Staging database, single-node Kafka (Docker Compose)
Success Criteria:
- Initial snapshot completes without locking production
- Change events appear in Snowflake within 60 seconds
- Schema change (ADD COLUMN) handled gracefully

Outcome: ✅ POC successful. Latency achieved: 15-30 seconds average.

Phase 2: Production Infrastructure (4 Weeks)

With POC validated, they built production-grade infrastructure:

Week 1-2: Kafka Setup

Provision MSK cluster (3 brokers, m5.large)
Set up Schema Registry (Confluent Cloud)
Configure networking (VPC peering, security groups)
Deploy Kafka Connect cluster (3 workers on ECS Fargate)

Week 3: Monitoring & Alerting

CloudWatch dashboards for lag, throughput, error rate
PagerDuty alerts for connector failures and high lag
Custom Lambda to check replication slot growth
Runbooks for common failure scenarios

Week 4: Testing & Validation

Chaos engineering: kill connectors, brokers, restart Postgres
Load testing: simulate Black Friday traffic (10x normal load)
Data quality checks: row counts, checksums, schema validation
Failover drills: practice recovery procedures

Phase 3: Gradual Migration (12 Weeks)

Rather than a "big bang" cutover, they migrated table-by-table:

Week 1-2: Core Tables (orders, line_items)

Run CDC pipeline in parallel with legacy ETL
Compare row counts and aggregate metrics daily
Analysts query both systems to verify consistency

Result: 0.02% discrepancy (due to timezone handling), fixed with transform

Week 3-4: Customer Tables (customers, addresses)

Handle PII fields (added field-level encryption in Kafka)
Validate GDPR delete propagation (tombstones → Snowflake DELETEs)

Result: Successfully replicated, compliance team approved

Week 5-8: Product & Inventory (products, inventory_levels)

High update frequency (inventory changes every few seconds)
Enabled compaction on Kafka topics to reduce storage
Tuned buffer sizes to balance latency and throughput

Challenge: Initial config caused 2-minute lag spikes during peak hours

Solution: Increased partition count from 3 to 6, reduced lag to <30s

Week 9-12: Long-Tail Tables & Decommission Legacy ETL

Migrated remaining 15 low-volume tables
Updated all Tableau dashboards to query CDC-populated tables
Ran dual systems for 2 weeks as a safety net
Decommissioned legacy ETL cron jobs and EC2 instances

Challenges Encountered

Challenge 1: Schema Evolution Gotcha

Incident: An engineer added a NOT NULL column to the orders table with a default value. Debezium events included the new column, but Snowflake's auto-create feature made it nullable. Downstream queries started failing on NULL values that shouldn't exist.

Solution: Implemented a pre-deployment check requiring schema changes to be coordinated with the data team. Added Snowflake table definitions to version control. Created a schema evolution guide (see Schema Evolution module).

Challenge 2: Replication Slot Bloat

Incident: During a Kafka Connect maintenance window, the Debezium connector was offline for 3 hours. The PostgreSQL replication slot accumulated 45GB of WAL files, causing disk space alerts.

Solution: Set wal_keep_size = 10GB on RDS to limit retention. Implemented monitoring to alert if slot lag exceeds 5GB. Created a runbook for safely dropping and recreating slots when lag is unrecoverable.

Challenge 3: Exactly-Once Semantics Confusion

Incident: After a Kafka broker failure and rebalance, some rows appeared duplicated in Snowflake. Investigation revealed the Snowflake connector's idempotency only works within a single file load, not across batches.

Solution: Implemented a post-load deduplication step using Snowflake's QUALIFY ROW_NUMBER() OVER (PARTITION BY primary_key ORDER BY _metadata_load_time DESC) = 1 pattern. Documented the limitation and created views that abstract the deduplication logic (see Exactly-Once module for detailed patterns).

Challenge 4: Monitoring Blind Spots

Incident: A silent failure in the Snowflake sink connector went unnoticed for 6 hours. The connector showed "RUNNING" status, but wasn't actually committing offsets or loading data. The team only discovered the issue when analysts reported stale dashboards.

Solution: Added end-to-end freshness checks: a canary record written to Postgres every minute, with an alert if it doesn't appear in Snowflake within 5 minutes. Implemented a "data heartbeat" table that gets updated continuously. Documented in Observability module.

Part 5: Outcomes & Impact

Quantified Results

24h → 3min

Data Latency (P95)

From nightly batch to near-real-time

-72%

Production DB Load

Eliminated nightly full-table scans

99.8%

Pipeline Uptime

vs 95% for legacy ETL (monthly avg)

$50K/mo

Infrastructure Cost

vs $42K/mo for legacy ETL (net +$8K)

-40%

Data Engineering Toil

Freed up 1.5 FTEs for new projects

18

New Real-Time Use Cases

Built on Kafka in first 6 months

Business Impact Stories

📈 Marketing: Real-Time Campaign Optimization

The marketing team can now see campaign performance in real-time. During a recent flash sale, they noticed one product category underperforming within 10 minutes of launch and quickly adjusted ad spend to better-performing categories. Result: 18% increase in ROAS for that campaign.

🛍️ Customer Experience: Live Order Tracking

Customer support agents now have access to real-time order status. When a customer calls asking "Where's my order?", agents see the latest information immediately, not yesterday's snapshot. Result: 15% reduction in "where is my order" call volume.

🔒 Fraud Prevention: Faster Detection

By streaming order events to a fraud detection model in real-time, the risk team can now flag suspicious patterns within seconds instead of hours. Result: Blocked 3 large fraudulent orders totaling $28K in the first month.

📊 Analytics: Self-Service Data Access

With Kafka topics as a canonical event stream, product teams can now subscribe to relevant data without filing tickets to the data engineering team. Result: 60% reduction in "data request" tickets.

Lessons Learned

✅ Start Small, Prove Value

The POC was critical for building confidence. By demonstrating value with a single table before committing to full migration, they secured buy-in from skeptical stakeholders.

✅ Invest in Observability Early

End-to-end monitoring (not just component-level metrics) prevented silent failures. The canary record pattern proved invaluable for catching subtle issues.

✅ Runbooks Are Worth Their Weight

Documenting recovery procedures before incidents meant faster MTTR and less panic. They practiced failure scenarios monthly to keep skills sharp.

⚠️ Don't Underestimate Schema Changes

Schema evolution was the most common source of production incidents. They eventually built automated schema compatibility checks into their CI/CD pipeline.

⚠️ Exactly-Once Is Harder Than It Looks

"Exactly-once" guarantees have subtle edge cases. Understanding where guarantees apply (and where they don't) is critical. Idempotent consumers are still necessary.

⚠️ Operational Complexity Is Real

Running Kafka required new skills and new on-call responsibilities. The team grew, and they hired a dedicated Kafka expert after 6 months. Budget accordingly.

"CDC transformed how we think about data. We went from asking 'What happened yesterday?' to 'What's happening right now?' That mindset shift unlocked opportunities we hadn't even imagined when we started." — Sarah Chen, VP of Data Engineering, ShopStream

Part 6: What's Next for ShopStream

With the CDC pipeline successfully deployed, ShopStream is now exploring advanced patterns:

Event-Driven Microservices: Migrating their monolithic order service to microservices using the Transactional Outbox pattern
Real-Time Personalization: Feeding customer behavior events into a feature store for ML-powered recommendations
Multi-Region Replication: Using CDC to replicate data to a new EU datacenter for GDPR compliance and lower latency
Data Mesh: Enabling domain teams to own their data products, with Kafka as the backbone for data sharing

Key Takeaways

🎯 Business Value First

ShopStream didn't adopt CDC for technical elegance—they did it to solve real business problems. The Black Friday incident gave them a clear ROI story to tell leadership.

🔧 Right Tool for the Job

They chose Debezium despite higher complexity because it aligned with their strategic direction. For teams without Kafka skills or different constraints, Fivetran or DMS might be the better choice.

📈 Incremental Migration

Table-by-table migration reduced risk and allowed learning from each phase. Running dual systems briefly was worth the redundancy for peace of mind.

👀 Observability Is Non-Negotiable

End-to-end monitoring caught issues that component metrics missed. The canary record pattern and freshness checks were game-changers.

📚 Documentation & Training

Runbooks, schema evolution guides, and failure drills made the team resilient. Investing in knowledge sharing paid dividends when incidents occurred.

🚀 CDC as a Platform

The real win wasn't just replacing ETL—it was building an event streaming platform that enabled 18 new use cases in 6 months. Think of CDC as infrastructure, not a point solution.

Explore Related Topics

This case study referenced concepts from across the CDC curriculum. Dive deeper into the specific patterns and practices that made ShopStream's implementation successful:

Real-World Case Study: E-Commerce CDC Pipeline

Part 1: The Business Context

Company Profile

The Pain Points

❌ Decision Latency (24+ Hours)

❌ Production Database Load

❌ Operational Brittleness

❌ Growing Complexity

The Breaking Point

Business Requirements

Part 2: Solution Evaluation

The Requirements

Tool Comparison

Option 1: Debezium (Open Source)

Option 2: Fivetran (Managed SaaS)

Option 3: AWS DMS (Cloud-Native)

The Decision: Debezium + Kafka

Part 3: Solution Architecture

The End-to-End Pipeline

Component Details

1. Source: PostgreSQL (RDS)

2. Capture: Debezium PostgreSQL Connector

3. Stream: Kafka (MSK)

4. Sink: Snowflake Connector

5. Transformation: Kafka Streams (Optional)

Part 4: Implementation Journey

Phase 1: Proof of Concept (2 Weeks)

Phase 2: Production Infrastructure (4 Weeks)

Phase 3: Gradual Migration (12 Weeks)

Week 1-2: Core Tables (orders, line_items)

Week 3-4: Customer Tables (customers, addresses)

Week 5-8: Product & Inventory (products, inventory_levels)

Week 9-12: Long-Tail Tables & Decommission Legacy ETL

Challenges Encountered

Challenge 1: Schema Evolution Gotcha

Challenge 2: Replication Slot Bloat

Challenge 3: Exactly-Once Semantics Confusion

Challenge 4: Monitoring Blind Spots

Part 5: Outcomes & Impact

Quantified Results

Business Impact Stories

📈 Marketing: Real-Time Campaign Optimization

🛍️ Customer Experience: Live Order Tracking

🔒 Fraud Prevention: Faster Detection

📊 Analytics: Self-Service Data Access

Lessons Learned

✅ Start Small, Prove Value

✅ Invest in Observability Early

✅ Runbooks Are Worth Their Weight

⚠️ Don't Underestimate Schema Changes

⚠️ Exactly-Once Is Harder Than It Looks

⚠️ Operational Complexity Is Real

Part 6: What's Next for ShopStream

Key Takeaways

🎯 Business Value First

🔧 Right Tool for the Job

📈 Incremental Migration

👀 Observability Is Non-Negotiable

📚 Documentation & Training

🚀 CDC as a Platform

Explore Related Topics

📚 Introduction to CDC

📸 Snapshotting

🔄 Schema Evolution

✅ Exactly-Once Semantics

👁️ Observability

🛠️ CDC Tooling

💡 Real-World Use Cases

🧪 Hands-On Lab

What is a common trigger for organizations to migrate from batch ETL to CDC?

In a typical CDC implementation, what is a key decision point when choosing between push and pull patterns?

What lesson is commonly learned during the initial snapshot phase of a CDC rollout?

What is a typical challenge when integrating CDC into existing data warehouse architectures?

What operational lesson is often learned after deploying CDC to production?