Intermediate

Real-World Case Study: E-Commerce CDC Pipeline

Follow a realistic CDC implementation journey: from slow batch ETL to real-time analytics. Learn how decision points, trade-offs, and practical challenges shape production deployments.

This case study follows ShopStream, a mid-sized e-commerce company, as they implement Change Data Capture to transform their data infrastructure. Their journey illustrates the real-world decision points, trade-offs, and challenges that teams face when moving from batch ETL to real-time data streaming.

Part 1: The Business Context

Company Profile

  • Industry: E-commerce (fashion & accessories)
  • Scale: 5M orders/year, 2M active customers, 50K SKUs
  • Tech Stack: PostgreSQL (OLTP), Snowflake (warehouse), Tableau (BI)
  • Team: 8 data engineers, 20 analysts, 200 total engineering

The Pain Points

ShopStream's analytics ran on a nightly batch ETL pipeline. Every night at 2 AM, a series of complex SQL queries would extract data from the production PostgreSQL database, transform it, and load it into Snowflake. This architecture had served them well during early growth, but was now causing significant problems:

❌ Decision Latency (24+ Hours)

The marketing team couldn't respond to real-time campaign performance. A flash sale that underperformed wouldn't be detected until the next day, missing the critical window for adjustments. Customer support couldn't see real-time order status, leading to frustrated customers and increased call volume.

❌ Production Database Load

The ETL queries scanned entire tables every night, causing noticeable performance degradation. During peak shopping seasons (Black Friday, holidays), the ETL window would extend into business hours, causing customer-facing slowdowns. The team had to carefully schedule maintenance windows around the ETL.

❌ Operational Brittleness

Failed ETL jobs meant analysts started their day with stale data and no clear ETA for fixes. Schema changes required coordinated updates across extraction scripts, transformation logic, and target tables. The on-call rotation dreaded weekends, when failures might not be noticed until Monday.

❌ Growing Complexity

As the business added features (subscriptions, gift cards, international shipping), the ETL became increasingly complex. The team spent more time maintaining and debugging ETL code than building new analytics capabilities. Technical debt was accumulating faster than they could pay it down.

The Breaking Point

The decision to investigate CDC came after a Black Friday incident. A pricing error on a high-margin product went undetected for 8 hours because the dashboard showing "revenue per category" wouldn't refresh until the next morning's ETL run. By the time the team noticed, the company had lost over $150K in margin.

"We realized we were flying blind. Our data was always yesterday's story, and in e-commerce, yesterday might as well be last year." β€” Sarah Chen, VP of Data Engineering, ShopStream

Business Requirements

The leadership team set clear goals for a new data architecture:

  • Sub-5-minute latency: Key metrics must reflect reality within 5 minutes
  • Zero impact to OLTP: No additional load on the production database
  • Incremental rollout: Migrate table-by-table, not a "big bang" cutover
  • Cost-neutral: New infrastructure costs offset by retiring legacy ETL infrastructure
  • 6-month timeline: Full migration complete within two quarters

Part 2: Solution Evaluation

The Requirements

The data engineering team evaluated solutions against a clear set of technical requirements:

  • Log-based capture: Must read from PostgreSQL WAL, not polling
  • Exactly-once delivery: No duplicate rows in Snowflake from replays
  • Schema evolution support: Handle ADD COLUMN, data type changes gracefully
  • Snapshot capability: Backfill historical data without impacting production
  • Operational maturity: Monitoring, alerting, and runbooks available

Tool Comparison

The team evaluated three major approaches, considering both technical fit and total cost of ownership:

Option 1: Debezium (Open Source)

Architecture: Self-hosted Kafka + Kafka Connect + Debezium PostgreSQL connector + Snowflake Sink

Pros:
  • No licensing costs (Apache 2.0)
  • Full control over configuration and scaling
  • Battle-tested for PostgreSQL CDC
  • Large community and extensive documentation
  • Can handle complex transformations in Kafka Streams
Cons:
  • Requires Kafka expertise on the team
  • Operational burden: patching, upgrades, capacity planning
  • Need to build monitoring and alerting from scratch
  • Estimated 2 FTEs for ongoing operations

Estimated Cost: $15K/month (infrastructure) + 2 FTEs (~$35K/month fully loaded) = $50K/month

Option 2: Fivetran (Managed SaaS)

Architecture: Fivetran PostgreSQL connector β†’ Fivetran cloud β†’ Snowflake

Pros:
  • Zero infrastructure management
  • Fast time-to-value (setup in hours)
  • Built-in monitoring and alerting
  • Automatic schema drift handling
  • 24/7 support from vendor
Cons:
  • Pricing based on Monthly Active Rows (MAR)
  • Less control over data transformations
  • Vendor lock-in concerns
  • Limited customization options
  • Data flows through third-party infrastructure

Estimated Cost: $25K/month (based on projected MAR) + 0.5 FTE (~$9K/month) = $34K/month

Option 3: AWS DMS (Cloud-Native)

Architecture: DMS PostgreSQL source β†’ DMS replication instance β†’ Snowflake via S3 staging

Pros:
  • Fully managed by AWS
  • Native integration with existing AWS infrastructure
  • Granular cost control (pay per instance hour)
  • CloudWatch integration for monitoring
Cons:
  • Less mature CDC support than Debezium
  • Limited transformation capabilities
  • Occasional consistency issues reported in community
  • Requires S3 staging bucket for Snowflake (added complexity)
  • Snowflake connector requires custom integration via S3

Estimated Cost: $12K/month (DMS + S3 + data transfer) + 1 FTE (~$18K/month) = $30K/month

The Decision: Debezium + Kafka

After careful evaluation, the team chose Debezium on Kafka, despite the higher operational complexity. The deciding factors were:

  1. Strategic Investment: Building Kafka expertise would pay dividends beyond CDC. The team had already identified multiple use cases for event streaming (inventory updates, order notifications, fraud detection).
  2. Flexibility: Open-source meant no limits on customization. They could handle complex scenarios like multi-tenancy and custom transformations.
  3. Cost Trajectory: While initially more expensive, costs would remain predictable as data volumes grew, unlike MAR-based pricing.
  4. Team Skills: Two senior engineers had prior Kafka experience, reducing the ramp-up risk.
"We knew Debezium would be harder upfront, but we were building a capability, not just solving a point problem. Kafka would become our event backbone." β€” Marcus Rodriguez, Principal Engineer, ShopStream

Part 3: Solution Architecture

The End-to-End Pipeline

The final architecture consists of five major components, each playing a specific role in the data flow:

Component Details

1. Source: PostgreSQL (RDS)

The production database remained unchanged, except for enabling logical replication:

-- Enable logical replication (requires restart)
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;

-- Create replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- Grant necessary permissions
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium_user;
GRANT USAGE ON SCHEMA public TO debezium_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO debezium_user;

Key Decision: They used RDS PostgreSQL (not Aurora) because Aurora's logical replication had known limitations with high-throughput workloads at the time.

2. Capture: Debezium PostgreSQL Connector

The Debezium connector reads the PostgreSQL WAL and publishes change events to Kafka:

{
  "name": "orders-postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "shopstream-prod.abcdef.us-east-1.rds.amazonaws.com",
    "database.port": "5432",
    "database.user": "debezium_user",
    "database.password": "${file:/secrets/db-password.txt:password}",
    "database.dbname": "orders",
    "topic.prefix": "orders",
    "table.include.list": "public.orders,public.customers,public.line_items,public.products",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_publication",
    "plugin.name": "pgoutput",
    "snapshot.mode": "initial",
    "snapshot.locking.mode": "none",
    "heartbeat.interval.ms": "30000",
    "decimal.handling.mode": "precise",
    "time.precision.mode": "adaptive",
    "tombstones.on.delete": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.unwrap.delete.handling.mode": "rewrite"
  }
}

Key Decisions:

  • snapshot.locking.mode: none prevents holding locks during initial snapshot
  • plugin.name: pgoutput uses PostgreSQL's native logical decoding (no external plugin needed)
  • tombstones.on.delete: true ensures deletes are properly propagated
  • Avro with Schema Registry for type safety and schema evolution

3. Stream: Kafka (MSK)

They chose AWS Managed Streaming for Kafka (MSK) to reduce operational burden:

  • Cluster Size: 3 brokers (m5.large) across 3 AZs
  • Storage: 1TB EBS per broker (gp3)
  • Topic Config:
    • 3 partitions per topic (orders volume: ~2K events/sec peak)
    • Replication factor: 3 (for durability)
    • Retention: 7 days (compliance requirement)
    • Compression: lz4 (balance of speed and ratio)

Topic Naming Convention: {database}.{schema}.{table}
Example: orders.public.orders, orders.public.customers

4. Sink: Snowflake Connector

The Snowflake Sink connector consumes from Kafka and applies changes to Snowflake:

{
  "name": "snowflake-sink",
  "config": {
    "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "topics": "orders.public.orders,orders.public.customers,orders.public.line_items,orders.public.products",
    "snowflake.url.name": "shopstream.snowflakecomputing.com",
    "snowflake.user.name": "kafka_loader",
    "snowflake.private.key": "${file:/secrets/snowflake-key.pem:key}",
    "snowflake.database.name": "ANALYTICS",
    "snowflake.schema.name": "ORDERS_CDC",
    "snowflake.role.name": "KAFKA_LOADER_ROLE",
    "buffer.count.records": "10000",
    "buffer.size.bytes": "5000000",
    "buffer.flush.time": "60",
    "snowflake.metadata.createtime": "false",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "snowflake.enable.schematization": "true",
    "snowflake.topic2table.map": "orders.public.orders:ORDERS,orders.public.customers:CUSTOMERS"
  }
}

How it works: The connector stages micro-batches to S3, then loads them into Snowflake using COPY INTO. Snowflake then uses a stored procedure to MERGE staged records into the target tables.

5. Transformation: Kafka Streams (Optional)

For real-time aggregations, they built a Kafka Streams application:

// Real-time revenue by category (5-minute tumbling window)
KStream<String, Order> orders = builder.stream("orders.public.orders");
KStream<String, LineItem> lineItems = builder.stream("orders.public.line_items");

// Join orders with line items, aggregate by product category
KTable<Windowed<String>, RevenueMetric> revenueByCategory = orders
    .join(lineItems, 
        (order, lineItem) -> new OrderLineItem(order, lineItem),
        JoinWindows.of(Duration.ofMinutes(5)))
    .groupBy((key, orderLineItem) -> orderLineItem.getCategory())
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .aggregate(
        RevenueMetric::new,
        (key, orderLineItem, agg) -> agg.add(orderLineItem.getTotal()),
        Materialized.with(Serdes.String(), new RevenueMetricSerde())
    );

// Expose via interactive queries for dashboard API
revenueByCategory.toStream()
    .to("revenue-by-category-windowed", Produced.with(WindowedSerdes.timeWindowedSerdeFrom(String.class), new RevenueMetricSerde()));

Part 4: Implementation Journey

Phase 1: Proof of Concept (2 Weeks)

The team started with a minimal viable pipeline to validate the approach:

  • Goal: Replicate a single table (orders) to Snowflake
  • Environment: Staging database, single-node Kafka (Docker Compose)
  • Success Criteria:
    • Initial snapshot completes without locking production
    • Change events appear in Snowflake within 60 seconds
    • Schema change (ADD COLUMN) handled gracefully

Outcome: βœ… POC successful. Latency achieved: 15-30 seconds average.

Phase 2: Production Infrastructure (4 Weeks)

With POC validated, they built production-grade infrastructure:

Week 1-2: Kafka Setup
  • Provision MSK cluster (3 brokers, m5.large)
  • Set up Schema Registry (Confluent Cloud)
  • Configure networking (VPC peering, security groups)
  • Deploy Kafka Connect cluster (3 workers on ECS Fargate)
Week 3: Monitoring & Alerting
  • CloudWatch dashboards for lag, throughput, error rate
  • PagerDuty alerts for connector failures and high lag
  • Custom Lambda to check replication slot growth
  • Runbooks for common failure scenarios
Week 4: Testing & Validation
  • Chaos engineering: kill connectors, brokers, restart Postgres
  • Load testing: simulate Black Friday traffic (10x normal load)
  • Data quality checks: row counts, checksums, schema validation
  • Failover drills: practice recovery procedures

Phase 3: Gradual Migration (12 Weeks)

Rather than a "big bang" cutover, they migrated table-by-table:

Week 1-2: Core Tables (orders, line_items)

  • Run CDC pipeline in parallel with legacy ETL
  • Compare row counts and aggregate metrics daily
  • Analysts query both systems to verify consistency

Result: 0.02% discrepancy (due to timezone handling), fixed with transform

Week 3-4: Customer Tables (customers, addresses)

  • Handle PII fields (added field-level encryption in Kafka)
  • Validate GDPR delete propagation (tombstones β†’ Snowflake DELETEs)

Result: Successfully replicated, compliance team approved

Week 5-8: Product & Inventory (products, inventory_levels)

  • High update frequency (inventory changes every few seconds)
  • Enabled compaction on Kafka topics to reduce storage
  • Tuned buffer sizes to balance latency and throughput

Challenge: Initial config caused 2-minute lag spikes during peak hours

Solution: Increased partition count from 3 to 6, reduced lag to <30s

Week 9-12: Long-Tail Tables & Decommission Legacy ETL

  • Migrated remaining 15 low-volume tables
  • Updated all Tableau dashboards to query CDC-populated tables
  • Ran dual systems for 2 weeks as a safety net
  • Decommissioned legacy ETL cron jobs and EC2 instances

Challenges Encountered

Challenge 1: Schema Evolution Gotcha

Incident: An engineer added a NOT NULL column to the orders table with a default value. Debezium events included the new column, but Snowflake's auto-create feature made it nullable. Downstream queries started failing on NULL values that shouldn't exist.

Solution: Implemented a pre-deployment check requiring schema changes to be coordinated with the data team. Added Snowflake table definitions to version control. Created a schema evolution guide (see Schema Evolution module).

Challenge 2: Replication Slot Bloat

Incident: During a Kafka Connect maintenance window, the Debezium connector was offline for 3 hours. The PostgreSQL replication slot accumulated 45GB of WAL files, causing disk space alerts.

Solution: Set wal_keep_size = 10GB on RDS to limit retention. Implemented monitoring to alert if slot lag exceeds 5GB. Created a runbook for safely dropping and recreating slots when lag is unrecoverable.

Challenge 3: Exactly-Once Semantics Confusion

Incident: After a Kafka broker failure and rebalance, some rows appeared duplicated in Snowflake. Investigation revealed the Snowflake connector's idempotency only works within a single file load, not across batches.

Solution: Implemented a post-load deduplication step using Snowflake's QUALIFY ROW_NUMBER() OVER (PARTITION BY primary_key ORDER BY _metadata_load_time DESC) = 1 pattern. Documented the limitation and created views that abstract the deduplication logic (see Exactly-Once module for detailed patterns).

Challenge 4: Monitoring Blind Spots

Incident: A silent failure in the Snowflake sink connector went unnoticed for 6 hours. The connector showed "RUNNING" status, but wasn't actually committing offsets or loading data. The team only discovered the issue when analysts reported stale dashboards.

Solution: Added end-to-end freshness checks: a canary record written to Postgres every minute, with an alert if it doesn't appear in Snowflake within 5 minutes. Implemented a "data heartbeat" table that gets updated continuously. Documented in Observability module.

Part 5: Outcomes & Impact

Quantified Results

24h β†’ 3min
Data Latency (P95)
From nightly batch to near-real-time
-72%
Production DB Load
Eliminated nightly full-table scans
99.8%
Pipeline Uptime
vs 95% for legacy ETL (monthly avg)
$50K/mo
Infrastructure Cost
vs $42K/mo for legacy ETL (net +$8K)
-40%
Data Engineering Toil
Freed up 1.5 FTEs for new projects
18
New Real-Time Use Cases
Built on Kafka in first 6 months

Business Impact Stories

πŸ“ˆ Marketing: Real-Time Campaign Optimization

The marketing team can now see campaign performance in real-time. During a recent flash sale, they noticed one product category underperforming within 10 minutes of launch and quickly adjusted ad spend to better-performing categories. Result: 18% increase in ROAS for that campaign.

πŸ›οΈ Customer Experience: Live Order Tracking

Customer support agents now have access to real-time order status. When a customer calls asking "Where's my order?", agents see the latest information immediately, not yesterday's snapshot. Result: 15% reduction in "where is my order" call volume.

πŸ”’ Fraud Prevention: Faster Detection

By streaming order events to a fraud detection model in real-time, the risk team can now flag suspicious patterns within seconds instead of hours. Result: Blocked 3 large fraudulent orders totaling $28K in the first month.

πŸ“Š Analytics: Self-Service Data Access

With Kafka topics as a canonical event stream, product teams can now subscribe to relevant data without filing tickets to the data engineering team. Result: 60% reduction in "data request" tickets.

Lessons Learned

βœ… Start Small, Prove Value

The POC was critical for building confidence. By demonstrating value with a single table before committing to full migration, they secured buy-in from skeptical stakeholders.

βœ… Invest in Observability Early

End-to-end monitoring (not just component-level metrics) prevented silent failures. The canary record pattern proved invaluable for catching subtle issues.

βœ… Runbooks Are Worth Their Weight

Documenting recovery procedures before incidents meant faster MTTR and less panic. They practiced failure scenarios monthly to keep skills sharp.

⚠️ Don't Underestimate Schema Changes

Schema evolution was the most common source of production incidents. They eventually built automated schema compatibility checks into their CI/CD pipeline.

⚠️ Exactly-Once Is Harder Than It Looks

"Exactly-once" guarantees have subtle edge cases. Understanding where guarantees apply (and where they don't) is critical. Idempotent consumers are still necessary.

⚠️ Operational Complexity Is Real

Running Kafka required new skills and new on-call responsibilities. The team grew, and they hired a dedicated Kafka expert after 6 months. Budget accordingly.

"CDC transformed how we think about data. We went from asking 'What happened yesterday?' to 'What's happening right now?' That mindset shift unlocked opportunities we hadn't even imagined when we started." β€” Sarah Chen, VP of Data Engineering, ShopStream

Part 6: What's Next for ShopStream

With the CDC pipeline successfully deployed, ShopStream is now exploring advanced patterns:

  • Event-Driven Microservices: Migrating their monolithic order service to microservices using the Transactional Outbox pattern
  • Real-Time Personalization: Feeding customer behavior events into a feature store for ML-powered recommendations
  • Multi-Region Replication: Using CDC to replicate data to a new EU datacenter for GDPR compliance and lower latency
  • Data Mesh: Enabling domain teams to own their data products, with Kafka as the backbone for data sharing

Key Takeaways

🎯 Business Value First

ShopStream didn't adopt CDC for technical eleganceβ€”they did it to solve real business problems. The Black Friday incident gave them a clear ROI story to tell leadership.

πŸ”§ Right Tool for the Job

They chose Debezium despite higher complexity because it aligned with their strategic direction. For teams without Kafka skills or different constraints, Fivetran or DMS might be the better choice.

πŸ“ˆ Incremental Migration

Table-by-table migration reduced risk and allowed learning from each phase. Running dual systems briefly was worth the redundancy for peace of mind.

πŸ‘€ Observability Is Non-Negotiable

End-to-end monitoring caught issues that component metrics missed. The canary record pattern and freshness checks were game-changers.

πŸ“š Documentation & Training

Runbooks, schema evolution guides, and failure drills made the team resilient. Investing in knowledge sharing paid dividends when incidents occurred.

πŸš€ CDC as a Platform

The real win wasn't just replacing ETLβ€”it was building an event streaming platform that enabled 18 new use cases in 6 months. Think of CDC as infrastructure, not a point solution.

Case Study Knowledge Check

Test your understanding of real-world CDC implementation patterns and lessons learned.

Q1

What is a common trigger for organizations to migrate from batch ETL to CDC?

Q2

In a typical CDC implementation, what is a key decision point when choosing between push and pull patterns?

Q3

What lesson is commonly learned during the initial snapshot phase of a CDC rollout?

Q4

What is a typical challenge when integrating CDC into existing data warehouse architectures?

Q5

What operational lesson is often learned after deploying CDC to production?

0/5 correct
Progress 0% No progress yet
Progress is stored locally in this browser.
Progress 0% No progress yet
Progress is stored locally in this browser.