This case study follows ShopStream, a mid-sized e-commerce company, as they implement Change Data Capture to transform their data infrastructure. Their journey illustrates the real-world decision points, trade-offs, and challenges that teams face when moving from batch ETL to real-time data streaming.
Part 1: The Business Context
Company Profile
- Industry: E-commerce (fashion & accessories)
- Scale: 5M orders/year, 2M active customers, 50K SKUs
- Tech Stack: PostgreSQL (OLTP), Snowflake (warehouse), Tableau (BI)
- Team: 8 data engineers, 20 analysts, 200 total engineering
The Pain Points
ShopStream's analytics ran on a nightly batch ETL pipeline. Every night at 2 AM, a series of complex SQL queries would extract data from the production PostgreSQL database, transform it, and load it into Snowflake. This architecture had served them well during early growth, but was now causing significant problems:
β Decision Latency (24+ Hours)
The marketing team couldn't respond to real-time campaign performance. A flash sale that underperformed wouldn't be detected until the next day, missing the critical window for adjustments. Customer support couldn't see real-time order status, leading to frustrated customers and increased call volume.
β Production Database Load
The ETL queries scanned entire tables every night, causing noticeable performance degradation. During peak shopping seasons (Black Friday, holidays), the ETL window would extend into business hours, causing customer-facing slowdowns. The team had to carefully schedule maintenance windows around the ETL.
β Operational Brittleness
Failed ETL jobs meant analysts started their day with stale data and no clear ETA for fixes. Schema changes required coordinated updates across extraction scripts, transformation logic, and target tables. The on-call rotation dreaded weekends, when failures might not be noticed until Monday.
β Growing Complexity
As the business added features (subscriptions, gift cards, international shipping), the ETL became increasingly complex. The team spent more time maintaining and debugging ETL code than building new analytics capabilities. Technical debt was accumulating faster than they could pay it down.
The Breaking Point
The decision to investigate CDC came after a Black Friday incident. A pricing error on a high-margin product went undetected for 8 hours because the dashboard showing "revenue per category" wouldn't refresh until the next morning's ETL run. By the time the team noticed, the company had lost over $150K in margin.
"We realized we were flying blind. Our data was always yesterday's story, and in e-commerce, yesterday might as well be last year." β Sarah Chen, VP of Data Engineering, ShopStream
Business Requirements
The leadership team set clear goals for a new data architecture:
- Sub-5-minute latency: Key metrics must reflect reality within 5 minutes
- Zero impact to OLTP: No additional load on the production database
- Incremental rollout: Migrate table-by-table, not a "big bang" cutover
- Cost-neutral: New infrastructure costs offset by retiring legacy ETL infrastructure
- 6-month timeline: Full migration complete within two quarters
Part 2: Solution Evaluation
The Requirements
The data engineering team evaluated solutions against a clear set of technical requirements:
- Log-based capture: Must read from PostgreSQL WAL, not polling
- Exactly-once delivery: No duplicate rows in Snowflake from replays
- Schema evolution support: Handle ADD COLUMN, data type changes gracefully
- Snapshot capability: Backfill historical data without impacting production
- Operational maturity: Monitoring, alerting, and runbooks available
Tool Comparison
The team evaluated three major approaches, considering both technical fit and total cost of ownership:
Option 1: Debezium (Open Source)
Architecture: Self-hosted Kafka + Kafka Connect + Debezium PostgreSQL connector + Snowflake Sink
- No licensing costs (Apache 2.0)
- Full control over configuration and scaling
- Battle-tested for PostgreSQL CDC
- Large community and extensive documentation
- Can handle complex transformations in Kafka Streams
- Requires Kafka expertise on the team
- Operational burden: patching, upgrades, capacity planning
- Need to build monitoring and alerting from scratch
- Estimated 2 FTEs for ongoing operations
Estimated Cost: $15K/month (infrastructure) + 2 FTEs (~$35K/month fully loaded) = $50K/month
Option 2: Fivetran (Managed SaaS)
Architecture: Fivetran PostgreSQL connector β Fivetran cloud β Snowflake
- Zero infrastructure management
- Fast time-to-value (setup in hours)
- Built-in monitoring and alerting
- Automatic schema drift handling
- 24/7 support from vendor
- Pricing based on Monthly Active Rows (MAR)
- Less control over data transformations
- Vendor lock-in concerns
- Limited customization options
- Data flows through third-party infrastructure
Estimated Cost: $25K/month (based on projected MAR) + 0.5 FTE (~$9K/month) = $34K/month
Option 3: AWS DMS (Cloud-Native)
Architecture: DMS PostgreSQL source β DMS replication instance β Snowflake via S3 staging
- Fully managed by AWS
- Native integration with existing AWS infrastructure
- Granular cost control (pay per instance hour)
- CloudWatch integration for monitoring
- Less mature CDC support than Debezium
- Limited transformation capabilities
- Occasional consistency issues reported in community
- Requires S3 staging bucket for Snowflake (added complexity)
- Snowflake connector requires custom integration via S3
Estimated Cost: $12K/month (DMS + S3 + data transfer) + 1 FTE (~$18K/month) = $30K/month
The Decision: Debezium + Kafka
After careful evaluation, the team chose Debezium on Kafka, despite the higher operational complexity. The deciding factors were:
- Strategic Investment: Building Kafka expertise would pay dividends beyond CDC. The team had already identified multiple use cases for event streaming (inventory updates, order notifications, fraud detection).
- Flexibility: Open-source meant no limits on customization. They could handle complex scenarios like multi-tenancy and custom transformations.
- Cost Trajectory: While initially more expensive, costs would remain predictable as data volumes grew, unlike MAR-based pricing.
- Team Skills: Two senior engineers had prior Kafka experience, reducing the ramp-up risk.
"We knew Debezium would be harder upfront, but we were building a capability, not just solving a point problem. Kafka would become our event backbone." β Marcus Rodriguez, Principal Engineer, ShopStream
Part 3: Solution Architecture
The End-to-End Pipeline
The final architecture consists of five major components, each playing a specific role in the data flow:
graph LR
A[PostgreSQL OLTP
Orders, Customers, Products] -->|WAL Stream| B[Debezium Connector
Kafka Connect]
B -->|Change Events
JSON/Avro| C[Kafka Topics
orders.public.orders
orders.public.customers]
C -->|Stream| D[Snowflake Sink
Kafka Connect]
C -->|Real-time| E[Kafka Streams
Aggregations]
D -->|MERGE/UPSERT| F[Snowflake
Analytics Tables]
E -->|Live Metrics| G[Redis Cache
Dashboard API]
F -->|Historical Analysis| H[Tableau
BI Dashboards]
G -->|Current State| H
style A fill:#e1f5ff
style C fill:#fff3cd
style F fill:#d4edda
style G fill:#f8d7da
Component Details
1. Source: PostgreSQL (RDS)
The production database remained unchanged, except for enabling logical replication:
-- Enable logical replication (requires restart)
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;
-- Create replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');
-- Grant necessary permissions
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium_user;
GRANT USAGE ON SCHEMA public TO debezium_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO debezium_user;
Key Decision: They used RDS PostgreSQL (not Aurora) because Aurora's logical replication had known limitations with high-throughput workloads at the time.
2. Capture: Debezium PostgreSQL Connector
The Debezium connector reads the PostgreSQL WAL and publishes change events to Kafka:
{
"name": "orders-postgres-source",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "shopstream-prod.abcdef.us-east-1.rds.amazonaws.com",
"database.port": "5432",
"database.user": "debezium_user",
"database.password": "${file:/secrets/db-password.txt:password}",
"database.dbname": "orders",
"topic.prefix": "orders",
"table.include.list": "public.orders,public.customers,public.line_items,public.products",
"slot.name": "debezium_slot",
"publication.name": "debezium_publication",
"plugin.name": "pgoutput",
"snapshot.mode": "initial",
"snapshot.locking.mode": "none",
"heartbeat.interval.ms": "30000",
"decimal.handling.mode": "precise",
"time.precision.mode": "adaptive",
"tombstones.on.delete": "true",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "rewrite"
}
}
Key Decisions:
snapshot.locking.mode: noneprevents holding locks during initial snapshotplugin.name: pgoutputuses PostgreSQL's native logical decoding (no external plugin needed)tombstones.on.delete: trueensures deletes are properly propagated- Avro with Schema Registry for type safety and schema evolution
3. Stream: Kafka (MSK)
They chose AWS Managed Streaming for Kafka (MSK) to reduce operational burden:
- Cluster Size: 3 brokers (m5.large) across 3 AZs
- Storage: 1TB EBS per broker (gp3)
- Topic Config:
- 3 partitions per topic (orders volume: ~2K events/sec peak)
- Replication factor: 3 (for durability)
- Retention: 7 days (compliance requirement)
- Compression: lz4 (balance of speed and ratio)
Topic Naming Convention: {database}.{schema}.{table}
Example: orders.public.orders, orders.public.customers
4. Sink: Snowflake Connector
The Snowflake Sink connector consumes from Kafka and applies changes to Snowflake:
{
"name": "snowflake-sink",
"config": {
"connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
"topics": "orders.public.orders,orders.public.customers,orders.public.line_items,orders.public.products",
"snowflake.url.name": "shopstream.snowflakecomputing.com",
"snowflake.user.name": "kafka_loader",
"snowflake.private.key": "${file:/secrets/snowflake-key.pem:key}",
"snowflake.database.name": "ANALYTICS",
"snowflake.schema.name": "ORDERS_CDC",
"snowflake.role.name": "KAFKA_LOADER_ROLE",
"buffer.count.records": "10000",
"buffer.size.bytes": "5000000",
"buffer.flush.time": "60",
"snowflake.metadata.createtime": "false",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"snowflake.enable.schematization": "true",
"snowflake.topic2table.map": "orders.public.orders:ORDERS,orders.public.customers:CUSTOMERS"
}
}
How it works: The connector stages micro-batches to S3, then loads
them into Snowflake using COPY INTO. Snowflake then uses a stored procedure
to MERGE staged records into the target tables.
5. Transformation: Kafka Streams (Optional)
For real-time aggregations, they built a Kafka Streams application:
// Real-time revenue by category (5-minute tumbling window)
KStream<String, Order> orders = builder.stream("orders.public.orders");
KStream<String, LineItem> lineItems = builder.stream("orders.public.line_items");
// Join orders with line items, aggregate by product category
KTable<Windowed<String>, RevenueMetric> revenueByCategory = orders
.join(lineItems,
(order, lineItem) -> new OrderLineItem(order, lineItem),
JoinWindows.of(Duration.ofMinutes(5)))
.groupBy((key, orderLineItem) -> orderLineItem.getCategory())
.windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
.aggregate(
RevenueMetric::new,
(key, orderLineItem, agg) -> agg.add(orderLineItem.getTotal()),
Materialized.with(Serdes.String(), new RevenueMetricSerde())
);
// Expose via interactive queries for dashboard API
revenueByCategory.toStream()
.to("revenue-by-category-windowed", Produced.with(WindowedSerdes.timeWindowedSerdeFrom(String.class), new RevenueMetricSerde()));
Part 4: Implementation Journey
Phase 1: Proof of Concept (2 Weeks)
The team started with a minimal viable pipeline to validate the approach:
- Goal: Replicate a single table (
orders) to Snowflake - Environment: Staging database, single-node Kafka (Docker Compose)
- Success Criteria:
- Initial snapshot completes without locking production
- Change events appear in Snowflake within 60 seconds
- Schema change (ADD COLUMN) handled gracefully
Outcome: β POC successful. Latency achieved: 15-30 seconds average.
Phase 2: Production Infrastructure (4 Weeks)
With POC validated, they built production-grade infrastructure:
- Provision MSK cluster (3 brokers, m5.large)
- Set up Schema Registry (Confluent Cloud)
- Configure networking (VPC peering, security groups)
- Deploy Kafka Connect cluster (3 workers on ECS Fargate)
- CloudWatch dashboards for lag, throughput, error rate
- PagerDuty alerts for connector failures and high lag
- Custom Lambda to check replication slot growth
- Runbooks for common failure scenarios
- Chaos engineering: kill connectors, brokers, restart Postgres
- Load testing: simulate Black Friday traffic (10x normal load)
- Data quality checks: row counts, checksums, schema validation
- Failover drills: practice recovery procedures
Phase 3: Gradual Migration (12 Weeks)
Rather than a "big bang" cutover, they migrated table-by-table:
Week 1-2: Core Tables (orders, line_items)
- Run CDC pipeline in parallel with legacy ETL
- Compare row counts and aggregate metrics daily
- Analysts query both systems to verify consistency
Result: 0.02% discrepancy (due to timezone handling), fixed with transform
Week 3-4: Customer Tables (customers, addresses)
- Handle PII fields (added field-level encryption in Kafka)
- Validate GDPR delete propagation (tombstones β Snowflake DELETEs)
Result: Successfully replicated, compliance team approved
Week 5-8: Product & Inventory (products, inventory_levels)
- High update frequency (inventory changes every few seconds)
- Enabled compaction on Kafka topics to reduce storage
- Tuned buffer sizes to balance latency and throughput
Challenge: Initial config caused 2-minute lag spikes during peak hours
Solution: Increased partition count from 3 to 6, reduced lag to <30s
Week 9-12: Long-Tail Tables & Decommission Legacy ETL
- Migrated remaining 15 low-volume tables
- Updated all Tableau dashboards to query CDC-populated tables
- Ran dual systems for 2 weeks as a safety net
- Decommissioned legacy ETL cron jobs and EC2 instances
Challenges Encountered
Challenge 1: Schema Evolution Gotcha
Incident: An engineer added a NOT NULL column to the
orders table with a default value. Debezium events included the new column,
but Snowflake's auto-create feature made it nullable. Downstream queries started failing
on NULL values that shouldn't exist.
Solution: Implemented a pre-deployment check requiring schema changes to be coordinated with the data team. Added Snowflake table definitions to version control. Created a schema evolution guide (see Schema Evolution module).
Challenge 2: Replication Slot Bloat
Incident: During a Kafka Connect maintenance window, the Debezium connector was offline for 3 hours. The PostgreSQL replication slot accumulated 45GB of WAL files, causing disk space alerts.
Solution: Set wal_keep_size = 10GB on RDS to limit retention.
Implemented monitoring to alert if slot lag exceeds 5GB. Created a runbook for safely
dropping and recreating slots when lag is unrecoverable.
Challenge 3: Exactly-Once Semantics Confusion
Incident: After a Kafka broker failure and rebalance, some rows appeared duplicated in Snowflake. Investigation revealed the Snowflake connector's idempotency only works within a single file load, not across batches.
Solution: Implemented a post-load deduplication step using Snowflake's
QUALIFY ROW_NUMBER() OVER (PARTITION BY primary_key ORDER BY _metadata_load_time DESC) = 1
pattern. Documented the limitation and created views that abstract the deduplication logic
(see Exactly-Once module for detailed patterns).
Challenge 4: Monitoring Blind Spots
Incident: A silent failure in the Snowflake sink connector went unnoticed for 6 hours. The connector showed "RUNNING" status, but wasn't actually committing offsets or loading data. The team only discovered the issue when analysts reported stale dashboards.
Solution: Added end-to-end freshness checks: a canary record written to Postgres every minute, with an alert if it doesn't appear in Snowflake within 5 minutes. Implemented a "data heartbeat" table that gets updated continuously. Documented in Observability module.
Part 5: Outcomes & Impact
Quantified Results
Business Impact Stories
Lessons Learned
β Start Small, Prove Value
The POC was critical for building confidence. By demonstrating value with a single table before committing to full migration, they secured buy-in from skeptical stakeholders.
β Invest in Observability Early
End-to-end monitoring (not just component-level metrics) prevented silent failures. The canary record pattern proved invaluable for catching subtle issues.
β Runbooks Are Worth Their Weight
Documenting recovery procedures before incidents meant faster MTTR and less panic. They practiced failure scenarios monthly to keep skills sharp.
β οΈ Don't Underestimate Schema Changes
Schema evolution was the most common source of production incidents. They eventually built automated schema compatibility checks into their CI/CD pipeline.
β οΈ Exactly-Once Is Harder Than It Looks
"Exactly-once" guarantees have subtle edge cases. Understanding where guarantees apply (and where they don't) is critical. Idempotent consumers are still necessary.
β οΈ Operational Complexity Is Real
Running Kafka required new skills and new on-call responsibilities. The team grew, and they hired a dedicated Kafka expert after 6 months. Budget accordingly.
"CDC transformed how we think about data. We went from asking 'What happened yesterday?' to 'What's happening right now?' That mindset shift unlocked opportunities we hadn't even imagined when we started." β Sarah Chen, VP of Data Engineering, ShopStream
Part 6: What's Next for ShopStream
With the CDC pipeline successfully deployed, ShopStream is now exploring advanced patterns:
- Event-Driven Microservices: Migrating their monolithic order service to microservices using the Transactional Outbox pattern
- Real-Time Personalization: Feeding customer behavior events into a feature store for ML-powered recommendations
- Multi-Region Replication: Using CDC to replicate data to a new EU datacenter for GDPR compliance and lower latency
- Data Mesh: Enabling domain teams to own their data products, with Kafka as the backbone for data sharing
Key Takeaways
π― Business Value First
ShopStream didn't adopt CDC for technical eleganceβthey did it to solve real business problems. The Black Friday incident gave them a clear ROI story to tell leadership.
π§ Right Tool for the Job
They chose Debezium despite higher complexity because it aligned with their strategic direction. For teams without Kafka skills or different constraints, Fivetran or DMS might be the better choice.
π Incremental Migration
Table-by-table migration reduced risk and allowed learning from each phase. Running dual systems briefly was worth the redundancy for peace of mind.
π Observability Is Non-Negotiable
End-to-end monitoring caught issues that component metrics missed. The canary record pattern and freshness checks were game-changers.
π Documentation & Training
Runbooks, schema evolution guides, and failure drills made the team resilient. Investing in knowledge sharing paid dividends when incidents occurred.
π CDC as a Platform
The real win wasn't just replacing ETLβit was building an event streaming platform that enabled 18 new use cases in 6 months. Think of CDC as infrastructure, not a point solution.
Case Study Knowledge Check
Test your understanding of real-world CDC implementation patterns and lessons learned.
What is a common trigger for organizations to migrate from batch ETL to CDC?
Organizations typically adopt CDC when: (1) business units demand real-time dashboards and can't wait for nightly ETL, (2) competitors are gaining advantage with fresher data, (3) batch jobs are failing frequently, missing SLAs, or overwhelming source databases, or (4) scaling batch processes becomes economically unfeasible.
Review the correct answer and explanation.
In a typical CDC implementation, what is a key decision point when choosing between push and pull patterns?
Push patterns (triggers, application-level outbox) are easier to implement but add load and complexity to the source database. Pull patterns (log-based CDC like Debezium) are more complex to set up but have minimal source impact and capture all changes transparently. The choice depends on source database capabilities, performance constraints, and operational expertise.
Review the correct answer and explanation.
What lesson is commonly learned during the initial snapshot phase of a CDC rollout?
The initial snapshot is often underestimated. Large tables can take days to snapshot, during which transaction logs must be retained. Teams learn to: (1) set adequate log retention before starting, (2) chunk/paginate large tables to reduce memory pressure, (3) snapshot during low-traffic windows, and (4) test the snapshot-to-stream handoff thoroughly.
Review the correct answer and explanation.
What is a typical challenge when integrating CDC into existing data warehouse architectures?
Data warehouses optimized for batch loads (append-only, full refreshes) may need refactoring: adding MERGE/UPSERT support, changing to SCD Type 2 for history tracking, handling late-arriving events, or building reconciliation jobs. Transformations that assumed complete snapshots must adapt to incremental changes, requiring pipeline redesign.
Review the correct answer and explanation.
What operational lesson is often learned after deploying CDC to production?
Post-deployment, teams learn that CDC is not 'set and forget.' Critical operational practices include: monitoring consumer lag and error rates, setting up alerting, documenting offset reset procedures, practicing replay drills, maintaining log retention policies, and establishing runbooks for common failures. Operational maturity is as important as the initial implementation.
Review the correct answer and explanation.