Advanced

Handling Schema Evolution in CDC

Keep downstream systems humming as source schemas change. Learn how schema registries, compatibility modes, and message contracts safeguard your pipelines from drift.

Source database schemas are not static. Over the lifecycle of an application, developers will inevitably make changes: adding new columns, removing old ones, or modifying data types. This phenomenon, known as schema drift or schema evolution, is a primary cause of brittle data pipelines. A pipeline not designed for this will break the moment an incoming change event no longer matches what downstream consumers expect.

The Solution: The Schema Registry (An API Contract for Your Data)

The industry-standard solution is to treat your data's schema like a formal API contract. This is managed by a Schema Registry. Think of it as the 'OpenAPI' or 'Swagger' for your event streams. It's a centralized service that ensures all data producers and consumers agree on the 'shape' of the data, even as that shape evolves over time.

A schema-aware format such as Apache Avro or Protobuf decouples producers from consumers, so each side can evolve independently without breaking the pipeline.

Figure: Schema evolution flow showing the schema registry, converters, and sink adjustments. The Schema Registry mediates evolution between producers and consumers with compatibility checks.

Choosing Your Contract Strategy: Compatibility Modes

A schema registry isn't just a database of schemas; it actively enforces rules. When a producer tries to register a new schema version, the registry checks it against a configured compatibility mode. Understanding these modes is critical for operations.

Backward Compatibility

(New schemas can read old data)

This mode allows you to delete fields and add optional new fields (fields with a default value). Consumers that upgrade to the new schema keep working when they encounter data written with the old one, so upgrade consumers first and then let producers follow.

Use this for most standard use cases. It offers the most operational flexibility.

Forward Compatibility

(Old schemas can read new data)

This mode allows you to add new fields and delete optional old fields. It is less common. Producers can move to the new schema first and existing consumers keep reading, but a consumer that later upgrades is not guaranteed to read data written with older schemas, so replays need careful coordination.

Full Compatibility

(Both backward and forward)

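Ensures old consumers can read new data and new consumers can read old data, so producers and consumers can upgrade in any order. It is the safest mode but also the most restrictive: in practice you can only add or remove optional fields. Good for long-lived topics with many consumers.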
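To make the three modes concrete, here is a minimal sketch that classifies a schema change. It assumes a deliberately simplified schema model, `{field_name: (type, has_default)}`, and ignores Avro type promotion; `can_read` and `classify` are hypothetical helper names, not a registry API.

```python
def can_read(reader, writer):
    """True if data written with `writer` can be decoded using `reader`.

    Simplified model: a schema is {field_name: (type, has_default)}.
    """
    for name, (rtype, has_default) in reader.items():
        if name in writer:
            if writer[name][0] != rtype:  # no type promotion in this sketch
                return False
        elif not has_default:  # reader needs a value the writer never wrote
            return False
    return True  # fields only the writer knows are simply ignored


def classify(old, new):
    backward = can_read(reader=new, writer=old)  # new schema reads old data
    forward = can_read(reader=old, writer=new)   # old schema reads new data
    if backward and forward:
        return "FULL"
    if backward:
        return "BACKWARD"
    if forward:
        return "FORWARD"
    return "NONE"


v1 = {"id": ("long", False), "email": ("string", False)}
add_optional = {**v1, "full_name": ("string", True)}
add_required = {**v1, "full_name": ("string", False)}
drop_required = {"id": ("long", False)}

print(classify(v1, add_optional))   # FULL
print(classify(v1, add_required))   # FORWARD
print(classify(v1, drop_required))  # BACKWARD
```

Adding an optional field passes in both directions, adding a required field is only forward-compatible, and dropping a required field is only backward-compatible.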

Transitive vs Latest

Transitive checks a new schema against all previous versions. Latest checks only against the most recent version. Prefer transitive for critical topics to avoid “drift over time.”
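The difference shows up when a field is dropped and later re-added with another type. The sketch below assumes the same simplified schema model, `{field_name: (type, has_default)}`, and hypothetical helpers `can_read` and `check_backward`; it is an illustration of the rule, not how a registry implements it.

```python
def can_read(reader, writer):
    """True if data written with `writer` can be decoded using `reader`."""
    for name, (rtype, has_default) in reader.items():
        if name in writer:
            if writer[name][0] != rtype:
                return False
        elif not has_default:
            return False
    return True


def check_backward(history, candidate, transitive):
    """Return the 1-based versions whose data the candidate cannot read."""
    targets = history if transitive else history[-1:]
    offset = len(history) - len(targets)
    return [
        offset + i + 1
        for i, old in enumerate(targets)
        if not can_read(reader=candidate, writer=old)
    ]


v1 = {"id": ("long", False), "note": ("string", False)}
v2 = {"id": ("long", False)}                          # drops note
v3 = {"id": ("long", False), "note": ("int", True)}   # re-adds note as an int

print(check_backward([v1, v2], v3, transitive=False))  # []
print(check_backward([v1, v2], v3, transitive=True))   # [1]
```

Checked only against v2, the candidate looks safe. Checked against the full history, it fails on v1, whose `note` values are strings that may still sit in the topic.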

For most CDC pipelines, start with Backward (often transitive). It balances safety with agility: upgraded consumers can still read every event already in the topic, and the registry rejects producer changes that would break that guarantee.

Subject naming & scope

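Compatibility is enforced per subject (a Registry namespace). In Confluent Schema Registry, the common strategies are:

  • TopicNameStrategy (the default): one subject per topic, named <topic>-key or <topic>-value.
  • RecordNameStrategy: one subject per fully qualified record name, shared across topics.
  • TopicRecordNameStrategy: one subject per topic and record name, named <topic>-<record name>.

Pick once per topic family; changing later is painful.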
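As a rough illustration of how the choice changes the scope of compatibility checks, this sketch derives subject names under Confluent's three built-in strategies. The function `subject_name` and the topic names are hypothetical; the record name `demo.User` matches the Avro example later in this lesson.

```python
def subject_name(strategy, topic, record_fullname, is_key=False):
    """Derive the registry subject for a message under a naming strategy."""
    suffix = "key" if is_key else "value"
    if strategy == "TopicNameStrategy":
        return f"{topic}-{suffix}"
    if strategy == "RecordNameStrategy":
        return record_fullname
    if strategy == "TopicRecordNameStrategy":
        return f"{topic}-{record_fullname}"
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical CDC topics that both carry demo.User records.
for topic in ("db.public.users", "db.archive.users"):
    for strategy in ("TopicNameStrategy", "RecordNameStrategy", "TopicRecordNameStrategy"):
        print(f"{strategy:24} {topic:18} -> {subject_name(strategy, topic, 'demo.User')}")
```

With RecordNameStrategy both topics share one subject, so a schema change is checked once for all of them. With the other two strategies each topic evolves under its own subject.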

The Data Flow in a Schema-Aware Pipeline

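  • Producer → Schema Registry: register user_v1
  • Schema Registry → Producer: store and return id=1
  • Producer → topic: serialize with id=1 and publish
  • Consumer: read message (id=1)
  • Consumer → Schema Registry: if not cached, request id=1
  • Schema Registry → Consumer: return schema 1; the consumer caches it and deserializes

Upgrade to v2

  • Producer → Schema Registry: register user_v2
  • Schema Registry → Producer: run the compatibility check and return id=2
  • Producer → topic: publish with id=2
  • Consumer: if id=2 is unseen, fetch it, cache it, and deserialize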

This flow, mediated by the schema registry, allows a producer to upgrade to a `v2` schema and start sending messages with ID `2`. A consumer that has only seen ID `1` fetches schema `2` from the registry the first time it encounters it. If the schemas are compatible, the consumer processes the new message and ignores the fields it does not know; otherwise it should fail gracefully.
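The ID travels with every message. The following sketch assumes Confluent-style framing (a zero magic byte followed by a 4-byte big-endian schema ID, then the encoded payload) and uses a plain dict as a stand-in for the registry; `frame`, `unframe`, and `SchemaCache` are hypothetical names.

```python
import struct

MAGIC_BYTE = 0  # assumed Confluent-style framing marker


def frame(schema_id, payload):
    """Prefix an encoded payload with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def unframe(message):
    """Split a framed message into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown framing")
    return schema_id, message[5:]


class SchemaCache:
    """Fetch a schema from the registry only the first time its ID is seen."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._schemas = {}
        self.misses = 0

    def get(self, schema_id):
        if schema_id not in self._schemas:
            self.misses += 1
            self._schemas[schema_id] = self._fetch(schema_id)
        return self._schemas[schema_id]


registry = {1: "user_v1", 2: "user_v2"}  # stand-in for the real registry
cache = SchemaCache(registry.__getitem__)

for message in (frame(1, b"a"), frame(1, b"b"), frame(2, b"c")):
    schema_id, payload = unframe(message)
    print(schema_id, cache.get(schema_id), payload)

print("registry lookups:", cache.misses)  # 2
```

Three messages cause only two registry lookups, one per distinct schema ID, which is why the registry is not a per-message bottleneck.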

Concrete evolution examples

Avro (v1 → v2)

// v1
{ "type":"record","name":"User","namespace":"demo",
  "fields":[
    {"name":"id","type":"long"},
    {"name":"email","type":"string"}
  ] }
// v2 (backward-compatible): add an optional field with a default; the alias lets a reader map a writer field named "name" onto full_name
{ "type":"record","name":"User","namespace":"demo",
  "fields":[
    {"name":"id","type":"long"},
    {"name":"email","type":"string"},
    {"name":"full_name","type":["null","string"],"default":null,
     "aliases":["name"]} ] }

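OK under backward/backward_transitive. Adding a field without a default would violate it, because the new schema could no longer read old data. Dropping a required field passes a backward check but breaks forward compatibility, so consumers still on the old schema would fail.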
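The effect of the default and the alias can be sketched without an Avro library. This is a simplified model of Avro schema resolution that works on already-decoded dicts (real Avro resolves writer and reader schemas while decoding binary data); `resolve` is a hypothetical helper.

```python
def resolve(record, reader_schema):
    """Project a decoded writer record onto the reader schema."""
    out = {}
    for field in reader_schema["fields"]:
        # A reader field matches a writer field by its name or any alias.
        names = [field["name"], *field.get("aliases", [])]
        source = next((n for n in names if n in record), None)
        if source is not None:
            out[field["name"]] = record[source]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for field {field['name']!r}")
    return out  # writer fields the reader does not declare are dropped


user_v2 = {
    "type": "record", "name": "User", "namespace": "demo",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
        {"name": "full_name", "type": ["null", "string"], "default": None,
         "aliases": ["name"]},
    ],
}

# A record written with v1 gets the default for the new field.
print(resolve({"id": 1, "email": "a@example.com"}, user_v2))
# A record from a writer that called the field "name" is mapped via the alias.
print(resolve({"id": 2, "email": "b@example.com", "name": "Ada"}, user_v2))
```

If `full_name` had no default, the first call would raise, which is the failure a backward compatibility check is meant to catch before deployment.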

Protobuf (field numbers matter)

// v1
message User { int64 id = 1; string email = 2; }
// v2 (safe): add new optional field, keep numbers stable
message User { int64 id = 1; string email = 2; string full_name = 3; }

Never reuse or renumber fields. When you remove a field, mark its number (and ideally its name) as reserved so it cannot be reused later.
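A check like this can run in CI before a new .proto version is merged. The sketch assumes each message version has been reduced to a map of `{field_number: (name, type)}`; `tag_violations` is a hypothetical helper and is deliberately conservative (it also flags a rename on the same number, which is wire-safe in binary Protobuf but changes the JSON form).

```python
def tag_violations(old_fields, new_fields, reserved=frozenset()):
    """List field-number problems between two versions of a message."""
    problems = []
    old_numbers = {name: number for number, (name, _) in old_fields.items()}
    for number, (name, ftype) in new_fields.items():
        if number in reserved:
            problems.append(f"tag {number} is reserved but used by {name}")
        elif number in old_fields and old_fields[number] != (name, ftype):
            problems.append(f"tag {number} changed meaning")
        if name in old_numbers and old_numbers[name] != number:
            problems.append(f"field {name} renumbered {old_numbers[name]} -> {number}")
    return problems


user_v1 = {1: ("id", "int64"), 2: ("email", "string")}
user_v2 = {**user_v1, 3: ("full_name", "string")}
reused = {1: ("id", "int64"), 2: ("full_name", "string")}
renumbered = {1: ("id", "int64"), 3: ("email", "string")}

print(tag_violations(user_v1, user_v2))     # []
print(tag_violations(user_v1, reused))      # ['tag 2 changed meaning']
print(tag_violations(user_v1, renumbered))  # ['field email renumbered 2 -> 3']
```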

JSON Schema (tolerant readers)

{ "$schema":"https://json-schema.org/draft/2020-12/schema",
  "type":"object",
  "properties":{
    "id":{"type":"integer"},
    "email":{"type":"string"},
    "full_name":{"type":["string","null"]}
  },
  "required":["id","email"],
  "additionalProperties": true }

Set additionalProperties per policy. Many JSON consumers ignore unknown fields—document the expectation.
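A tolerant reader for the schema above might look like the following sketch. It hand-checks only the fields the consumer needs and uses nothing beyond the standard library; `read_user`, `REQUIRED`, and `OPTIONAL` are hypothetical names, and a full JSON Schema validator would be stricter in some corners (for example, it accepts 1.0 as an integer).

```python
import json

REQUIRED = {"id": int, "email": str}
OPTIONAL = {"full_name": (str, type(None))}


def _check(name, value, types):
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, types):
        raise ValueError(f"field {name!r} has unexpected type {type(value).__name__}")
    return value


def read_user(raw):
    """Parse a user event, ignoring fields this consumer does not know."""
    doc = json.loads(raw)
    user = {}
    for name, types in REQUIRED.items():
        if name not in doc:
            raise ValueError(f"missing required field {name!r}")
        user[name] = _check(name, doc[name], types)
    for name, types in OPTIONAL.items():
        user[name] = _check(name, doc.get(name), types)
    return user


print(read_user('{"id": 1, "email": "a@example.com"}'))
print(read_user('{"id": 2, "email": "b@example.com", "full_name": "Ada", "plan": "pro"}'))
```

The unknown `plan` field in the second event is silently dropped, which is the behavior `"additionalProperties": true` permits on the producer side.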

Safe rollout playbooks

Registry operations (quick refs)

# Set per-subject compatibility (examples)
curl -X PUT -H "Content-Type: application/json" \
  --data '{"compatibility":"BACKWARD_TRANSITIVE"}' \
  $REGISTRY_URL/config/your-subject

# Get the current compatibility
curl $REGISTRY_URL/config/your-subject

Automate these in CI so topics can’t drift to unsafe modes.
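One possible shape for such a CI check is sketched below. The allowed modes are a hypothetical team policy, and the response key is an assumption: the code accepts both `compatibilityLevel` and `compatibility`, since the name may differ by endpoint or registry version. The fetcher is injectable so the policy logic can be exercised without a live registry.

```python
import json
import urllib.parse
import urllib.request

SAFE_MODES = {"BACKWARD_TRANSITIVE", "FULL_TRANSITIVE"}  # hypothetical policy


def fetch_config(registry_url, subject):
    """GET the subject-level config from the registry (needs network access)."""
    url = f"{registry_url}/config/{urllib.parse.quote(subject, safe='')}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def unsafe_subjects(subjects, fetch):
    """Return {subject: mode} for every subject outside the allowed modes."""
    bad = {}
    for subject in subjects:
        config = fetch(subject)
        # Assumed response keys; adjust to what your registry returns.
        mode = config.get("compatibilityLevel") or config.get("compatibility")
        if mode not in SAFE_MODES:
            bad[subject] = mode
    return bad


# Offline demonstration with canned responses instead of a live registry.
canned = {
    "users-value": {"compatibilityLevel": "BACKWARD_TRANSITIVE"},
    "orders-value": {"compatibilityLevel": "NONE"},
}
print(unsafe_subjects(canned, canned.__getitem__))  # {'orders-value': 'NONE'}
```

In a pipeline you would pass `lambda s: fetch_config(registry_url, s)` as the fetcher and fail the build when the returned dict is not empty.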

CDC-specific wrinkles

Consumer defenses

Audit & test
  • Contract tests: validate proposed schemas against real consumer fixtures.
  • Canary topics: mirror a subset of prod through the new schema and compare sink MERGE row counts/checksums.
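For the canary comparison, a small order-independent fingerprint is often enough. This sketch assumes rows arrive as tuples from both sinks and that their `repr` is stable across the two reads; `table_fingerprint` is a hypothetical helper, not a feature of any particular warehouse.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) that does not depend on row order."""
    count, acc = 0, 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        # Summing per-row hashes modulo 2**64 makes the result order-independent.
        acc = (acc + int.from_bytes(digest[:8], "big")) % 2**64
        count += 1
    return count, acc


baseline = [(1, "a@example.com", None), (2, "b@example.com", "Ada")]
canary = [(2, "b@example.com", "Ada"), (1, "a@example.com", None)]
drifted = [(1, "a@example.com", None), (2, "b@example.com", "ada")]

print(table_fingerprint(baseline) == table_fingerprint(canary))   # True
print(table_fingerprint(baseline) == table_fingerprint(drifted))  # False
```

Matching counts with a differing checksum point to changed values rather than missing rows, which narrows down where the new schema altered the sink.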

Schema Evolution Knowledge Check

Test your understanding of schema compatibility and evolution strategies.

Q1

What is the primary purpose of a schema registry in CDC?

Q2

What does 'backward compatibility' mean in schema evolution?

Q3

What does 'forward compatibility' mean in schema evolution?

Q4

Why should you avoid removing required fields in schema evolution?

Q5

What is a common strategy when you need to make a breaking schema change?
