Schema Evolution | CDC: The Missing Manual

Source database schemas are not static. Over the lifecycle of an application, developers will inevitably make changes: adding new columns, removing old ones, or modifying data types. This phenomenon, known as schema drift or schema evolution, is a primary cause of brittle data pipelines. A pipeline not designed for this will break the moment an incoming change event no longer matches what downstream consumers expect.

The Solution: The Schema Registry (An API Contract for Your Data)

The industry-standard solution is to treat your data's schema like a formal API contract. This is managed by a Schema Registry. Think of it as the 'OpenAPI' or 'Swagger' for your event streams. It's a centralized service that ensures all data producers and consumers agree on the 'shape' of the data, even as that shape evolves over time.

By using a schema-aware format like Apache Avro or Protobuf, producers and consumers can be decoupled, allowing them to evolve independently without breaking the pipeline.

Schema evolution flow showing schema registry, converters, and sink adjustments — Schema Registry mediates evolution between producers and consumers with compatibility checks

Choosing Your Contract Strategy: Compatibility Modes

A schema registry isn't just a database of schemas; it actively enforces rules. When a producer tries to register a new schema version, the registry checks it against a configured compatibility mode. Understanding these modes is critical for operations.

Backward Compatibility

(New schemas can read old data)

This mode allows you to delete fields and add optional new fields (with a default value). Consumers can upgrade at their own pace without breaking when they encounter old data.

Use this for most standard use cases. It offers the most operational flexibility.

Forward Compatibility

(Old schemas can read new data)

This mode allows you to add new fields and delete optional old fields. This is less common and requires careful coordination, as you must upgrade all consumers before producers start sending data with the new schema.

Full Compatibility

(Both backward and forward)

Ensures old consumers can read new data and new consumers can read old data. Safest—but most restrictive. Good for long-lived topics with many consumers.

Transitive vs Latest

Transitive checks a new schema against all previous versions. Latest checks only against the most recent version. Prefer transitive for critical topics to avoid “drift over time.”

For most CDC pipelines, start with Backward (often transitive). It balances safety with agility, letting consumers upgrade at their own pace while preventing breaking producer changes.

Subject naming & scope

Compatibility is enforced per subject (a Registry namespace). Common strategies:

TopicNameStrategy: <topic>-value (simple, one record type per topic).
RecordNameStrategy: record’s full name is the subject (lets multiple record types share a topic).
TopicRecordNameStrategy: combines both (avoids name clashes across topics).

Pick once per topic family—changing later is painful.

The Data Flow in a Schema-Aware Pipeline

Producer

Schema Registry

Consumer

Store & return id=1

—

Serialize with id=1 → publish

—

Read message (id=1)

—

Return schema 1

If not cached: request id=1 → cache → deserialize

Upgrade to v2

Producer

Schema Registry

Consumer

Compat check → id=2

—

Publish id=2

—

If unseen: fetch id=2 → cache → deserialize

This flow, mediated by the schema registry, allows a producer to upgrade to a `v2` schema and start sending messages with ID `2`. An old consumer that only knows about ID `1` can either fail gracefully or, if the schemas are compatible, process the new message by ignoring the new fields.

Concrete evolution examples

Avro (v1 → v2)

// v1
{ "type":"record","name":"User","namespace":"demo",
  "fields":[
    {"name":"id","type":"long"},
    {"name":"email","type":"string"}
  ] }
// v2 (backward-compatible): add optional with default; rename with alias
{ "type":"record","name":"User","namespace":"demo",
  "fields":[
    {"name":"id","type":"long"},
    {"name":"email","type":"string"},
    {"name":"full_name","type":["null","string"],"default":null,
     "aliases":["name"]} ] }

OK under backward/backward_transitive. Dropping a required field would violate it.

Protobuf (field numbers matter)

// v1
message User { int64 id = 1; string email = 2; }
// v2 (safe): add new optional field, keep numbers stable
message User { int64 id = 1; string email = 2; string full_name = 3; }

Never reuse or renumber fields; removing a field keeps the tag reserved.

JSON Schema (tolerant readers)

{ "$schema":"https://json-schema.org/draft/2020-12/schema",
  "type":"object",
  "properties":{
    "id":{"type":"integer"},
    "email":{"type":"string"},
    "full_name":{"type":["string","null"]}
  },
  "required":["id","email"],
  "additionalProperties": true }

Set additionalProperties per policy. Many JSON consumers ignore unknown fields—document the expectation.

Safe rollout playbooks

Read-new / Write-old (RN/WO): upgrade consumers first to tolerate the new schema; then upgrade producers.
Two-phase add: add optional fields with defaults → backfill downstream → switch producers to start populating.
Type widening: e.g., int→long or string→[string,null]; avoid narrowing.
Removals: deprecate → stop usage → drop only if policy allows and all consumers have moved.

Registry operations (quick refs)

# Set per-subject compatibility (examples)
curl -X PUT -H "Content-Type: application/json" \
  --data '{"compatibility":"BACKWARD_TRANSITIVE"}' \
  $REGISTRY_URL/config/your-subject

# Get the current compatibility
curl $REGISTRY_URL/config/your-subject

Automate these in CI so topics can’t drift to unsafe modes.

CDC-specific wrinkles

New source columns: existing rows will emit null until updated or backfilled—plan materialization defaults.
Rename vs drop: prefer add new + alias (Avro) and keep the old field until consumers migrate.
Envelope stability: keep event envelope fields (op/ts_ms/transaction/key) stable; evolve payload separately.

Consumer defenses

Ignore unknown fields: choose deserializers/options that tolerate extras.
Defaulting: when a field is missing, apply a deterministic default (document it!).
Quarantine on hard breaks: route to DLQ with schema id and error for later replay.

Audit & test

Contract tests: validate proposed schemas against real consumer fixtures.
Canary topics: mirror a subset of prod through the new schema and compare sink MERGE row counts/checksums.

Handling Schema Evolution in CDC

The Solution: The Schema Registry (An API Contract for Your Data)

Choosing Your Contract Strategy: Compatibility Modes

Backward Compatibility

Forward Compatibility

Full Compatibility

Transitive vs Latest

Subject naming & scope

The Data Flow in a Schema-Aware Pipeline

Upgrade to v2

Concrete evolution examples

Avro (v1 → v2)

Protobuf (field numbers matter)

JSON Schema (tolerant readers)

Safe rollout playbooks

Registry operations (quick refs)

CDC-specific wrinkles

Consumer defenses

Schema Evolution Knowledge Check

What is the primary purpose of a schema registry in CDC?

What does 'backward compatibility' mean in schema evolution?

What does 'forward compatibility' mean in schema evolution?

Why should you avoid removing required fields in schema evolution?

What is a common strategy when you need to make a breaking schema change?

Let’s Talk CDC Interactive Dashboard