Change data capture (CDC)

Change data capture (CDC) is an enterprise data integration pattern often used to stream real-time updates from a relational database to other consumers.

Foundry supports syncing, processing, and storing data from systems that produce change data capture feeds. Components throughout the platform, including data connectors, streams, pipelines, and the Foundry Ontology, can all work natively with this changelog data using the metadata configured on the schema of the underlying data.

Changelog metadata requires the following attributes on the data:

  • One or more primary key columns
    • Changelog data will contain many entries with identical primary key values, as changelogs are meant to convey every individual change that occurred on the record with that primary key. Change data capture is particularly useful for data that is being edited, rather than for immutable or append-only data feeds.
  • One or more ordering columns
    • The ordering column(s) must be numeric and are used to determine the relative order of changes for records with a given set of primary keys. Ordering columns are frequently timestamps represented as longs, but this is not always the case. The largest value is understood to be the most recent, even if the ordering columns are not timestamps.
  • A deletion column
    • The deletion column is used to determine whether a given update deleted the record with the given set of primary keys. This must be a boolean that is true if the record was deleted and false otherwise.
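
For example, a changelog for a record with primary key id = 42 might contain the following entries (the column names here are illustrative):

id | ordering_ts | is_deleted | status
42 | 1700000000  | false      | created
42 | 1700000100  | false      | shipped
42 | 1700000200  | true       | shipped

All three entries share the same primary key; the entry with the largest ordering value is the most recent, and because its deletion column is true, the record has been deleted in the source system.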

This changelog metadata specifies a resolution strategy that collapses the changelog down to a current view of the data in the source system. The resolution strategy for change data capture works as follows:

  1. Group the data based on the primary key column(s).
  2. Find the entry with the largest value in the ordering column(s).
  3. If this entry has the value true in the deletion column, remove the record from the resolved view.

After these steps, the changelog is collapsed down to the same view as you would see in the upstream source system generating the changelog. The changelog metadata and resolution strategy are used in different ways by various Palantir components, as described below.
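
As a sketch only, the resolution strategy can be expressed in SQL over the example changelog above (Foundry applies the strategy natively from the schema metadata; the table and column names are illustrative):

WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY id
               ORDER BY ordering_ts DESC
           ) AS rn
    FROM changelog
)
SELECT id, ordering_ts, status
FROM ranked
WHERE rn = 1           -- keep only the most recent entry per primary key
  AND is_deleted = 0;  -- drop records whose latest entry is a deletion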

Change data capture in Data Connection

Sources available in Data Connection may support syncing data from systems as a stream with changelog metadata applied. Whether a connector can sync changelog data depends on the ability of the source system to produce log data with the required attributes (primary key column(s), ordering column(s), and a deletion column).

Many commonly used databases, including MySQL, PostgreSQL, Oracle, Db2, and SQL Server, can produce CDC changelog data. Data connectors in Foundry for these systems may or may not support directly syncing these changelogs. Even if a connector does not currently support CDC syncs, you can still sync data via Kafka or other streaming middleware and use it as a changelog once the data arrives in Foundry.

The following source types in Data Connection currently support CDC syncs:

  • Microsoft SQL Server
  • PostgreSQL

Additionally, CDC syncs are available for the following systems through an experimental connector. If you want to use this connector, contact your Palantir representative.

  • Oracle
  • MySQL

If you have changelog-shaped data available from other sources, such as Kafka, Kinesis, Amazon SQS, or push-based integration with streams, read how to manually configure changelog metadata using the key-by board in Pipeline Builder.

Enable change data capture on an external database

To sync changelog feeds from a supported source type, you must have the correct settings enabled for the relevant tables and, in some cases, for the entire database.

For example, to enable change data capture for Microsoft SQL Server, you must run a command to enable CDC on the database:

USE <database>
GO
-- Enable change data capture at the database level:
EXEC sys.sp_cdc_enable_db
GO

Then, run another command on each table that should be recording changelogs:

EXEC sys.sp_cdc_enable_table
    @source_schema = N'<schema>'       -- schema containing the source table
  , @source_name = N'<table_name>'     -- table to capture changes from
  , @role_name = NULL                  -- no gating role; access to changes is not restricted
  , @capture_instance = NULL           -- use the default capture instance name
  , @supports_net_changes = 0          -- record every change, not only net changes
  , @filegroup_name = N'PRIMARY';      -- filegroup used for the change table
GO
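
As another hedged illustration, PostgreSQL change data capture is commonly based on logical replication; typical server-side setup might resemble the following (the table and publication names are placeholders, and exact requirements vary by version and hosting environment):

ALTER SYSTEM SET wal_level = logical;  -- takes effect after a server restart
-- Include full row images for updates and deletes in the log:
ALTER TABLE my_schema.my_table REPLICA IDENTITY FULL;
-- Expose the table's changes to logical replication consumers:
CREATE PUBLICATION my_cdc_publication FOR TABLE my_schema.my_table;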

The above examples provide a high-level illustration of what may be required to enable a source system to produce changelog data. For specific instructions on enabling change data capture logs for a given system, refer to the documentation provided by that system.

Set up a source connection for change data capture

Now, you can configure a source in Data Connection that connects to the system from which you want to capture changelogs. Continuing with our example, set up a connection using the Microsoft SQL Server connector, as shown below:

A new source page for CDC data in Data Connection.

Create a change data capture sync

On the source overview page, you will find an empty table of CDC Syncs.

The CDC Syncs section on the overview page of an example Microsoft SQL Server source, outlined in a red rectangle.

Select Create CDC sync to add a new change data capture sync. Specify the table you wish to sync, and the following information will be automatically derived by the connector:

  • The schema for the output streaming dataset.
  • Changelog metadata, including primary key columns, ordering columns, and a deletion column.

The new sync configuration page, with sections to edit the tables, schema, and primary key.

As with other streams, you must specify the expected throughput at creation. The throughput cannot be changed after sync creation; ensure that your stream is configured to support the expected volume of changelog data. The default volume is 5 MB/s, which is typically more than required for most change data capture workflows. Since changes in relational databases are often produced at “human scale”, the volume and frequency of changes are much smaller than what is possible with “machine scale” sensor or equipment data.

Save the sync configuration to create a new stream in the specified output location.

Currently, the CDC job must be manually restarted after any configuration changes. All CDC syncs run as a single multi-output extract job, meaning that any existing CDC streams from the same source must be briefly stopped whenever a feed is added or removed. The streams will gracefully catch up with any data that changed while the streaming sync job was stopped.

After starting the output stream, changelog data should begin flowing and appear in the live view of the output streaming dataset.

Change data capture in streams

Streams with changelog metadata will display two views into the data:

  • A live view showing the fully expanded list of changelog entries.
  • An archive view where the data is resolved according to the resolution strategy and collapsed down to the current latest view of the data.

The dropdown menu to switch between live and archive views, located in the upper left corner of the page.

The schema will display changelog metadata as a primary key resolution strategy in the details view.

Streaming change data capture schema with resolution strategy.

Streams currently use the ordering column to perform resolution. This means that data will be resolved in the archive according to the provided ordering column, even if the data was received out of order. This behavior differs from the behavior of changelog data in the Foundry Ontology, which indexes data based on the order of arrival in the stream used to back the object type.

Change data capture in Pipeline Builder

Streaming transformations in Pipeline Builder may be used to process changelog data. As long as the metadata columns are not modified by the transformations, changelog metadata will be automatically propagated on any outputs.

If the input stream did not have changelog metadata or the metadata columns were transformed by the pipeline, you may use the “key by” board to “key” the data and apply changelog metadata on the output(s).

The Key By board in Pipeline Builder.

Change data capture in the Ontology

The Ontology uses changelog metadata to index data into object types in Object Storage V2 that are backed by a streaming data source. The data arriving in the stream is resolved and indexed into a latest current view that is available when querying the Ontology (to display in a Workshop module, for example).

If retention is configured on a data source with changelog metadata, any records that do not receive updates within the retention window will disappear from the Ontology.

The Ontology currently ignores the ordering column(s) specified in the changelog metadata. Instead, Object Storage V2 indexes data based on the order in which it arrives in the backing data source. Concretely, this means that for a given primary key, if a log entry with an ordering value of 2 arrives at time t1 with data_column=foo, followed by a log entry with an ordering value of 1 that arrives at time t2 with data_column=bar, the record will appear with data_column=bar, even though the most recent value in the source system is data_column=foo. This can cause the Ontology to incorrectly reflect the data in the source system if data arrives out of order.

Since connectors used with Palantir are guaranteed to deliver data in order, and Foundry streams maintain ordering, this Ontology behavior will likely only affect custom setups or older streaming changelogs that are manually backfilled and not re-ordered before syncing. If you encounter this situation, we recommend applying a transformation to reorder the data in Pipeline Builder before syncing to the Ontology.

Change data capture in Workshop

Workshop supports auto-refresh to display frequently updating data as soon as it is available in the Ontology. Auto-refresh is compatible with CDC and may be used to ensure that any data streamed in with change data capture is promptly available in Workshop applications.

Auto-refresh is available for any data that is expected to update frequently while the Workshop module is open. Data is not required to have changelog metadata on the backing data source to use auto-refresh.

Considerations when using change data capture

We recommend reviewing the following information on backfills, outages, and other known limitations before using CDC workflows.

Backfill

All changelog syncs are handled on an exclusively "going forward" basis; no automatic backfill is performed.

Often, a full backfill of changelogs is not possible, since most systems do not enable CDC by default. Even if changelogs are enabled, most systems include a retention period, after which changelogs are permanently deleted and no longer recoverable.

If a full backfill is required, we recommend the following:

  1. Set up a CDC stream on a "going forward" basis.
  2. Perform a batch sync to extract the desired historical data.
  3. Convert the historical batch data into a stream of “create” records for each primary key, then merge that stream into the CDC stream, as sketched below.
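
For step 3, a hypothetical SQL sketch of converting the historical extract into “create” records (the table and column names are illustrative and assume the layout shown earlier):

SELECT
    id
  , 0 AS ordering_ts              -- sorts before every genuine change entry
  , CAST(0 AS BIT) AS is_deleted  -- backfilled rows are creates, not deletes
  , status
FROM historical_batch_extract;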

Backfills may result in data that is out of order, and you may need to manually reorder or replay streams to properly prepare data for syncing to the Ontology.

Outages

You may encounter the following outages when using a CDC workflow:

  • Loss of network connectivity between the source database and the Foundry agent
  • Loss of network connectivity between the Foundry agent host and Foundry
  • Foundry outage
  • Database outage
  • Agent outage

Outages are handled gracefully if the retention window on replication logs and changelogs in the database is configured to be longer than the maximum expected outage.

For example, if the connection to Foundry goes down for several hours, and the log retention window on Microsoft SQL Server is set to one day, the database will continue recording changelog entries. Foundry will gracefully catch up once it is back online and reconnected to Microsoft SQL Server. Since no new data will flow until the queue of changelog entries is cleared, there may be some lag before changes are once again flowing into Palantir in near real time.
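
As a hedged example, the retention period for the Microsoft SQL Server CDC cleanup job can be adjusted with sys.sp_cdc_change_job; retention is specified in minutes, so the sketch below sets it to one day:

EXEC sys.sp_cdc_change_job
    @job_type = N'cleanup'
  , @retention = 1440;  -- one day, in minutes
GO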

Use changelog-shaped data from non-changelog sources

Data does not need to be ingested as a changelog to use CDC workflows. Any streaming data in Palantir may be “keyed” with changelog metadata and then used as CDC data in workflows after syncing to the Ontology.

This means, for example, that push-based ingestion using stream proxy may be used to manually push changelog-shaped records into a stream.

Similarly, if changelog data is available in a Kafka topic, it may be ingested using a standard (non-CDC) sync. Then, it can be “keyed” using Pipeline Builder and used in the Ontology and beyond.

Remove changelog metadata

Sometimes, it may be useful to remove changelog metadata. For example, you might want to remove metadata to analyze the process flow captured by the changelog. To remove changelog metadata, use one of the following methods:

  • Perform any transformation on the metadata columns, as sketched below
  • Manually remove the resolution strategy from the schema of the streaming dataset
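
As a hypothetical sketch of the first method, expressed as SQL for illustration: recasting the ordering column counts as a transformation of a metadata column, so the output no longer carries changelog metadata (the table and column names are illustrative):

SELECT
    id
  , CAST(ordering_ts AS VARCHAR(32)) AS ordering_ts  -- modified metadata column
  , is_deleted
  , status
FROM changelog_stream;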