Breaking changes

Breaking changes occur when stateful functions are modified in streaming or incremental pipelines. Transforms are either row-level or stateful.

  • Row-level transform: A transform that requires data from only a single row to produce a result, for example Multiply numbers or Filter.
  • Stateful function: A transform that requires data across multiple rows to produce a result.

There are five main stateful functions:

  • Aggregate (Aggregate over Window in Streaming)
  • Outer caching join (only in streaming)
  • Heartbeat detection (only in streaming)
  • Time bounded drop duplicates (only in streaming)
  • Time bounded event time sort (only in streaming)

When a stateful function is modified, the previous output may no longer be accurate. For example, imagine you are filtering to even numbers and taking the sum of that set. If you change the filter to keep only odd numbers, the existing state will still be the sum of even numbers, but all newly filtered values will be odd. The sum is then ambiguous: it combines the sum of a set of even numbers with the sum of a set of odd numbers. To refresh the state, you can run a replay.
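The even/odd example above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Pipeline Builder stores state; the function and variable names (`running_sum`, `first_batch`, and so on) are hypothetical.

```python
def running_sum(state, batch, keep):
    """Fold one incremental batch into the saved sum (the 'state')."""
    return state + sum(x for x in batch if keep(x))


def is_even(x):
    return x % 2 == 0


def is_odd(x):
    return x % 2 == 1


first_batch = [1, 2, 3, 4]
second_batch = [5, 6, 7, 8]

# First deployment: filter to even numbers, then sum.
state = running_sum(0, first_batch, is_even)  # 2 + 4

# Logic changed to odd numbers, but the old state is kept (no replay).
# The result mixes an even-number sum with an odd-number sum.
mixed_state = running_sum(state, second_batch, is_odd)

# Replay: discard the state and reprocess all input with the new logic.
replayed_state = running_sum(0, first_batch + second_batch, is_odd)

print(state, mixed_state, replayed_state)
```

Only `replayed_state` is a sum of odd numbers alone, which is why a change to the filter upstream of an aggregate is treated as a breaking change.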

There are two types of replays:

  • Replay from start of input data: Replays your pipeline from the start of the data. For a stream input, this is the start of the stream; for an incremental dataset input, this is the first transaction on the dataset.

The Deploy panel with the replay strategy to replay from the start of input data.

  • Replay from amount of time ago (only available for Streaming): Replays the pipeline using upstream data starting from a specified amount of time ago. The granular replay includes all data starting with the first transaction that committed before the specified time; data before that transaction is not processed. This means you may get one transaction's worth of data from before the time you specify.

The Deploy panel with the replay strategy to replay from amount of time ago.
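To make the transaction boundary for a time-based replay concrete, here is a minimal sketch. It assumes that "the first transaction that committed before the time specified" means the most recent transaction whose commit time is earlier than the cutoff, and that commit times are sorted in ascending order. The function `replay_start_index` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def replay_start_index(commit_times, cutoff):
    """Return the index of the first transaction to replay.

    Assumption: the replay starts at the latest transaction that
    committed before the cutoff. If none did, start from index 0.
    """
    start = 0
    for i, committed_at in enumerate(commit_times):
        if committed_at < cutoff:
            start = i
        else:
            break
    return start


commits = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 11, 0),
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 13, 0),
]
now = datetime(2024, 1, 1, 13, 30)
cutoff = now - timedelta(hours=2)  # replay from two hours ago

start = replay_start_index(commits, cutoff)
replayed = commits[start:]
print(start, replayed[0])
```

In this sketch the replay begins with the transaction committed just before the cutoff, so the output includes one transaction's worth of data from before the requested time.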

Replays can be optional or required. When you make a breaking change, Pipeline Builder automatically detects it and requires a replay on deploy. The image below shows a forced replay in an incremental pipeline.

The Deploy panel with a forced replay due to breaking changes.

Replaying your pipeline could lead to lengthy downtime, possibly as long as multiple days. When you replay your pipeline, your stream history is lost and all downstream pipeline consumers are required to replay.

State-preserving modifications

Pipeline Builder includes features that allow certain pipeline modifications without a replay. These features enable you to continue processing from where you left off, preserving your stream history and avoiding impact to downstream consumers.

Input and output modifications

You can add or remove inputs and outputs without triggering a full replay.

  • Adding inputs: New inputs are read from the beginning, while existing inputs continue from their last processed position.
  • Removing inputs and outputs: The state associated with removed inputs or outputs is dropped from the processing cluster without requiring a replay.
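The behavior described in the list above can be pictured as reconciling saved read positions with the new set of inputs. The sketch below is a simplified model under that assumption; `reconcile_offsets` and the input names are hypothetical and do not correspond to a Pipeline Builder API.

```python
def reconcile_offsets(saved_offsets, current_inputs):
    """Compute where each input resumes after inputs are added or removed.

    - Existing inputs keep their last processed position.
    - New inputs start from the beginning (offset 0).
    - State for removed inputs is simply dropped.
    """
    return {name: saved_offsets.get(name, 0) for name in current_inputs}


saved = {"orders": 120, "clicks": 45}

# "clicks" is removed and "users" is added in the new deployment.
resumed = reconcile_offsets(saved, ["orders", "users"])
print(resumed)
```

Because no existing position is reset, the pipeline continues from where it left off instead of replaying.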

When Pipeline Builder detects input or output changes, a state-break module prompts you to acknowledge the change. This acknowledgment tells the system to continue processing from where it left off rather than requiring a replay.

The state-break acknowledgment dialog for input and output changes.

Schema changes

Input schemas are pinned when you deploy your pipeline. If an input schema changes, the pipeline continues reading data using the previous schema until you manually redeploy.
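As a rough mental model of schema pinning, consider the sketch below. It assumes that reading with a pinned schema means columns added upstream after deployment are not visible to the pipeline until the next deploy; that detail, along with the `PinnedInput` class and its methods, is hypothetical and used only for illustration.

```python
class PinnedInput:
    """Reads rows using the schema captured at deploy time."""

    def __init__(self, schema_at_deploy):
        self.pinned_schema = list(schema_at_deploy)

    def read(self, row):
        # Assumption: only columns in the pinned schema are read.
        return {col: row.get(col) for col in self.pinned_schema}

    def redeploy(self, live_schema):
        # A manual redeploy re-pins the schema to the current one.
        self.pinned_schema = list(live_schema)


source = PinnedInput(["id", "amount"])

# The upstream schema gains a "currency" column after deployment.
new_row = {"id": 1, "amount": 9.5, "currency": "EUR"}
before = source.read(new_row)

source.redeploy(["id", "amount", "currency"])
after = source.read(new_row)
print(before, after)
```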

Selective data re-ingestion

You can re-ingest data from a specific point in time without resetting output views. When you choose to re-ingest, all data present in the outputs at the time of re-ingestion is preserved, allowing you to reprocess historical data while maintaining your existing output state.

To configure this behavior, expand the Advanced section in the Deploy panel and disable the Reset Outputs on replay option when replaying your pipeline.

The Advanced section in the Deploy panel showing the option to preserve output views during replay.
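The effect of the Reset Outputs on replay option can be sketched as follows. This is a simplified model that assumes reprocessed rows are added alongside the preserved rows when the option is disabled; `replay_outputs` and its parameters are hypothetical names, not product settings.

```python
def replay_outputs(existing_rows, reprocessed_rows, reset_outputs):
    """Model the output view after a replay.

    reset_outputs=True:  the output is rebuilt from reprocessed data only.
    reset_outputs=False: rows already in the output are preserved and the
                         reprocessed rows are added after them (assumption).
    """
    if reset_outputs:
        return list(reprocessed_rows)
    return list(existing_rows) + list(reprocessed_rows)


existing = ["a", "b"]
reprocessed = ["b2", "c"]

with_reset = replay_outputs(existing, reprocessed, reset_outputs=True)
preserved = replay_outputs(existing, reprocessed, reset_outputs=False)
print(with_reset, preserved)
```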