Breaking changes

Breaking changes occur when stateful functions are modified in streaming or incremental pipelines. Transforms are either row-level or stateful.

  • Row-level transform: A transform that requires data from only a single row to produce a result, for example, Multiply numbers or Filter.
  • Stateful function: A transform that requires data across multiple rows to produce a result.
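The distinction can be sketched in plain Python. This is only an illustration and not Pipeline Builder code: `multiply_by_two` and `DropDuplicates` are hypothetical names, and the duplicate check here is unbounded, unlike the time bounded version available in streaming.

```python
from typing import Optional, Set


def multiply_by_two(row: int) -> int:
    """Row-level: the result depends only on the row passed in."""
    return row * 2


class DropDuplicates:
    """Stateful: the result depends on rows seen earlier."""

    def __init__(self) -> None:
        self.seen: Set[int] = set()  # state carried across rows

    def process(self, row: int) -> Optional[int]:
        if row in self.seen:
            return None  # duplicate, drop it
        self.seen.add(row)
        return row


rows = [3, 5, 3, 8]

# Each output value is computed from one input row in isolation.
row_level_output = [multiply_by_two(r) for r in rows]

# Whether a row is kept depends on every row processed before it.
dedup = DropDuplicates()
stateful_output = [r for r in rows if dedup.process(r) is not None]
```

Changing `multiply_by_two` affects only future rows, while changing the logic of `DropDuplicates` leaves behind a `seen` set that was built under the old logic.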

There are five main stateful functions:

  • Aggregate (Aggregate over Window in Streaming)
  • Outer caching join (only in streaming)
  • Heartbeat detection (only in streaming)
  • Time bounded drop duplicates (only in streaming)
  • Time bounded event time sort (only in streaming)

When a stateful function is modified, the previous output may no longer be accurate. For example, imagine you are filtering to even numbers and taking the sum of that set. If you change the filter to be all odd numbers, the existing state will be the sum of even numbers, but all new filtered values will be odd. Therefore, what the sum represents is now ambiguous, being the sum of a set of even numbers added to the sum of a set of odd numbers. To refresh the state, you can run a replay.
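The even and odd example above can be sketched as follows. This is a minimal illustration in plain Python, not Pipeline Builder code; `RunningSum`, `is_even`, `is_odd`, and the sample batches are all hypothetical.

```python
from typing import Callable, Iterable


class RunningSum:
    """Hypothetical stateful aggregate that keeps a sum across batches."""

    def __init__(self) -> None:
        self.total = 0  # state that survives between batches

    def process(self, rows: Iterable[int], keep: Callable[[int], bool]) -> int:
        for value in rows:
            if keep(value):
                self.total += value
        return self.total


def is_even(n: int) -> bool:
    return n % 2 == 0


def is_odd(n: int) -> bool:
    return n % 2 == 1


first_batch = [1, 2, 3, 4]
second_batch = [5, 6, 7]

# Without a replay: the filter changes between batches, the state does not.
state = RunningSum()
state.process(first_batch, is_even)                # state holds 2 + 4
mixed_total = state.process(second_batch, is_odd)  # adds 5 + 7 on top

# With a replay: state is rebuilt from the start using the new filter.
replayed = RunningSum()
replayed.process(first_batch, is_odd)
replayed_total = replayed.process(second_batch, is_odd)
```

Here `mixed_total` is 18, a mix of even and odd values, while `replayed_total` is 16, the sum of the odd values only.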

There are two types of replays:

  • Replay from start of input data: Replays your pipeline from the start of the data. For a stream input, this is the start of the stream; for an incremental dataset input, this is the first transaction on the dataset.

The Deploy panel with the replay strategy to replay from the start of input data.

  • Replay from amount of time ago (only available for Streaming): Replays the pipeline using upstream data starting from a specified amount of time ago. The granular replay includes all data starting with the first transaction that committed before the time specified; data before that transaction is not processed. This means you may get one transaction's worth of data from before the time you specify.

The Deploy panel with the replay strategy to replay from amount of time ago.
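The transaction boundary for a replay from an amount of time ago can be sketched as below. This is one reading of the behavior described above, not the actual implementation: it assumes the replay starts at the most recent transaction committed before the cutoff, and `select_replay_transactions` and the sample commit times are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def select_replay_transactions(
    commit_times: List[datetime], now: datetime, lookback: timedelta
) -> List[datetime]:
    """Return the commit times of the transactions a replay would process."""
    cutoff = now - lookback
    ordered = sorted(commit_times)
    start = 0
    for index, committed_at in enumerate(ordered):
        if committed_at < cutoff:
            start = index  # latest transaction committed before the cutoff
    return ordered[start:]


commits = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 11, 0),
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 13, 0),
]

# Replay from 2 hours ago at 13:30, so the cutoff is 11:30.
selected = select_replay_transactions(
    commits, now=datetime(2024, 1, 1, 13, 30), lookback=timedelta(hours=2)
)
```

Under this assumption, the 11:00 transaction is included even though it committed before the 11:30 cutoff, and the 10:00 transaction is skipped.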

Replays can be optional or required; in the case of breaking changes, Pipeline Builder automatically detects this change and requires a replay on deploy. The image below shows a forced replay in an Incremental pipeline.

The Deploy panel with a forced replay due to breaking changes.

Replaying your pipeline could lead to lengthy downtimes, possibly as long as multiple days. When you replay your pipeline, your stream history will be lost and all downstream pipeline consumers will be required to replay.