Overview

Incremental pipelines are often used to process input datasets that change significantly over time. By computing only on the rows or files that have changed, rather than the full dataset, incremental pipelines enable lower end-to-end latency while minimizing compute costs.

However, incremental pipelines carry additional development and maintenance complexity that you should be aware of before getting started.

Background

Here are some factors you should consider before building an incremental pipeline:

  • Developing an incremental pipeline requires a thorough understanding of how datasets change over time in Foundry through transactions. You will need to work with dataset transactions in both Data Connection syncs and transformation logic to create and manage an incremental pipeline effectively over time.
  • Once you understand how transactions work in Foundry, you will need to design your pipeline to be resilient to unexpected transactions in your input datasets. Although incremental pipelines generally only process changed data arriving in the form of APPEND transactions, your logic must also handle input datasets occasionally being recomputed, which results in a SNAPSHOT transaction (see the first sketch after this list). Ideally, your transformation logic should be covered by thorough unit tests that validate this behavior before it happens in practice.
  • To ensure incremental pipelines remain performant in the long run, you will need to understand what happens as many APPEND transactions accumulate and a dataset comes to consist of a large number of small files. This includes understanding how Spark handles large numbers of files and how this affects Spark partitioning (see the second sketch after this list). Read more about maintaining high performance for incremental pipelines.
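
The following is a minimal sketch of an incremental transform that stays resilient to a SNAPSHOT transaction on its input. It assumes Foundry's Python transforms API (`transforms.api`); the dataset paths, column name, and filter logic are placeholders for illustration, not part of the original documentation.

```python
from pyspark.sql import functions as F
from transforms.api import Input, Output, incremental, transform


@incremental(semantic_version=1)
@transform(
    events=Input("/Project/raw/events"),        # placeholder input path
    processed=Output("/Project/clean/events"),  # placeholder output path
)
def compute(ctx, events, processed):
    if ctx.is_incremental:
        # Incremental run: read only the rows added since the last build,
        # which typically correspond to new APPEND transactions on the input.
        new_rows = events.dataframe("added")
    else:
        # Fallback run: the input was rewritten (a SNAPSHOT transaction) or the
        # semantic version changed, so the full input is reprocessed.
        new_rows = events.dataframe()

    cleaned = new_rows.filter(F.col("event_type").isNotNull())

    # On incremental runs the output is modified in place (new rows are appended);
    # on fallback runs it is replaced with a fresh snapshot.
    processed.write_dataframe(cleaned)
```

Unit tests should exercise both branches so that an unexpected SNAPSHOT on the input does not silently change the output semantics.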

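To keep file counts under control as APPEND transactions accumulate, one common approach is to bound the number of files each build writes. The sketch below, again assuming the Python transforms API, coalesces the newly added rows into a small number of partitions before writing; the partition count of 4 is illustrative, not a recommendation from the original documentation.

```python
from transforms.api import Input, Output, incremental, transform


@incremental()
@transform(
    events=Input("/Project/raw/events"),             # placeholder input path
    compacted=Output("/Project/clean/events_daily"), # placeholder output path
)
def compute(events, compacted):
    # On incremental runs this reads only the newly added rows by default.
    new_rows = events.dataframe()

    # Coalesce each batch of new rows into a bounded number of files before writing,
    # so that repeated incremental builds do not accumulate thousands of tiny files.
    compacted.write_dataframe(new_rows.coalesce(4))
```

Even with per-build coalescing, an incrementally written output may still benefit from occasional full snapshot builds to compact historical files; see the performance guidance referenced above.
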
Getting started

Get started with incremental pipelines by reviewing the following recommended resources: