Incremental pipelines are often used to process input datasets that change significantly over time. By avoiding unnecessary computation over the rows or files that have not changed, incremental pipelines enable lower end-to-end latency while minimizing compute costs.
However, incremental pipelines carry additional development and maintenance complexity that you should be aware of before getting started.
Here are some of the factors related to incremental pipelines that you may want to consider:
Although inputs to an incremental pipeline normally change through `APPEND` transactions, your logic must be resilient to input datasets occasionally being recomputed, which results in a `SNAPSHOT` transaction. Ideally, your transformation logic should be written with thorough unit tests to validate its behavior before this happens in practice. A sketch of a transform that handles both cases follows.
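As a minimal illustration, the sketch below assumes a Python transform written with Foundry's `transforms.api`, where the `@incremental` decorator exposes `ctx.is_incremental`; the dataset paths and the `event_type` column are hypothetical placeholders.

```python
from pyspark.sql import functions as F
from transforms.api import Input, Output, incremental, transform


@incremental()
@transform(
    out=Output("/examples/clean_events"),   # hypothetical output path
    events=Input("/examples/raw_events"),   # hypothetical input path
)
def compute(ctx, events, out):
    if ctx.is_incremental:
        # Normal case: the input grew via an APPEND transaction, so only
        # the rows appended since the last build need to be processed.
        df = events.dataframe("added")
    else:
        # Edge case: the input was recomputed in a SNAPSHOT transaction,
        # so the full dataset is re-read and the output is rebuilt.
        df = events.dataframe()
    # The same transformation runs in both branches, which keeps the
    # snapshot fallback easy to exercise in unit tests.
    out.write_dataframe(df.filter(F.col("event_type").isNotNull()))
```

Funneling both branches through the same transformation logic means a unit test only needs to toggle the incremental flag to cover the recompute case.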
Performance can degrade over time as `APPEND` transactions are applied, causing datasets to consist of a large volume of small files. Managing this includes understanding how Spark handles large numbers of files and how this affects Spark partitioning. Read more about maintaining high performance for incremental pipelines. One common mitigation is sketched below.
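As an illustration of keeping file counts under control, the sketch below coalesces each incremental batch into a fixed number of files before writing; the dataset paths and the partition count of 4 are hypothetical, and the right number depends on the size of a typical incremental batch.

```python
from transforms.api import Input, Output, incremental, transform


@incremental()
@transform(
    out=Output("/examples/compacted_events"),   # hypothetical output path
    events=Input("/examples/raw_events"),       # hypothetical input path
)
def compute(events, out):
    # In an incremental build this reads only the newly appended rows.
    df = events.dataframe("added")
    # Coalesce the batch so each APPEND transaction writes a few
    # reasonably sized files rather than many tiny ones.
    out.write_dataframe(df.coalesce(4))
```

If small files have already accumulated, periodically rebuilding the output as a snapshot is a complementary way to compact them.

Get started with incremental pipelines by reviewing the following recommended resources: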