When building pipelines, you will often share transform nodes between multiple outputs. By default, this shared logic is recomputed once for each output. With checkpoints in Pipeline Builder, you can mark transform nodes as “checkpoints” to save intermediate results during your next build. Logic upstream of a checkpoint node is computed only once for all of its shared outputs, saving compute resources and decreasing build times.
Checkpoints are only available in batch pipelines. Outputs must be in the same job group for checkpoint nodes to improve pipeline efficiency. Learn more about job groups in Pipeline Builder.
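Pipeline Builder manages all of this for you, but as a rough analogy in plain PySpark (assuming a Spark-style engine; the dataset paths and column names below are hypothetical), the difference a checkpoint makes looks like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# checkpoint() requires a checkpoint directory; this path is hypothetical
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

raw = spark.read.parquet("/data/requests_raw")  # hypothetical input

# Shared transform logic: without a checkpoint, each write below
# re-executes this entire lineage starting from `raw`.
shared = raw.filter("status IS NOT NULL").dropDuplicates(["request_id"])

# Materialize the intermediate result once; both downstream writes
# now read the stored data instead of recomputing `shared`.
shared = shared.checkpoint(eager=True)

shared.select("request_id", "attachment_id").write.parquet("/data/attachment")
shared.select("request_id", "status").write.parquet("/data/request")
```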
Below is an example pipeline that produces two outputs: Attachment and Request. The transform node Checkpoint is shared between the two outputs. In this current state, the logic nodes Clean and Checkpoint would be computed twice, once for each output.
However, we want to compute Clean and Checkpoint only once for both outputs. To do this, right-click on Checkpoint and select Mark as checkpoint. A light blue badge will now appear in the top corner of the Checkpoint node.
Now, add both outputs to the same job group to verify the checkpoint behavior. Right-click one of the outputs (Request) and select Assign job group. Choose New group to open the Build settings panel. Since the datasets are in different job groups by default, the checkpoint will be recomputed for each output, negating any benefits. To fix this, add the other output (Attachment) to the same job group by selecting the output, and then selecting Add to group... at the bottom of the panel.
Learn more about configuring nodes in Pipeline Builder with color groups and job groups.
Checkpoints push the entire result of a transform to storage, such as the Hadoop Distributed File System (HDFS). For example, if you checkpoint a join, the entire result of the join will be output to storage. This can result in a large amount of data being stored, even if the dataset output is small.
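For intuition, here is a minimal PySpark sketch of that storage cost (again assuming a Spark-style engine; paths and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # hypothetical dir

requests = spark.read.parquet("/data/request")        # hypothetical inputs
attachments = spark.read.parquet("/data/attachment")

# The checkpoint writes every row of the join result to storage,
# even though the aggregate written at the end is tiny.
joined = requests.join(attachments, "request_id").checkpoint(eager=True)

summary = joined.groupBy("status").count()  # small final output
summary.write.parquet("/data/request_summary")
```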