Checkpoints

When building pipelines, you will often share transform nodes between multiple outputs. By default, this shared logic is recomputed once for each output. With checkpoints in Pipeline Builder, you can mark a transform node as a checkpoint to save its intermediate result during your next build. Logic at and upstream of the checkpoint node will then be computed only once for all outputs that share it, saving compute resources and decreasing build times.

Checkpoints are only available in batch pipelines. Outputs must be in the same job group for checkpoint nodes to improve pipeline efficiency. Learn more about job groups in Pipeline Builder.
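Pipeline Builder manages checkpoint materialization for you through the graph, but the underlying idea can be illustrated with a minimal PySpark sketch. This is an analogy only, not Pipeline Builder code; the paths, columns, and transforms below are hypothetical. Writing two outputs from the same derived DataFrame re-executes its lineage for each write unless the intermediate result is checkpointed.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Spark needs a directory for checkpoint files; this plays the role of the
# storage that holds a checkpoint node's saved result.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # hypothetical path

raw = spark.read.parquet("/data/raw")  # hypothetical input dataset

# Shared transform logic used by more than one output.
shared = (
    raw.filter(F.col("status").isNotNull())
       .withColumn("ingested_at", F.current_timestamp())
)

# Without this call, the shared lineage above would be re-executed once per
# write below; checkpoint() materializes the intermediate result once.
shared = shared.checkpoint()

# Two outputs built from the same checkpointed intermediate result.
shared.filter(F.col("type") == "a").write.parquet("/out/output_a")
shared.filter(F.col("type") == "b").write.parquet("/out/output_b")
```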

Add a checkpoint node

Below is an example pipeline that produces two outputs: Attachment and Request. The transform node Checkpoint is shared between the two outputs. In its current state, the logic nodes Clean and Checkpoint would be computed twice, once for each output.

A checkpoint node on a graph in Pipeline Builder

However, we want Clean and Checkpoint to be computed only once for both outputs. To do this, right-click the Checkpoint node and select Mark as checkpoint.

Select Mark as checkpoint near the bottom of the node menu.

A light blue badge will now appear in the top corner of the Checkpoint node.

The checkpoint node is now marked as a checkpoint on the graph.

Now, add both outputs to the same job group so the checkpoint takes effect. Right-click one of the outputs (Request) and select Assign job group, then choose New group to open the Build settings panel.

Use the node menu to assign an output node to a job group.

Since the datasets are placed in different job groups by default, the checkpointed logic will be recomputed for each output, negating any benefit. To fix this, add the other output (Attachment) to the same job group by selecting the output, and then selecting Add to group... at the bottom of the panel.

Add another output node to the same job group in the Build Settings panel.

Learn more about configuring nodes in Pipeline Builder with color groups and job groups.

Checkpoint storage costs

Checkpoints write the entire result of a transform to storage, such as the Hadoop Distributed File System (HDFS). For example, if you checkpoint a join, the entire result of the join is written to storage. This can result in a large amount of stored data, even if the pipeline's output dataset is small.
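As a rough illustration of why this matters, the PySpark sketch below (again hypothetical paths and tables, not Pipeline Builder code) checkpoints a join: every joined row is materialized to the checkpoint location, even though the only downstream output is a small aggregate.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # hypothetical path

requests = spark.read.parquet("/data/requests")        # hypothetical inputs
attachments = spark.read.parquet("/data/attachments")

# Checkpointing here writes every joined row to the checkpoint directory,
# which can be far larger than either input.
joined = requests.join(attachments, "request_id").checkpoint()

# The final output is small (one row per request), but the checkpointed
# join result above is still stored in full.
summary = joined.groupBy("request_id").agg(F.count("*").alias("attachment_count"))
summary.write.parquet("/out/summary")
```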