4. Scheduling Data Pipelines

3 - Defining what your Schedule will Build

This content is also available at learn.palantir.com ↗ and is presented here for accessibility purposes.

📖 Task Introduction

Schedule targets represent the terminal points for a given schedule and are built on the branch set in the top right of the Data Lineage application.

🔨 Task Instructions

  1. Click the flight_alerts_clean dataset node on the Data Lineage graph and note the available options. For a Connecting Build, the UI presents three "WHAT" options and one "WHEN" option (plus a Clear button to remove any designation).

    • Input: These datasets are not built but used as inputs to the subsequent step in the pipeline. Remember that a Connecting Build builds all datasets between the inputs and triggers, excluding the inputs but including the targets.
    • Target: The final datasets to be built in a schedule.
    • Excluded: Use this option if there are datasets between the inputs and targets that you do not want to execute as part of your pipeline.
    • Trigger: Designating a dataset as a trigger makes it a condition for executing your pipeline; triggers are discussed in the next task.
  2. Choose Target. The flight_alerts_clean dataset now appears in the Target datasets section of the Scheduler window.

  3. Hold shift and drag a selection box around the raw/flight_alerts_raw, priority_mapping_raw, and status_mapping_raw datasets, then select them as Inputs. This means they will not be built when the schedule triggers, but they will be used as inputs to the builds downstream.

    • The three selected datasets now appear in the Input datasets section of the Scheduler panel, and the preprocessed dataset nodes are now blue. As the legend indicates, the schedule will attempt to build these datasets because they sit between the inputs and the targets (hence it creates a “connecting” build).
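To make the selection rule concrete, here is a minimal sketch (not the Foundry API; the graph-walking helpers and the exact edge list are illustrative assumptions) of which datasets a connecting build would execute: everything between the inputs and the targets, excluding the inputs, including the targets, minus any datasets marked Excluded.

```python
# Hypothetical sketch of connecting-build selection -- NOT Foundry's API.
# Rule from this task: build all datasets between the inputs and the
# targets, excluding the inputs but including the targets, and skip
# anything explicitly marked Excluded.
from collections import defaultdict

def reachable(graph, starts):
    """All nodes reachable from `starts`, not counting the starts themselves."""
    seen, stack = set(), list(starts)
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def connecting_build(edges, inputs, targets, excluded=()):
    downstream = defaultdict(list)   # dataset -> datasets that read it
    upstream = defaultdict(list)     # dataset -> datasets it reads from
    for src, dst in edges:
        downstream[src].append(dst)
        upstream[dst].append(src)
    below_inputs = reachable(downstream, inputs)   # strictly after the inputs
    above_targets = reachable(upstream, targets)   # strictly before the targets
    between = below_inputs & (above_targets | set(targets))
    return between - set(inputs) - set(excluded)

# Edges are an assumed shape for this tutorial's pipeline:
edges = [
    ("flight_alerts_raw", "flight_alerts_preprocessed"),
    ("priority_mapping_raw", "priority_mapping_preprocessed"),
    ("status_mapping_raw", "status_mapping_preprocessed"),
    ("flight_alerts_preprocessed", "flight_alerts_clean"),
    ("priority_mapping_preprocessed", "flight_alerts_clean"),
    ("status_mapping_preprocessed", "flight_alerts_clean"),
]
built = connecting_build(
    edges,
    inputs=["flight_alerts_raw", "priority_mapping_raw", "status_mapping_raw"],
    targets=["flight_alerts_clean"],
)
# The three preprocessed datasets plus the target are built;
# the three raw inputs are read but never rebuilt.
```

Under these assumed edges, `built` contains exactly the three preprocessed datasets (the blue nodes) and flight_alerts_clean, which matches what the Scheduler panel shows.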

ℹ️ Why didn’t we select the (simulated) data source as the input? Recall that the grouped datasets furthest upstream merely simulate external data sources, and the datasets you just marked as Inputs simulate the raw table copies from those external sources. You should configure Data Connection sync schedules separately from the rest of a build using the Force Build option discussed later in this tutorial.