Create a dataset batch pipeline with Pipeline Builder

In this tutorial, we will use Pipeline Builder to create a simple pipeline that outputs a single dataset of flight alert information. We can then analyze this output dataset with tools like Contour or Code Workbook to answer questions such as which flight paths have the greatest risk of disruption.

The datasets used below are searchable by name in the dataset import step, and can be found in the Foundry Reference Project in your Foundry filesystem: Foundry Training and Resources/Foundry Reference Project/Tutorial Reference Examples/Track: Data Engineering/Datasource Project: Flight Alerts/datasets.

At the end of this tutorial, you will have a pipeline that looks like the following:

Screenshot of complete Pipeline builder

The pipeline will produce a new dataset output named Flight Alerts data, which can be used for further exploration.

Part 1: Initial setup

First, we need to create a new pipeline.

  1. When logged into Foundry, access Pipeline Builder from the left-side navigation bar under Apps. If it isn’t there, click View all and find Pipeline Builder under the Build & Monitor Pipelines section.

    Screenshot of Pipeline builder link on navigation bar

  2. Next, on the top right of the Pipeline Builder landing page, create a new pipeline by clicking New pipeline. Select Batch pipeline.

    Screenshot of Pipeline selection

The ability to create a streaming pipeline is not available on all Foundry environments. Contact your Palantir representative for more information if your use case requires it.

  3. Select a location to save your pipeline. Note that pipelines cannot be saved in personal folders. Screenshot of Choose pipeline location popover
  4. Click Create pipeline.

Part 2: Add datasets

Now we can add datasets to our pipeline workflow. For this tutorial, we will use sample datasets of notional or open-source data, and all datasets should be available as part of the Foundry Reference Project in your Foundry filesystem.

From the Pipeline Builder page, click Add datasets from Foundry.

Screenshot of Choose pipeline location popover

Alternatively, you can drag-and-drop a file from your computer to use as your dataset.

In our walkthrough example, we will add the passengers_preprocessed, flight_alerts_raw, and status_mapping_raw datasets. To select a dataset, click the + icon inline next to it, or click Add to Selection.

Screenshot of Add dataset from location popover

When all datasets required are selected, click Add datasets.

Screenshot of Choose pipeline location popover

Part 3: Clean data

After adding raw datasets, we can perform some basic cleaning transforms to continue defining our pipeline. We will transform three of our raw datasets.

Dataset 1

First, let’s clean the passengers_preprocessed dataset. We will start by setting up a cast transform that renames the dob column to dob_date while casting its values to dates using the MM/dd/yy format.

Cast transform

  1. Click on the passengers_preprocessed node in your graph.

  2. Click Transform.

    Screenshot of Passengers_preprocessed dataset

  3. Search for and select the cast transform from the dropdown to open the cast configuration board.

    Screenshot of Passengers_preprocessed dataset in transform view

  4. From the Expression field, select dob and for Type, select Date.

  5. Enter MM/dd/yy for the Format type. Be sure to use uppercase MM, which denotes the month (lowercase mm denotes minutes in date format patterns). Change the output column name to dob_date.

    Your cast board should look like this:

    Screenshot of Cast board

  6. Click Apply to add the transform to your pipeline.
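Under the hood, Pipeline Builder compiles these boards into transforms that run on your data. As an illustration only (this is not Foundry's generated code), the cast above behaves roughly like the following pandas sketch, with hypothetical sample rows standing in for the real dataset:

```python
import pandas as pd

# Hypothetical sample rows standing in for passengers_preprocessed.
passengers = pd.DataFrame({"dob": ["03/15/85", "11/02/90"]})

# Cast board equivalent: parse dob using the MM/dd/yy pattern
# (in pandas notation, %m/%d/%y) and write the parsed dates
# to a new dob_date column.
passengers["dob_date"] = pd.to_datetime(passengers["dob"], format="%m/%d/%y").dt.date
```

This also shows why the uppercase MM matters: in date format patterns, MM is the month while lowercase mm means minutes.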

Title case transform

Now we will format the flyer_status column values to start with an uppercase letter.

  1. In the transform search field, search for and select the Title case transform to open the title case configuration board.

  2. In the Expression field, select the flyer_status column from the dropdown.

    Your title case board should look like this:

    Screenshot of Title case board

  3. Click Apply to add the transform to your pipeline.

  4. In the upper left corner of the transform configuration window, rename the transform Passengers_Clean.

    Screenshot of transform

  5. Click Back to graph at the top right to return to your pipeline graph.

    Screenshot of the transform
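For reference, the title case step can be sketched outside Foundry in pandas (illustrative only; the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical flyer_status values.
passengers = pd.DataFrame({"flyer_status": ["gold", "silver elite"]})

# Title case board equivalent: uppercase the first letter of each word.
passengers["flyer_status"] = passengers["flyer_status"].str.title()
```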

Dataset 2

Now, let’s clean the flight_alerts_raw dataset. First, we will set up another cast transform to convert the flight_date column values to dates using the MM/dd/yy format.

Cast transform

  1. Click the flight_alerts_raw dataset node in your graph.

  2. Click Transform.

    Screenshot of transform

  3. Search for and select the cast transform from the dropdown to open the cast configuration board. You can read the function definition listed on the right side of the selection box to learn more about the function. Screenshot of function definition

  4. In the Expression field, select the flight_date column from the dropdown.

  5. Choose Date from the Type field dropdown.

  6. Enter MM/dd/yy for the Format type. Be sure to use uppercase MM to ensure a successful cast transform.

    Your cast board should look like this:

    Screenshot of second cast board

  7. Click Apply to add the transform to your pipeline.

Clean string transform

Now, we will add a Clean string transform to remove extraneous whitespace from category column values. For example, the transform will convert a value like delay··· (where each · represents a trailing space) to delay.

  1. Search for and select the clean string transform from the dropdown to open the clean string configuration board.

  2. In the Expression field, select the category column from the dropdown.

  3. Check the boxes for all three of the Clean actions options:

    • Converts empty strings to null
    • Reduce sequences of multiple whitespace characters to a single whitespace
    • Trims whitespace at beginning and end of string

    Your clean string board should look like this:

    Screenshot of clean string board

  4. Click Apply to add the transform to your pipeline.

  5. In the upper left corner of the transform configuration window, rename the transform Flight Alerts - Clean.

  6. Click Back to graph at the top right to return to your pipeline graph.

    Screenshot of graph with Flight Alerts - Clean node
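The three Clean actions checked above correspond to simple string operations. A pandas sketch of the same logic (the sample values are hypothetical, and this is not Foundry's implementation):

```python
import pandas as pd

# Hypothetical category values with messy whitespace.
alerts = pd.DataFrame({"category": ["delay   ", "  smoke  alert ", ""]})

s = alerts["category"]
s = s.str.strip()                            # trim whitespace at beginning and end
s = s.str.replace(r"\s+", " ", regex=True)   # collapse repeated whitespace to one space
s = s.replace({"": None})                    # convert empty strings to null
alerts["category"] = s
```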

Dataset 3

Finally, let’s clean the status_mapping_raw dataset.

Clean string transform

We will only apply a Clean string transform to this dataset.

  1. Click the status_mapping_raw dataset node in your graph.

  2. Click Transform.

    Screenshot of third dataset transform

  3. In the Search transforms and columns... field, select the mapped_value column from the dropdown.

    Screenshot of mapped value column selection

  4. In the same field, search for and select the clean string transform from the dropdown.

  5. Check the boxes for all three of the Clean actions options:

    • Converts empty strings to null
    • Reduce sequences of multiple whitespace characters to a single whitespace
    • Trims whitespace at beginning and end of string

    Your clean string board should look like this:

    Screenshot of second clean string board

  6. Click Apply to add the transform to your pipeline.

  7. In the upper left corner of the transform configuration window, rename the transform Status Mapping - Clean.

  8. Click Back to graph at the top right to return to your pipeline graph.

    You can see the connection between the transforms you just added and the datasets to which you applied them.

    Screenshot of second clean string board

Part 4: Join datasets

Now, let’s combine some of our cleaned datasets with joins. A join allows you to combine datasets with at least one matching column. We will add two joins to our pipeline workflow.

Join 1

Our first join will combine two of our cleaned datasets.

  1. Click on the Flight Alerts - Clean transform node. This will be the left side of our join.

  2. Select Join.

    Screenshot of configure join option

  3. Click the Status Mapping - Clean node to add it as the right side of the join.

  4. Click Start to open the join configuration board.

    Screenshot of configure join option

  5. Verify that the Join type is set to Left join.

  6. Set the Match condition columns to status is equal to value.

  7. Click Show advanced to view additional configuration options.

  8. Set the Prefix of the right Status Mapping - Clean dataset to status.

    Your join configuration board should look like this:

    Screenshot of input tables

  9. Click Apply to add the join to your pipeline.

  10. View a preview of the join output table in the Preview pane at the bottom of the configuration window.

    Screenshot of Pipeline preview pane

  11. In the upper left corner of the join configuration window, rename the join Join Status.

  12. Click Back to graph at the top right to return to your pipeline graph.

    Screenshot of Pipeline preview pane

  13. To make the graph easier to read, click the Layout icon to automatically arrange the datasets, or manually drag the two connected datasets next to each other.

    Screenshot of Pipeline reorganized
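In dataframe terms, the join configured above is a left join with a column prefix applied to the right side. A pandas approximation with hypothetical rows (Foundry may apply the prefix separator differently):

```python
import pandas as pd

# Hypothetical rows standing in for the two cleaned datasets.
flight_alerts = pd.DataFrame({"alert_id": [1, 2], "status": ["A", "Z"]})
status_mapping = pd.DataFrame({"value": ["A"], "mapped_value": ["Active"]})

# Left join on status == value, prefixing the right-hand columns with
# "status" so they do not collide with existing column names.
joined = flight_alerts.merge(
    status_mapping.add_prefix("status_"),
    how="left",
    left_on="status",
    right_on="status_value",
)
```

Because this is a left join, alerts with no matching status (like the second row here) are kept, with null mapped values.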

Join 2

For our second join, we will combine our first join output table with another raw dataset.

  1. Add the priority_mapping_raw dataset to the graph by clicking Add datasets.

  2. Click on the Join Status node we just added to our graph. This will be the left side of our join.

  3. Select Join.

  4. Click on the priority_mapping_raw dataset node to add it as the right side of our join.

  5. Click Start to open the configuration board.

    Screenshot of Pipeline preview pane

  6. Verify that the Join type is set to Left join.

  7. Set the Match condition columns to priority is equal to value.

  8. Click Show advanced to view additional configuration options.

  9. Set the Prefix of the right priority_mapping_raw dataset to priority.

    Your join configuration board should look like this:

    Screenshot of Pipeline preview pane

  10. Click Apply to add the join to your pipeline.

  11. View a preview of the join output table in the Preview pane at the bottom of the configuration window.

    Screenshot of Pipeline preview pane

  12. In the upper left corner of the join configuration window, rename the join Join (2).

  13. Click Back to graph at the top right to return to your pipeline graph.

You can now see the connection between the joins you just added and the datasets to which you applied them.

Screenshot of Pipeline preview pane
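Chaining the second join onto the first works the same way; sketched in pandas with hypothetical rows (illustrative only):

```python
import pandas as pd

# Hypothetical output row of Join Status, plus a priority mapping row.
joined_status = pd.DataFrame(
    {"alert_id": [1], "priority": ["1"], "status_mapped_value": ["Active"]}
)
priority_mapping = pd.DataFrame({"value": ["1"], "mapped_value": ["High"]})

# Second left join on priority == value, with a "priority" prefix.
flight_alerts_data = joined_status.merge(
    priority_mapping.add_prefix("priority_"),
    how="left",
    left_on="priority",
    right_on="priority_value",
)
```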

Part 5: Add an output

Now that we have finished transforming and structuring our data, let’s add an output. For this tutorial, we will add a dataset output.

  1. In the Pipeline outputs sidebar to the right of the Pipeline Builder graph, name the output Flight Alerts data. Then click Add dataset output.

  2. Link Join (2) to the output by clicking on the white circle to the right of the join node and connecting it to the Flight Alerts data dataset.

  3. Click Use input schema to use the existing schema.

  4. From here, select the columns of data to keep. In our case, we will keep all columns.

    Screenshot of schema-filled dataset output pane

Part 6: Build the pipeline

To build your pipeline, make sure to click Save, then Deploy > Deploy pipeline.

Screenshot of schema-filled dataset output pane

You should see a small alert indicating the deployment was successful. Click View in the alert box to open the Build progress page.

Screenshot of build progress page

From this page, you can monitor the progress of your build until the dataset output is ready.

Screenshot of build progress page with succeeded status

You can now access your dataset by clicking Actions > Open.

Screenshot of dataset output

With this last step, we have generated our pipeline output. This output is a dataset that can be further explored in other apps in Foundry such as Contour or Code Workbook.