In this tutorial, we will use Pipeline Builder to create a simple pipeline with an output of a single dataset with information on flight alerts. We can then analyze this output dataset with tools like Contour or Code Workbook to answer questions such as which flight paths have the greatest risk of disruption.
The datasets used below are searchable by name in the dataset import step, and can be found in the Foundry Reference Project in your Foundry filesystem:
Foundry Training and Resources/Foundry Reference Project/Tutorial Reference Examples/Track: Data Engineering/Datasource Project: Flight Alerts/datasets.
At the end of this tutorial, you will have a pipeline that looks like the following:
The pipeline will produce a new dataset output of Flight Alerts Data, which can be used for further exploration.
First, we need to create a new pipeline.
When logged into Foundry, access Pipeline Builder from the left-side navigation bar under Apps. If it isn’t there, click View all and find Pipeline Builder under the Build & Monitor Pipelines section.
Next, on the top right of the Pipeline Builder landing page, create a new pipeline by clicking New pipeline. Select Batch pipeline.
The ability to create a streaming pipeline is not available on all Foundry environments. Contact your Palantir representative for more information if your use case requires it.
Now we can add datasets to our pipeline workflow. For this tutorial, we will use sample datasets of notional or open-source data, and all datasets should be available as part of the Foundry Reference Project in your Foundry filesystem.
From the Pipeline Builder page, click Add datasets from Foundry.
Alternatively, you can drag-and-drop a file from your computer to use as your dataset.
In our walkthrough example, we will add the passengers_preprocessed, flight_alerts_raw, and status_mapping_raw datasets. To add a dataset to your selection, select it and click the + icon inline, or click Add to Selection.
When all required datasets are selected, click Add datasets.
After adding raw datasets, we can perform some basic cleaning transforms to continue defining our pipeline. We will transform three of our raw datasets.
First, let’s clean the passengers_preprocessed dataset. We will start by setting up a cast transform to rename the dob column to dob_date while converting the values to the MM/dd/yy format.
Click on the passengers_preprocessed node in your graph.
Click Transform.
Search for and select the cast transform from the dropdown to open the cast configuration board.
From the Expression field, select dob, and for Type, select Date.
Enter MM/dd/yy for the Format type. Be sure to use uppercase MM to ensure a successful cast transform. Change the output column name to dob_date.
Your cast board should look like this:
Click Apply to add the transform to your pipeline.
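The cast configured above can be sketched in plain Python. This is only an illustration of the intended behavior, not the code Pipeline Builder runs, and the raw ISO date layout shown here is an assumption about the source data.

```python
from datetime import datetime

def cast_dob(dob):
    """Illustrative stand-in for the cast transform above.

    Assumes raw date-of-birth values arrive as ISO strings like
    '1985-03-07' (an assumption about the source data)."""
    parsed = datetime.strptime(dob, "%Y-%m-%d")
    # Re-emit in MM/dd/yy form. In the transform's date pattern,
    # uppercase MM means month while lowercase mm means minutes,
    # which is why the uppercase pattern matters in the cast board.
    return parsed.strftime("%m/%d/%y")

print(cast_dob("1985-03-07"))  # → 03/07/85
```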
Now we will format the flyer_status column values to start with an uppercase letter.
In the transform search field, search for and select the Title case transform to open the title case configuration board.
In the Expression field, select the flyer_status column from the dropdown.
Your title case board should look like this:
Click Apply to add the transform to your pipeline.
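As a rough sketch, the Title case transform behaves like Python's built-in str.title() on simple values; the sample flyer_status values here are hypothetical.

```python
def to_title_case(value):
    # Capitalize the first letter of each word, so a flyer_status
    # value such as 'gold' becomes 'Gold'.
    return value.title()

print(to_title_case("gold"))      # → Gold
print(to_title_case("platinum"))  # → Platinum
```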
In the upper left corner of the transform configuration window, rename the transform Passengers_Clean.
Click Back to graph at the top right to return to your pipeline graph.
Now, let’s clean the flight_alerts_raw dataset. First, we will set up another cast transform to convert the flight_date column values into a MM/dd/yy format.
Click the flight_alerts_raw dataset node in your graph.
Click Transform.
Search for and select the cast transform from the dropdown to open the cast configuration board. You can read the function definition listed on the right side of the selection box to learn more about the function.
In the Expression field, select the flight_date column from the dropdown.
Choose Date from the Type field dropdown.
Enter MM/dd/yy for the Format type. Be sure to use uppercase MM to ensure a successful cast transform.
Your cast board should look like this:
Click Apply to add the transform to your pipeline.
Now, we will add a Clean string transform that will remove whitespace from category column values. For example, the transform will convert delay··· string values to delay.
Search for and select the clean string transform from the dropdown to open the clean string configuration board.
In the Expression field, select the category column from the dropdown.
Check the boxes for all three of the Clean actions options:
Your clean string board should look like this:
Click Apply to add the transform to your pipeline.
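The clean string transform's three options can be approximated in Python as follows. This is a hedged sketch of the behavior described above, not Foundry's implementation.

```python
import re

def clean_string(value):
    # Trim whitespace at the beginning and end of the string.
    cleaned = value.strip()
    # Reduce sequences of multiple whitespace characters to one space.
    cleaned = re.sub(r"\s+", " ", cleaned)
    # Convert empty strings to null (None in Python).
    return cleaned if cleaned else None

print(clean_string("delay   "))  # → delay
print(clean_string("   "))       # → None
```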
In the upper left corner of the transform configuration window, rename the transform Flight Alerts - Clean.
Click Back to graph at the top right to return to your pipeline graph.
Finally, let’s clean the status_mapping_raw dataset.
We will only apply a Clean string transform to this dataset.
Click the status_mapping_raw dataset node in your graph.
Click Transform.
In the Search transforms and columns... field, select the mapped_value column from the dropdown.
In the same field, search and select the clean string transform from the dropdown.
Check the boxes for all three of the Clean actions options:
Convert empty strings to null
Reduce sequences of multiple whitespace characters to a single whitespace
Trim whitespace at the beginning and end of the string
Your clean string board should look like this:
Click Apply to add the transform to your pipeline.
In the upper left corner of the transform configuration window, rename the transform Status Mapping - Clean.
Click Back to graph at the top right to return to your pipeline graph.
You can see the connection between the transforms you just added and the datasets to which you applied them.
Now, let’s combine some of our cleaned datasets with joins. A join allows you to combine datasets with at least one matching column. We will add two joins to our pipeline workflow.
Our first join will combine two of our cleaned datasets.
Click on the Flight Alerts - Clean transform node. This will be the left side of our join.
Select Join.
Click the Status Mapping - Clean node to add it as the right side of the join.
Click Start to open the join configuration board.
Verify that the Join type is set to Left join.
Set the Match condition columns to status is equal to value.
Click Show advanced to view additional configuration options.
Set the Prefix of the right Status Mapping - Clean dataset to status.
Your join configuration board should look like this:
Click Apply to add the join to your pipeline.
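In spirit, the left join configured above behaves like the following Python sketch: every left-side row is kept, matching right-side rows contribute their columns under the configured prefix, and unmatched left rows simply get no right-side columns. The sample rows and the underscore separator for the prefix are assumptions for illustration.

```python
def left_join_with_prefix(left_rows, right_rows, left_key, right_key, prefix):
    # Index the right-side rows by their join key.
    lookup = {row[right_key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        merged = dict(row)
        # Left join: keep the left row even when there is no match.
        for col, val in lookup.get(row[left_key], {}).items():
            merged[f"{prefix}_{col}"] = val
        joined.append(merged)
    return joined

alerts = [{"flight_id": 1, "status": 0}, {"flight_id": 2, "status": 9}]
statuses = [{"value": 0, "mapped_value": "Delayed"}]
print(left_join_with_prefix(alerts, statuses, "status", "value", "status"))
```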
View a preview of the join output table in the Preview pane at the bottom of the configuration window.
In the upper left corner of the join configuration window, rename the join Join Status.
Click Back to graph at the top right to return to your pipeline graph.
To make the graph easier to read, click on the Layout icon to automatically arrange the datasets or manually drag the two connected datasets next to each other.
For our second join, we will combine our first join output table with another raw dataset.
Add the priority_mapping_raw dataset to the graph by clicking Add datasets.
Click on the Join Status node we just added to our graph. This will be the left side of our join.
Select Join.
Click on the priority_mapping_raw dataset node to add it as the right side of our join.
Click Start to open the configuration board.
Verify that the Join type is set to Left join.
Set the Match condition columns to priority is equal to value.
Click Show advanced to view additional configuration options.
Set the Prefix of the right priority_mapping_raw dataset to priority.
Your join configuration board should look like this:
Click Apply to add the join to your pipeline.
View a preview of the join output table in the Preview pane at the bottom of the configuration window.
In the upper left corner of the join configuration window, rename the join Join (2).
Click Back to graph at the top right to return to your pipeline graph.
You can now see the connection between the joins you just added and the datasets to which you applied them.
Now that we have finished transforming and structuring our data, let’s add an output. For this tutorial, we will add a dataset output.
In the Pipeline outputs sidebar to the right of the Pipeline Builder graph, name the output Flight Alerts data. Then click Add dataset output.
Link Join (2) to the output by clicking on the white circle to the right of the join node and connecting it to the Flight Alerts data dataset.
Click Use input schema to use the existing schema.
From here, select the columns of data to keep. In our case, we will keep all columns.
To build your pipeline, make sure to click Save, then Deploy > Deploy pipeline.
You should see a small alert indicating the deployment was successful. Click View in the alert box to open the Build progress page.
From this page, you can monitor the progress of your build until the dataset output is ready.
You can now access your dataset by clicking Actions > Open.
With this last step, we have generated our pipeline output. This output is a dataset that can be further explored in other apps in Foundry such as Contour or Code Workbook.