5 - Add a Preprocessing Pipeline

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

📖 Task Introduction

Some of your raw data values are not optimally formatted. In this exercise, you’ll use Pipeline Builder transforms to preprocess the data. Some of the anomalies you’ll want to fix early in your pipeline include (but are not limited to):

  • The flightDate column in flight_alerts_raw is currently a string type rather than a date.
  • The mapped value columns in both mapping datasets contain extra spaces, and the text is lowercase with words separated by underscores. You'd rather have "Open and Assigned" than the current value of "·······open_and_assigned".
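
To make the intended cleanup concrete, here is a minimal sketch in plain Python of the logic behind the two fixes above. It is an illustration only, not what Pipeline Builder runs: in the exercise you apply the equivalent transforms through the Pipeline Builder interface. The function names, the list of words kept lowercase, and the yyyy-MM-dd date format are all assumptions made for this example; check the actual format of flightDate in your data.

```python
from datetime import date, datetime

# Assumption: short joining words stay lowercase so the result reads
# "Open and Assigned" rather than "Open And Assigned".
MINOR_WORDS = {"and", "or", "of", "the"}


def clean_mapped_value(raw: str) -> str:
    """Trim surrounding whitespace, replace underscores, and capitalize words."""
    words = raw.strip().replace("_", " ").split()
    cleaned = []
    for index, word in enumerate(words):
        lowered = word.lower()
        if index > 0 and lowered in MINOR_WORDS:
            cleaned.append(lowered)
        else:
            cleaned.append(lowered.capitalize())
    return " ".join(cleaned)


def parse_flight_date(raw: str, fmt: str = "%Y-%m-%d") -> date:
    """Convert a date stored as a string into a real date value.

    The default format is hypothetical; pass the format your data uses.
    """
    return datetime.strptime(raw.strip(), fmt).date()


print(clean_mapped_value("       open_and_assigned"))  # Open and Assigned
print(parse_flight_date("2023-01-15"))                 # 2023-01-15
```

The order of operations matters for the text cleanup: trimming first removes the leading spaces, and replacing underscores before splitting ensures each word is capitalized separately.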

You’ll use the outputs from the “raw” pipeline in the previous exercise as inputs to this step. You cannot currently add a transform to a designated output in Pipeline Builder. Consequently, you’ll create a new Pipeline Builder artifact in your /preprocessed folder path and use the datasets you just generated in /raw as inputs.

🔨 Task Instructions

  1. Open your ../Datasource Project: Flight Alerts/datasets/preprocessed/ folder.
  2. Create a new batch pipeline and name it flight_alerts_datasource_preprocessed.
  3. Use the Add datasets button to import the three datasets in your ../raw folder.
  4. Consider applying a color and associated label to these datasets (e.g., “Raw”).