2. [Builder] Introduction to Data Transformations

1 - About this Course

This content is also available at learn.palantir.com ↗ and is presented here for accessibility purposes.

Context

The DATAENG learning path currently assumes a connection to an external source has already been made, and that source is providing you with a set of raw, “copied” datasets. For convenience, your Foundry environment comes prepared with these raw training datasets. In reality, integrating raw datasets via the data connection process and creating pre-processed and cleaned versions of those datasets for downstream use are all steps along the continuum of Foundry data engineering. For more details on the Data Connection process, consult the relevant product documentation.

Once your team has agreed on the datasets and transformation steps needed to achieve your outcome, it’s time to start developing your data assets. The Pipeline Builder application contains a fully integrated suite of tools that let you configure transformation logic and then build new data transformations as part of a production pipeline. There are several Foundry applications capable of transforming and outputting datasets (e.g., Code Repositories, Contour, Code Workbook, Preparation, Fusion), but for reasons we’ll explore throughout the learning path, production pipelines should only be built with Pipeline Builder or—if specialized code is needed—the Code Repositories application.

⚠️ Course prerequisites

  • DATAENG 01 Data Pipeline Foundations: If you have not completed the previous course in this track, do so now.

Outcomes

In the previous tutorial, you created a series of folders that implements a recommended pipeline project structure. You’ll now use the Pipeline Builder application to generate the initial datasets in your pipeline.

You’ll be starting with three raw datasets. The first contains data about flight alerts, including columns indicating the status and priority of each alert. In their raw form, these two columns contain numeric values only, which must be mapped to strings using the other two raw datasets that serve as mapping tables (e.g., a priority of “1” in dataset A must be converted to “High” using dataset B). You’ll use Pipeline Builder to normalize and format the data with some basic transforms. You’ll stop short of doing any mapping between the raw files—your first goal is simply to pre-process them for further cleaning and eventual joining downstream (in a subsequent tutorial).
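To preview the mapping-table concept you'll apply in a later tutorial: conceptually, it is a join between the alert data and a lookup table. Pipeline Builder expresses this through its point-and-click transform boards rather than code, but the following pandas sketch illustrates the idea. The column names (`priority`, `value`, `mapped_value`) and sample values are hypothetical, not taken from the training datasets.

```python
import pandas as pd

# Hypothetical stand-ins for the raw data: alerts with numeric
# priority codes (dataset A) and a mapping table (dataset B).
alerts = pd.DataFrame({
    "alert_id": [101, 102, 103],
    "priority": [1, 2, 1],
})
priority_map = pd.DataFrame({
    "value": [1, 2, 3],
    "mapped_value": ["High", "Medium", "Low"],
})

# Left join so every alert keeps a row even if its code is unmapped.
joined = alerts.merge(
    priority_map, left_on="priority", right_on="value", how="left"
)
result = joined[["alert_id", "mapped_value"]].rename(
    columns={"mapped_value": "priority"}
)
print(result["priority"].tolist())  # ['High', 'Medium', 'High']
```

The left join is the safe default here: an unmapped code surfaces as a null you can catch during cleaning, rather than silently dropping the alert row as an inner join would.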

In short, the inputs to this training are the simulated raw datasets from an upstream source, and the outputs will be “pre-processed” datasets formatted for further cleaning in the next tutorial.

🥅 Learning Objectives

  1. Start your pipeline in the Pipeline Builder application.
  2. Understand the importance of pre-processing and cleaning in data pipeline development.
  3. Gain additional practice transforming data in Pipeline Builder.

💪 Foundry Skills

  • Create a pipeline using Pipeline Builder.
  • Transform data with Pipeline Builder and generate output datasets.