2. [Repositories] Introduction to Data Transformations

1 - About this Course

This content is also available at learn.palantir.com ↗ and is presented here for accessibility purposes.

Context

Currently, the Data Engineering training track picks up where the data ingestion process ends—with a set of raw, “copied” datasets from a notional upstream source. Your Foundry environment comes prepared with these raw training datasets, which we’ll take as our starting point for convenience. In reality, landing raw datasets via the data connection process and creating pre-processed and cleaned versions of those datasets for downstream use are all steps along the continuum of Foundry data engineering. For more details on the Data Connection process, consult the relevant product documentation.

Once your team has agreed on the datasets and transformation steps needed to achieve your outcome, it’s time to start developing your data assets in a Foundry code repository. The Code Repositories application contains a fully integrated suite of tools that let you write, publish, and build data transformations as part of a production pipeline. There are several Foundry applications capable of transforming and outputting datasets (e.g., Contour, Code Workbook, Preparation, Fusion), but for reasons we’ll explore throughout the track, production pipelines should be built only with the Code Repositories application or the Pipeline Builder application. Note that you can also pursue this same tutorial via Pipeline Builder.
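To ground what a code-based transformation looks like before you write one, here is a minimal sketch of a Python transform as it might appear in a code repository. The dataset paths and function name below are placeholders for illustration, not the actual training datasets; the `transforms.api` decorator pattern is the shape you’ll work with in the exercises that follow.

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Training/Datasource Project/preprocessed/flight_alerts_preprocessed"),  # placeholder path
    raw_df=Input("/Training/raw/flight_alerts_raw"),  # placeholder path
)
def preprocess_flight_alerts(raw_df):
    # The decorated function receives each Input as a Spark DataFrame;
    # the DataFrame it returns is written to the declared Output dataset.
    return raw_df
```

Every transform in this course follows this basic shape: declare an output and one or more inputs, then return a transformed DataFrame.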

⚠️ Course prerequisites

  • DATAENG 01: If you have not completed the previous course in this track, do so now.
  • Necessary permissions to create Code Repositories. Please reach out to your program administrator or Palantir point of contact if you need authorization.
  • General familiarity with code-based data transformation: This course will provide PySpark code snippets, so PySpark-specific knowledge is not necessary, though a basic understanding of using code (for example, SQL, Java, Python, R) to transform data will provide a conceptual advantage.
  • General familiarity with source code management workflows in Git ↗ (branching and merging) is useful but not required.

Outcomes

In the previous tutorial, you created a series of folders that implement a recommended pipeline project structure. You’ll now use the Code Repositories application to generate the initial datasets in your pipeline.

For training convenience, you’ll begin by copying the starting raw datasets into the Datasource Project you constructed in the previous tutorial. You’ll be working with three raw datasets. The first contains data about flight alerts, including columns indicating the status and priority of each alert. In their raw form, these two columns contain only numeric values, which must be mapped to strings using the other two raw datasets, which serve as mapping tables (e.g., a priority of “1” in dataset A must be converted to “High” using dataset B).
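Conceptually, that mapping is just a join against a lookup dataset. As a preview only (the actual column names in the training datasets may differ; `priority`, `value`, and `mapped_value` below are illustrative), the eventual step looks something like this in PySpark:

```python
from pyspark.sql import functions as F


def map_priority(flight_alerts_df, priority_mapping_df):
    # Join the numeric priority code to its human-readable label
    # (e.g., 1 -> "High"), then keep the string column under the original name.
    return (
        flight_alerts_df
        .join(
            priority_mapping_df.select(
                F.col("value").alias("priority"),          # numeric code in the mapping table
                F.col("mapped_value").alias("priority_label"),  # string label, e.g., "High"
            ),
            on="priority",
            how="left",
        )
        .drop("priority")
        .withColumnRenamed("priority_label", "priority")
    )
```

You won’t perform this join in the current course; it’s shown only to clarify why the mapping tables exist.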

Then, you’ll use PySpark to normalize and format the data using some basic cleaning utilities. You’ll stop short of doing any mapping between the raw files—your first goal is simply to pre-process them for further cleaning and eventual joining downstream (in a subsequent tutorial). In short, the inputs to this training are the simulated raw datasets from an upstream source, and the outputs will be “pre-processed” datasets formatted for further cleaning in the next tutorial.
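The pre-processing itself is limited to formatting-style cleanup. Here is a minimal sketch of the kind of reusable utility you’ll write; the function name and the exact normalization rules (trimming and uppercasing string columns) are assumptions for illustration:

```python
from pyspark.sql import functions as F


def normalize_strings(df):
    # Standardize string columns: trim surrounding whitespace and uppercase
    # values so that later joins and filters behave predictably.
    string_cols = [
        field.name for field in df.schema.fields
        if field.dataType.simpleString() == "string"
    ]
    for col_name in string_cols:
        df = df.withColumn(col_name, F.upper(F.trim(F.col(col_name))))
    return df
```

Utilities like this are typically kept in a shared module within the repository so that every raw dataset is cleaned consistently.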

🥅 Learning Objectives

  1. Navigate the Code Repositories environment.
  2. Learn the basic anatomy of a data transform.
  3. Understand how code management works in a Foundry code repository.
  4. Practice writing PySpark data transformations.
  5. Understand the importance of pre-processing and cleaning in data pipeline development.
  6. Understand basic patterns for creating and configuring a Code Repository for transforming data.

💪 Foundry Skills

  • Bootstrap a Foundry Code Repository.
  • Create and implement reusable code utilities.
  • Implement branching and pipeline documentation best practices.