5C. [Repositories] Multiple Outputs with Data Transforms

1 - About this Course

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

After a Datasource Project has generated a set of clean outputs, the next stage in a pipeline — the Transform Project — prepares data to feed into the Ontology layer. These projects import the cleaned datasets from one or more Datasource Projects, join them with lookup datasets to expand values, normalize or de-normalize relationships to create object-centric or time-centric datasets, or aggregate data to create standard, shared metrics.
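For orientation, the sketch below shows what one such Transform Project step might look like: a single Python transform that aggregates a cleaned input into a shared metric. The dataset paths and the `category` column are illustrative only, not part of the course materials.

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Training/Transform Project/metrics/alerts_per_category"),       # illustrative path
    flight_alerts=Input("/Training/Datasource Project/clean/flight_alerts"),  # illustrative path
)
def compute(flight_alerts):
    # Aggregate the cleaned alerts into a shared metric: alert counts per category.
    # The "category" column is assumed for illustration.
    return flight_alerts.groupBy("category").count()
```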

Up to this point in the Data Engineering training track, you’ve authored code-based data transformations that output a single dataset. Foundry transform APIs provide at least two ways to generate multiple outputs in a single transform file. This is helpful in cases where you want to programmatically break inputs into distinct parts. In this tutorial, you’ll explore one of the available methods for outputting multiple datasets from a single transform as you take your pipeline into the Transform Project phase.
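As a preview, here is a minimal sketch of a multi-output transform using the `@transform` decorator from `transforms.api`, which accepts any number of `Output` arguments. The dataset paths, output names, and the `country` column are placeholders, not the values you will use in the exercises.

```python
from transforms.api import transform, Input, Output


@transform(
    source=Input("/Training/Transform Project/joined_alerts"),      # placeholder path
    us_alerts=Output("/Training/Transform Project/alerts_us"),      # placeholder path
    ca_alerts=Output("/Training/Transform Project/alerts_canada"),  # placeholder path
)
def compute(source, us_alerts, ca_alerts):
    df = source.dataframe()
    # Write a filtered slice of the joined data to each declared output.
    us_alerts.write_dataframe(df.filter(df["country"] == "United States"))
    ca_alerts.write_dataframe(df.filter(df["country"] == "Canada"))
```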

⚠️ Course Prerequisites

  • DATAENG 05b: If you have not completed the previous course in this track, do so now.

Outcomes

The exercises in this tutorial will take the clean outputs from your Datasource Project: Flight Alerts and Datasource Project: Passengers and process them further using a multi-output Python transform. You’ll first generate an intermediate transform that joins the flight alerts data with the passenger data. Then you’ll create a multi-output transform that produces individual datasets of alerts split by passenger country.
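For a rough idea of that first step, a sketch of the intermediate join transform might look like the following. The dataset paths and the `passenger_id` join key are assumptions for illustration; the exercises specify the actual values.

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Training/Transform Project/alerts_with_passengers"),             # placeholder path
    flight_alerts=Input("/Training/Datasource Project/clean/flight_alerts"),  # placeholder path
    passengers=Input("/Training/Datasource Project/clean/passengers"),        # placeholder path
)
def compute(flight_alerts, passengers):
    # Join each alert to its passenger record; the join key shown here is assumed
    # and will be defined in the exercise instructions.
    return flight_alerts.join(passengers, on="passenger_id", how="left")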

🥅 Learning Objectives

  1. Gain familiarity with the Transform Project stage of a production pipeline.
  2. Understand the difference between a multi-output and a generated transform, both of which can produce more than one dataset output from a single transform file (a sketch of the generated approach follows this list).

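In contrast to a multi-output transform, a generated transform uses ordinary Python, typically a loop, to manufacture one single-output transform per value. A rough sketch, with illustrative country values, paths, and column name:

```python
from transforms.api import transform_df, Input, Output

COUNTRIES = ["United States", "Canada", "France"]  # illustrative values


def generate_country_transforms():
    transforms = []
    for country in COUNTRIES:
        @transform_df(
            Output(f"/Training/Transform Project/alerts_{country.lower().replace(' ', '_')}"),
            source=Input("/Training/Transform Project/joined_alerts"),  # placeholder path
        )
        def split_by_country(source, country=country):
            # Bind the loop variable as a default argument so each generated
            # transform filters on its own country value.
            return source.filter(source["country"] == country)
        transforms.append(split_by_country)
    return transforms


TRANSFORMS = generate_country_transforms()
```

Generated transforms typically also need to be registered with the repository’s pipeline (for example, via `add_transforms` in the pipeline definition file) before they will build.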
💪 Foundry Skills

  • Create, schedule, and document the Transform Project portion of a production data pipeline.
  • Write a generated and multi-output Python transform.