This content is also available at learn.palantir.com ↗ and is presented here for accessibility purposes.
Currently, the Data Engineering training track picks up where the data ingestion process ends: with a set of raw, “copied” datasets from a notional upstream source. Your Foundry environment comes prepared with these raw training datasets, which we’ll take as our starting point for convenience. In reality, landing raw datasets via the data connection process and creating pre-processed and cleaned versions of those datasets for downstream use are all steps along the continuum of Foundry data engineering. For more details on the Data Connection process, consult the relevant product documentation.
Once your team has agreed on the datasets and transformation steps needed to achieve your outcome, it’s time to start developing your data assets in a Foundry code repository. The Code Repositories application contains a fully integrated suite of tools that let you write, publish, and build data transformations as part of a production pipeline. There are several Foundry applications capable of transforming and outputting datasets (e.g., Contour, Code Workbook, Preparation, Fusion), but for reasons we’ll explore throughout the track, production pipelines should only be built with either the Code Repositories application or the Pipeline Builder application. Note that you can also complete this same tutorial in Pipeline Builder.
In the previous tutorial, you created a series of folders that implements a recommended pipeline project structure. You’ll now use the Code Repositories application to generate the initial datasets in your pipeline.
For training convenience, you’ll begin by copying the starting raw datasets into the Datasource Project you constructed in the previous tutorial. You’ll be working with three raw datasets. The first contains data about flight alerts, including columns indicating the status of the alert and its priority. In their raw form, these two columns contain only numeric values, which must be mapped to strings using the other two raw datasets that serve as mapping tables (e.g., a priority of “1” in dataset A must be converted to “High” using dataset B).
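To make that mapping concrete: it amounts to a simple join against the lookup dataset. The sketch below is purely illustrative and assumes hypothetical DataFrame and column names (`flight_alerts_df`, `priority_mapping_df`, `priority`, `value`, `mapped_value`); the actual join is performed in a later tutorial.

```python
from pyspark.sql import functions as F

# Hypothetical names: the raw alerts carry a numeric "priority" code, and the
# mapping table pairs each code ("value") with a display string ("mapped_value").
alerts_with_priority_names = flight_alerts_df.join(
    priority_mapping_df.select(
        F.col("value").alias("priority"),
        F.col("mapped_value").alias("priority_name"),
    ),
    on="priority",
    how="left",
)
```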
Then, you’ll use PySpark to normalize and format the data using some basic cleaning utilities. You’ll stop short of doing any mapping between the raw files; your first goal is simply to pre-process them for further cleaning and eventual joining downstream (in a subsequent tutorial). In short, the inputs to this training are the simulated raw datasets from an upstream source, and the outputs will be “pre-processed” datasets formatted for further cleaning in the next tutorial.
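As a preview of what such a pre-processing transform might look like in Code Repositories, here is a minimal sketch built on the Python transforms API. The dataset paths, column names, and cleaning steps shown (normalizing column names, trimming whitespace, casting the numeric codes) are assumptions for illustration; the tutorial steps that follow specify the actual paths and logic.

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Training/Datasource Project/preprocessed/flight_alerts_preprocessed"),  # hypothetical output path
    source_df=Input("/Training/Datasource Project/raw/flight_alerts_raw"),           # hypothetical input path
)
def preprocess_flight_alerts(source_df):
    # Normalize column names so downstream transforms can rely on a consistent style.
    for name in source_df.columns:
        source_df = source_df.withColumnRenamed(name, name.strip().lower().replace(" ", "_"))

    # Trim stray whitespace and cast the numeric codes (hypothetical "status" and
    # "priority" columns) so they can be joined against the mapping tables later.
    return (
        source_df
        .withColumn("status", F.trim(F.col("status")).cast("int"))
        .withColumn("priority", F.trim(F.col("priority")).cast("int"))
    )
```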