This content is also available at learn.palantir.com and is presented here for accessibility purposes.
The computation engine behind data transformations in Foundry is Spark: an open-source, distributed cluster-computing framework for fast, large-scale data processing and analytics. Spark works most efficiently on a columnar file format called Parquet, and by default, Foundry transforms write their output datasets as a series of distributed Parquet files.
All else being equal, Spark computes datasets composed of Parquet files more efficiently than datasets in other formats. You may, however, want to process files in non-Parquet formats (such as XML or JSON). This tutorial reviews the essentials of reading and writing files in Foundry datasets using the @transform() decorator (versus the @transform_df decorator used in the previous tutorial).
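As a point of reference, the sketch below contrasts the two decorators. The dataset paths and function bodies are hypothetical; the structural difference is that @transform_df hands your function a ready-made PySpark DataFrame, while @transform() hands it TransformInput and TransformOutput objects whose filesystem() method exposes the files backing a dataset.

```python
from transforms.api import Input, Output, transform, transform_df


# Hypothetical dataset paths throughout, for illustration only.
@transform_df(
    Output("/Project/datasets/flight_alerts_clean"),
    alerts=Input("/Project/datasets/flight_alerts_raw"),
)
def clean_alerts(alerts):
    # @transform_df passes a PySpark DataFrame directly.
    return alerts.dropDuplicates()


@transform(
    raw=Input("/Project/datasets/passengers_raw"),
    listing=Output("/Project/datasets/passengers_listing"),
)
def inspect_raw_files(ctx, raw, listing):
    # @transform() passes TransformInput/TransformOutput objects instead,
    # so the files backing the dataset are reachable via .filesystem().
    files = raw.filesystem().ls()  # one FileStatus (path, size, modified) per file
    listing.write_dataframe(
        ctx.spark_session.createDataFrame(
            [(f.path, f.size) for f in files], ["path", "size"]
        )
    )
```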
The files needed for the next stage of your pipeline are in non-Parquet formats and must be accessed directly by your code for transformation.
Your data pipeline consists of clean flight alert data enriched with some mapping files, but there’s another data source you’d like to incorporate into the overarching project: passengers associated with these flight alerts. Your team may have decided, for example, that a workflow they’d like to enable downstream is the ability to assign travel vouchers based on flight delay/alert severity and customer status, and integrating passenger data into your pipeline is a necessary step toward creating the Ontology framework to support that interaction pattern.
The goal of this tutorial is to introduce another data transformation pattern: directly accessing and parsing CSV and JSON files in Foundry. Whether your non-Parquet data was uploaded in an ad hoc manner or originates in an external source, the methods in this course are an important part of a data engineer’s arsenal of transformation techniques.
In short, this tutorial will teach you how to use the @transform() decorator to access raw files in Foundry.
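To make that pattern concrete, here is a minimal sketch of such a transform, using only standard-library parsing. The dataset paths, function name, and glob patterns are hypothetical, and the JSON branch assumes each file holds a top-level array of records; treat this as an illustration under those assumptions rather than the tutorial’s exact solution.

```python
import csv
import json

from pyspark.sql import Row
from transforms.api import Input, Output, transform


@transform(
    raw=Input("/Project/datasets/passengers_raw"),          # hypothetical path
    parsed=Output("/Project/datasets/passengers_parsed"),   # hypothetical path
)
def parse_passenger_files(ctx, raw, parsed):
    fs = raw.filesystem()
    rows = []

    # CSV files: the stdlib csv module yields one dict per data row.
    for status in fs.ls(glob="*.csv"):
        with fs.open(status.path) as f:
            rows.extend(Row(**record) for record in csv.DictReader(f))

    # JSON files: assumes each file contains a top-level array of objects.
    for status in fs.ls(glob="*.json"):
        with fs.open(status.path) as f:
            rows.extend(Row(**record) for record in json.load(f))

    # Assemble the parsed records into a Spark DataFrame and write it out.
    parsed.write_dataframe(ctx.spark_session.createDataFrame(rows))
```

Parsing on the driver keeps the sketch short; for large inputs you would typically distribute the work, but the filesystem() access pattern stays the same.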