This content is also available at learn.palantir.com ↗ and is presented here for accessibility purposes.
Raw datasets are typically highly restricted because they often contain malformed or sensitive data unfit for downstream consumption. As you’ve learned in this training track, the chief output of a datasource project is a clean dataset that can serve multiple use cases, including as the next step in a production data pipeline. In the previous tutorial, you transformed raw JSON and CSV files into preprocessed “passenger” datasets contained in Datasource Project: Passengers. The next step is to generate a clean dataset output.
Your organization may have common data formats that would benefit from a standardized set of cleaning utilities that can be applied across transform use cases. Rather than inefficiently repeating the same cleaning utility code for each use, you can develop and publish Python code libraries to share across the enterprise.
Publishing and consuming shared Python code libraries across an organization is an important part of a Foundry data engineer’s toolkit. In the process of creating clean passenger data outputs from your datasource project (i.e., passengers_clean and passengers_flight_alerts_clean), you’ll also create a cleaning utility, publish it, and make use of it in another transform. Specifically, you’ll move the cleaning functions from Introduction to Data Transformation with Code Repositories into a shared library and reference them in both of your datasource repositories. After cleaning the passenger data, you’ll create an output passenger dataset that unions the JSON and CSV pipelines together.
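To make the idea of a shared cleaning utility concrete, here is a minimal sketch in plain Python. It is not the tutorial's library code: the function names (normalize_string, clean_rows, union_by_name), the column names, and the sample values are all hypothetical, and lists of dictionaries stand in for the datasets that a Foundry transform would normally handle as Spark DataFrames. The point is only the shape of the workflow: one set of cleaning functions is applied to both the JSON-derived and CSV-derived rows, and the cleaned results are then unioned by column name.

```python
# Hypothetical shared cleaning helpers (illustrative names, not the tutorial's code).

def normalize_string(value):
    """Trim, collapse internal whitespace, and lowercase; empty strings become None."""
    if value is None:
        return None
    cleaned = " ".join(str(value).split()).lower()
    return cleaned or None


def clean_rows(rows, string_columns):
    """Return new rows with the named string columns normalized."""
    return [
        {key: normalize_string(val) if key in string_columns else val
         for key, val in row.items()}
        for row in rows
    ]


def union_by_name(*row_sets):
    """Union row sets by column name; columns missing from a source are filled with None."""
    columns = []
    for rows in row_sets:
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)
    return [{col: row.get(col) for col in columns}
            for rows in row_sets for row in rows]


# Made-up stand-ins for the preprocessed JSON and CSV passenger data.
json_rows = [{"passenger_id": 1, "country": "  United  States "}]
csv_rows = [{"passenger_id": 2, "country": "CANADA", "alert_status": ""}]

string_cols = {"country", "alert_status"}
passengers = union_by_name(
    clean_rows(json_rows, string_cols),
    clean_rows(csv_rows, string_cols),
)
```

Because both pipelines call the same helpers, a change to the cleaning rules only has to be made once, which is the motivation for publishing them as a shared library rather than copying them into each repository.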