5B. [Repositories] Publishing and Using Shared Libraries in Code Repositories

1 - About this course

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

Raw datasets are typically highly restricted because they often contain malformed or sensitive data unfit for downstream consumption. As you’ve learned in this training track, the chief output of a datasource project is a clean dataset that can serve multiple use cases, including as the next step in a production data pipeline. In the previous tutorial, you transformed raw JSON and CSV files into preprocessed “passenger” datasets contained in Datasource Project: Passengers. The next step is to generate a clean dataset output.

Your organization may have common data formats that would benefit from a standardized set of cleaning utilities that can be applied across transform use cases. Rather than inefficiently repeating the same cleaning utility code for each use, you can develop and publish Python code libraries to share across the enterprise.
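As a rough illustration of what such a cleaning utility might look like, the sketch below defines a few small helpers. The function names (normalize_column_name, clean_string, clean_record) are hypothetical and are not taken from the course materials. In Foundry, logic like this would typically be applied to the columns of a Spark DataFrame inside a transform; the sketch uses plain Python strings and dicts so that it runs without Foundry or Spark.

```python
import re


def normalize_column_name(name: str) -> str:
    """Convert an arbitrary column name to lower snake_case (hypothetical helper)."""
    # Replace every run of non-alphanumeric characters with a single underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop underscores left at either end, then lowercase the result.
    return cleaned.strip("_").lower()


def clean_string(value):
    """Trim whitespace and map empty strings to None; other types pass through."""
    if not isinstance(value, str):
        return value
    trimmed = value.strip()
    return trimmed if trimmed else None


def clean_record(record: dict) -> dict:
    """Apply both helpers to one row represented as a dict (illustrative only)."""
    return {normalize_column_name(k): clean_string(v) for k, v in record.items()}


if __name__ == "__main__":
    # Hypothetical sample row; the field names are assumptions for illustration.
    raw_row = {" Passenger Name ": "  Ada ", "Flight-ID#": "XB-112", "Country": ""}
    print(clean_record(raw_row))
```

Keeping helpers like these in one module is what makes them worth publishing: each consuming repository imports the same functions instead of maintaining its own copy.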

⚠️ Course Prerequisites

  • DATAENG 05a: Working with Raw Files in Code Repositories: If you have not completed the previous course, do so now.

Outcomes

Publishing and consuming shared Python code libraries across an organization is an important part of a Foundry data engineer’s toolkit. In the process of creating clean passenger data outputs from your datasource project (i.e., passengers_clean and passengers_flight_alerts_clean), you’ll also create a cleaning utility, publish it, and make use of it in another transform. Specifically, you’ll move the cleaning functions from Introduction to Data Transformation with Code Repositories into a shared library and reference them in both of your datasource repositories. After cleaning the passenger data, you’ll create an output passenger dataset that unions the JSON and CSV pipelines together.

🥅 Learning Objectives

  1. Understand how Foundry generally makes packages available.
  2. Know how to write, publish, and use a Python library.
  3. Gain additional practice generating clean dataset outputs from a datasource project.

💪 Foundry Skills

  • Write a cleaning utility function.
  • Publish your cleaning utility as a shared Python library.
  • Implement a shared library in another code repository.