3. [Repositories] Creating a Project Output

1 - About this Course

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

Context

Always be documenting. Foundry applications and project structures that support data pipelines provide ample opportunities for you to let your current and future team know the relevant facts about your data transformations. Now that you have preprocessed your data, it’s time to clean it and prepare it for use downstream. This means airtight transform syntax as much as it means documenting the scope and logic every step of the way.

⚠️ Course prerequisites

  • DATAENG 02: If you have not completed the previous course in this track, do so now.
  • Necessary permissions to create Code Repositories. Please reach out to your program administrator or Palantir point of contact if you need authorization.
  • General familiarity with code-based data transformation: This course will provide PySpark code snippets, so PySpark-specific knowledge is not necessary, though a basic understanding of the use of code (e.g., SQL, Java, Python, R) to transform data will provide a conceptual advantage.
  • General familiarity with source code management workflows in Git (branching and merging) is useful but not required.

Outcomes

In this tutorial, you’ll engineer a “clean” output for your project to be consumed by downstream pipelines and use cases. The code you’ll be implementing makes use of common PySpark features for transforming data inputs, and a significant portion of the tutorial will require you to explore selected documentation entries that expound on PySpark best practices. As a reminder, however, teaching PySpark syntax patterns is outside the scope of this course.

🥅 Learning Objectives

  1. Understand the distinction between preprocessing and cleaning.
  2. Document the datasource stage of your pipeline.

💪 Foundry Skills

  • Create a multi-input transform file.
  • Use Contour to validate a proposed data transform.
  • Generate a Data Lineage graph as documentation for the Datasource project segment of your production pipeline.
  • Generate a README file in your code repository.