
15 - Key Takeaways

This content is also available at learn.palantir.com ↗ and is presented here for accessibility purposes.

Always be documenting. Whether visually with the Data Lineage application, textually with README files or other text-based resources in your project, or inline with code comments and commit messages, rigorously and thoroughly explaining your pipeline logic and dependencies will speed troubleshooting and prevent maintenance headaches.

In this tutorial you:

  1. Introduced a cleaning step in your pipeline that used PySpark to join your preprocessed files into a usable output.
  2. Verified data quality in Contour before enacting a proposed transform and saved your analysis in your Datasource project.
  3. Saved a Data Lineage representation of the datasource stage of your pipeline.
  4. Documented your pipeline using a README file in your code repository.
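The cleaning step in item 1 joins your preprocessed datasets with PySpark. As a rough illustration of the inner-join semantics that step relies on, the same logic can be sketched in plain Python (the dataset names and join key below are hypothetical placeholders, not taken from the tutorial):

```python
# Plain-Python sketch of the inner join a PySpark cleaning step performs.
# In Foundry the real transform operates on PySpark DataFrames; the dataset
# names and the "airport_id" join key here are illustrative only.

def inner_join(left, right, key):
    """Join two lists of dicts on `key`, mimicking
    left_df.join(right_df, on=key, how="inner")."""
    # Index the right-hand rows by join key (a simple hash join).
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            # Left-hand values win when the two sides share a column name.
            joined.append({**match, **row})
    return joined

# Hypothetical preprocessed inputs standing in for the tutorial's datasets.
flights = [{"airport_id": 1, "delay": 12}, {"airport_id": 2, "delay": 3}]
airports = [{"airport_id": 1, "name": "LGA"}, {"airport_id": 3, "name": "SFO"}]

clean = inner_join(flights, airports, "airport_id")
# Only airport_id 1 appears in both inputs, so the join keeps one row.
```

In the actual repository, the equivalent PySpark call would be along the lines of `flights_df.join(airports_df, on="airport_id", how="inner")`, with rows that lack a match on either side dropped from the output.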


Now that you’ve created a multi-node flow from raw to clean, you’ll work on generating a schedule to automatically run the transforms in sequence. Skillfully scheduling pipelines is an important part of pipeline monitoring, and in the next tutorial, you’ll use the Scheduler application in a recommended configuration and produce written documentation about your pipeline to facilitate troubleshooting and maintenance activities.