This content is also available at learn.palantir.com and is presented here for accessibility purposes.
Always be documenting. Whether you work visually with the Data Lineage application, textually with README files or other text-based resources in your project, or through in-line code comments and commit messages, rigorously and thoroughly explaining your pipeline logic and dependencies promotes rapid troubleshooting and prevents maintenance headaches.
In this tutorial you:
Introduced a cleaning step in your pipeline that used PySpark to join your preprocessed files into a usable output (a sketch of this kind of transform appears after this list).
Verified data quality in Contour before enacting a proposed transform and saved your analysis in your Datasource project.
Saved a Data Lineage representation of the datasource stage of your pipeline.
Documented your pipeline using a README file in your code repository.
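To make the cleaning step above concrete, here is a minimal sketch of a PySpark transform that joins two preprocessed datasets into a single usable output. It assumes the standard transforms.api decorators available in Foundry code repositories; the dataset paths, input parameter names, and the join key (join_key) are placeholders rather than the exact values from your project.

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Your Project/datasets/clean/joined_output"),  # placeholder output path
    preprocessed=Input("/Your Project/datasets/preprocessed/primary_preprocessed"),  # placeholder
    mapping=Input("/Your Project/datasets/preprocessed/mapping_preprocessed"),  # placeholder
)
def compute(preprocessed, mapping):
    """Cleaning step: join the preprocessed inputs into a single usable output.

    Documenting the intent of the join here (and in the repository README)
    makes the pipeline's logic and dependencies easier to troubleshoot later.
    """
    # A left join keeps every row from the primary input, so unmatched keys
    # surface as nulls that are easy to spot in a Contour data quality check.
    return preprocessed.join(mapping, on="join_key", how="left")
```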
Now that you’ve created a multi-node flow from raw to clean, you’ll work on generating a schedule to automatically run the transforms in sequence. Skillfully scheduling pipelines is an important part of pipeline monitoring, and in the next tutorial, you’ll use the Scheduler application in a recommended configuration and produce written documentation about your pipeline to facilitate troubleshooting and maintenance activities.