This content is also available at learn.palantir.com ↗ and is presented here for accessibility purposes.
When a data pipeline needs to run at a particular cadence without manual intervention, you should configure a schedule to build it automatically.
In addition to clean code and good documentation, a reliable production data pipeline needs automated scheduling logic. It’s generally not prudent, however, to simply brute-force the builds in your pipeline sequentially. Because data transformations consume Spark compute, builds should be planned deliberately to avoid unnecessary and expensive resource consumption. As you’ll see in this tutorial, the pipeline scheduler interface makes it easy to define your schedule’s inputs and outputs (“what” will build) and its execution conditions (“when” it should build).
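To make the “what” versus “when” distinction concrete, here is a minimal conceptual sketch in plain Python. It is not the Scheduler application’s actual configuration format (which you define through the Foundry interface); the dataset paths, field names, and trigger options are illustrative assumptions only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PipelineSchedule:
    """Conceptual model of a schedule: 'what' builds and 'when' it builds."""
    # "What" will build: the target datasets and the inputs they depend on.
    target_datasets: List[str] = field(default_factory=list)
    input_datasets: List[str] = field(default_factory=list)
    # "When" it should build: a time-based trigger (cron) and/or an
    # event-based trigger (rebuild whenever any input dataset updates).
    cron_expression: Optional[str] = None
    build_on_input_update: bool = False

# Hypothetical dataset paths, for illustration only.
nightly = PipelineSchedule(
    target_datasets=["/Project/clean/flight_alerts_clean"],
    input_datasets=["/Project/raw/flight_alerts_raw"],
    cron_expression="0 5 * * *",   # build every day at 05:00
)

event_driven = PipelineSchedule(
    target_datasets=["/Project/clean/flight_alerts_clean"],
    input_datasets=["/Project/raw/flight_alerts_raw"],
    build_on_input_update=True,    # build whenever an input updates
)
```

The design point the sketch illustrates is that the build targets and the trigger conditions are independent choices: an event-based trigger keeps outputs fresh with minimal latency, while a time-based trigger caps how often Spark compute is spent.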
Once your pipeline (or a stage in your pipeline) is built and scheduled (and has monitoring applied, which is addressed in a later tutorial), we strongly recommend documenting the execution logic and other key pipeline features in your project.
Prerequisite: If you have not completed the previous course in this track (DATAENG 03), do so now.
This tutorial conveys the basics and best practices for creating automated data pipeline schedules. Foundry’s Scheduler application abstracts pipeline configuration into an intuitive interface that lets you set the execution conditions needed to maintain your data freshness SLAs while minimizing the risk of wasted Spark compute.
After scheduling your pipeline, you’ll have the opportunity to add documentation about the features of the Datasource Project stage. Descriptive documentation about pipeline logic, SLAs, maintenance procedures, and troubleshooting history is a valuable prophylactic against entropy and sets your project up for long-term maintainability.