
2 - Intro to Data Pipelines

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

A data pipeline is a flow of data from one or more source systems through intermediate datasets, ultimately producing high-quality, curated datasets that can be structured into the Ontology or serve as the foundation for machine learning and analytical workflows.

In this exercise, we’ll review the basic stages of the pipeline development process. A data pipeline lifecycle typically involves the following distinct activities:

  1. Agree on the desired output(s)
  2. Determine the source data needed to support the output(s)
  3. Define the pipeline scope and service level agreements (SLAs)
  4. Map the pipeline stages and create the associated project structure
  5. Test, build, and optimize your transforms (see the sketch after this list)
  6. Apply schedule and dataset health checks
  7. Create a pipeline schedule
  8. Maintain your pipeline
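
For step 5, pipeline logic in Foundry is typically written as Python transforms in a Code Repository. The minimal sketch below shows the general shape of such a transform: it reads one upstream dataset, normalizes column names, and filters out rows missing a key. The dataset paths, the `alert_id` column, and the function name are hypothetical placeholders, not part of this course's project.

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


# Dataset paths below are hypothetical placeholders.
@transform_df(
    Output("/Course/Pipeline/clean/flight_alerts_clean"),
    raw=Input("/Course/Pipeline/raw/flight_alerts_raw"),
)
def clean_flight_alerts(raw):
    # Lowercase and trim column names for consistency downstream.
    renamed = raw.select(
        [F.col(c).alias(c.strip().lower()) for c in raw.columns]
    )
    # Drop rows that are missing the primary key.
    return renamed.filter(F.col("alert_id").isNotNull())
```

Transforms like this one are the building blocks of a pipeline: each reads upstream datasets and writes a single output, and chaining them produces the intermediate and curated datasets described above. Health checks and schedules (steps 6 and 7) are then attached to the resulting datasets and builds.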