6 - Pipeline Stages

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

When preparing a new pipeline or reorganizing an existing one, consider configuring Projects so that their boundaries align with discrete pipeline stages.

📚 Recommended Reading (~15 min read)

Consider the project structure suggested here a starting point for pipeline implementation, though you may ultimately choose to diverge from it.

What follows is a brief summary of the inputs, outputs, and characteristics of each pipeline stage (the documentation linked above contains additional guidance).

Datasource project

Input = raw data from Data Connection
Output = cleaned version of this data
A datasource pipeline maps to a distinct source
There is no joining from other data sources at this stage
A datasource project only outputs datasets
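As a rough illustration of the datasource stage, the sketch below cleans rows from a single raw source in plain Python, with no joins to other sources. The function `clean_customers` and the field names (`customer_id`, `name`, `signup_date`) are hypothetical; an actual pipeline would use the platform's own transform tooling rather than lists of dictionaries.

```python
from datetime import date

def clean_customers(raw_rows):
    """Clean rows from one raw source; nothing from other sources is joined here."""
    cleaned = []
    for row in raw_rows:
        customer_id = (row.get("customer_id") or "").strip()
        if not customer_id:
            continue  # drop rows that have no usable key
        cleaned.append({
            "customer_id": customer_id,
            "name": (row.get("name") or "").strip().title(),
            # parse the ISO date string into a proper date value
            "signup_date": date.fromisoformat(row["signup_date"]),
        })
    return cleaned

# Hypothetical raw rows, standing in for data landed from one source
raw = [
    {"customer_id": " c1 ", "name": "ada LOVELACE", "signup_date": "2021-03-04"},
    {"customer_id": "", "name": "no key", "signup_date": "2021-01-01"},
]
cleaned = clean_customers(raw)
```

The output is a single cleaned dataset, which is the only kind of artifact this stage produces.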

Transform project

Input = output data from datasource projects/pipelines
Output = canonical view of data to feed into the ontology layer
A transform project may use inputs from multiple data sources
A transform project only outputs datasets
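To make the contrast with the datasource stage concrete, here is a minimal sketch of a transform-stage step that combines cleaned outputs from two different sources into one canonical view. The function `build_canonical_orders` and all field names are assumptions made for illustration, not part of any real API.

```python
def build_canonical_orders(cleaned_orders, cleaned_customers):
    """Join two cleaned datasource outputs into one canonical dataset."""
    customers_by_id = {c["customer_id"]: c for c in cleaned_customers}
    canonical = []
    for order in cleaned_orders:
        customer = customers_by_id.get(order["customer_id"])
        canonical.append({
            "order_id": order["order_id"],
            "customer_id": order["customer_id"],
            # left-join behavior: keep the order even if no customer matches
            "customer_name": customer["name"] if customer else None,
            "amount": order["amount"],
        })
    return canonical

# Hypothetical cleaned outputs from two separate datasource pipelines
orders = [
    {"order_id": "o1", "customer_id": "c1", "amount": 25.0},
    {"order_id": "o2", "customer_id": "c9", "amount": 10.0},
]
customers = [{"customer_id": "c1", "name": "Ada Lovelace"}]
canonical_orders = build_canonical_orders(orders, customers)
```

Joining is allowed here precisely because the inputs have already been cleaned, one source at a time, upstream.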

Ontology project

Input = output data from transform projects/pipelines
Output = canonical datasets that conform to the definition of a single object, or a group of related objects, defined in the Ontology
The output data assets in this project are synchronized to the Ontology

Workflow project (not addressed in this track)

Input = output from Ontology project
Output = collection of artifacts designed to solve a specific business use case
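The input/output relationships summarized above can be read as a simple dependency rule: each stage consumes the output of the stage before it. The sketch below encodes a strict reading of that rule. The stage labels, the `data_connection` placeholder, and the `invalid_inputs` helper are all hypothetical names chosen for this example, and a real project structure may reasonably diverge from it.

```python
# Upstream stage each project type reads from, per the summary above.
# "data_connection" is a stand-in label for raw data landed by Data Connection.
ALLOWED_UPSTREAM = {
    "datasource": {"data_connection"},
    "transform": {"datasource"},
    "ontology": {"transform"},
    "workflow": {"ontology"},
}

def invalid_inputs(stage, input_stages):
    """Return the input stages that the summary does not list for `stage`."""
    return sorted(set(input_stages) - ALLOWED_UPSTREAM[stage])

# A transform project fed by two datasource projects matches the summary
ok = invalid_inputs("transform", ["datasource", "datasource"])

# An ontology project reading directly from a datasource project skips a stage
skipped = invalid_inputs("ontology", ["datasource", "transform"])
```

Here `ok` is an empty list, while `skipped` contains `"datasource"`, flagging the input that bypasses the transform stage.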

The DATAENG learning path assumes that data has already been connected from an upstream source, and it stops short of generating use case artifacts (these are covered in other learning paths). You'll therefore be working exclusively with Datasource, Transform, and Ontology projects.