Data Pipeline Foundations

6 - Pipeline Stages

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

When preparing a new pipeline or re-organizing an existing one, consider organizing Project boundaries around discrete pipeline stages.

Consider the project structure suggested here a starting point for pipeline implementation, though you may ultimately choose to diverge from it.

What follows is a brief summary of the inputs, outputs, and characteristics of each pipeline stage (the documentation linked above contains additional guidance).

Datasource project

  • Input = raw data from Data Connection
  • Output = cleaned version of this data
  • Each datasource pipeline maps to a single, distinct source
  • No joins with other data sources occur at this stage
  • A datasource project only outputs datasets
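To make the datasource stage concrete, the sketch below shows a cleaning step over raw rows from one source: it normalizes strings and casts types, and deliberately joins nothing else. This is a plain-Python illustration, not the Foundry transforms API, and the record fields (`alert_id`, `status`, `priority`) are hypothetical.

```python
def clean_source_rows(raw_rows):
    """Datasource stage: clean raw rows from a single source.

    Trims and lowercases strings and casts numeric fields. No joins
    to other data sources happen here; that belongs to a later stage.
    """
    cleaned = []
    for row in raw_rows:
        cleaned.append({
            "alert_id": int(row["alert_id"]),          # cast string id to int
            "status": row["status"].strip().lower(),   # normalize casing/whitespace
            "priority": int(row["priority"]),
        })
    return cleaned

# Example: one raw row as it might arrive from Data Connection.
raw = [{"alert_id": "1", "status": " OPEN ", "priority": "3"}]
cleaned = clean_source_rows(raw)
```

The key property to notice is that the function's only input is the raw data from one source, mirroring the one-source-per-datasource-project rule above.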

Transform project

  • Input = output data from Datasource projects/pipelines
  • Output = canonical view of data to feed into the ontology layer
  • A transform project may use inputs from multiple data sources
  • A transform project only outputs datasets
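The transform stage is where joining across sources becomes appropriate. The sketch below combines cleaned outputs from two hypothetical datasource pipelines into one canonical view; again this is plain Python standing in for a Foundry transform, and the schemas are invented for illustration.

```python
def build_canonical_view(alerts, airports):
    """Transform stage: join cleaned datasets from multiple datasource
    pipelines into a single canonical view of the data.

    `alerts` and `airports` are outputs of upstream datasource
    projects; the join key (`airport_code` / `code`) is hypothetical.
    """
    airport_by_code = {a["code"]: a for a in airports}
    view = []
    for alert in alerts:
        airport = airport_by_code.get(alert["airport_code"], {})
        # Enrich each alert with attributes from the second source.
        view.append({**alert, "airport_name": airport.get("name")})
    return view

alerts = [{"alert_id": 1, "status": "open", "airport_code": "JFK"}]
airports = [{"code": "JFK", "name": "John F. Kennedy International"}]
canonical = build_canonical_view(alerts, airports)
```

Unlike the datasource stage, this function takes multiple upstream datasets as input, and its output is meant to feed the ontology layer rather than be synced directly.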

Ontology project

  • Input = output data from transform projects/pipelines
  • Output = canonical datasets that conform to the definition of a single object type, or a related group of object types, defined in the Ontology
  • The output data assets in this project are synchronized to the Ontology
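One way to picture the ontology stage is as projecting the canonical view down to exactly the fields an object definition declares, so the backing dataset and the Ontology stay in lockstep. The sketch below does this with a hypothetical, hand-rolled schema dictionary; it is an illustration of the idea, not how Foundry represents object types internally.

```python
# Hypothetical object definition: a primary key plus declared properties.
OBJECT_SCHEMA = {
    "primary_key": "alert_id",
    "properties": ["status", "airport_name"],
}

def conform_to_object_type(rows, schema):
    """Ontology stage: keep only the primary key and declared
    properties so the output dataset matches the object definition
    it will be synchronized to."""
    keep = [schema["primary_key"]] + schema["properties"]
    return [{k: row[k] for k in keep} for row in rows]

canonical = [{"alert_id": 1, "status": "open",
              "airport_name": "John F. Kennedy International",
              "ingest_batch": "2024-01-01"}]  # extra column dropped below
ontology_ready = conform_to_object_type(canonical, OBJECT_SCHEMA)
```

Columns not declared on the object type (here, `ingest_batch`) are dropped, which is the kind of schema discipline that keeps the synced dataset a clean backing store for the Ontology.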

Workflow project (not addressed in this track)

  • Input = output from Ontology project
  • Output = collection of artifacts designed to solve a specific business use case

The DATAENG learning path assumes data has already been connected from an upstream source and stops short of generating use-case artifacts (covered in other learning paths). You'll therefore work exclusively with Datasource, Transform, and Ontology projects.