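There are three main types of pipelines you can create in Foundry, and each provides different tradeoffs according to four criteria: latency, complexity, compute cost, and resilience to change in data scale.
The three types of pipelines are batch, incremental, and streaming.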
Below, we discuss each type of pipeline, its tradeoffs, and how to get started authoring this type of pipeline. For convenience, here is a summary table of the types of pipelines according to the tradeoffs mentioned above.
| Criterion | Batch | Incremental | Streaming |
|---|---|---|---|
| Latency | High | Low | Very low |
| Complexity | Low | Medium | High |
| Compute cost | Medium | Low | High |
| Resilience to change in data scale | Low | High | High |
In a batch pipeline, all datasets in the pipeline are fully recomputed whenever upstream data changes. Because everything is recomputed, the end-to-end performance of the pipeline is consistent over time, and the code and maintenance complexity of the pipeline is minimal. To enable more users to contribute to batch pipelines, a broad set of languages and tools is available for batch pipeline authoring, including SQL.
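The full-recompute behavior described above can be illustrated with a minimal plain-Python sketch. This is not Foundry's transforms API; the function `batch_build` and the `(user, amount)` row shape are hypothetical names chosen only for illustration.

```python
from collections import defaultdict


def batch_build(events):
    """Recompute the full output from every input row (hypothetical helper)."""
    totals = defaultdict(int)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)


events = [("alice", 5), ("bob", 3)]
first_build = batch_build(events)

# Upstream data changes: one new row arrives, yet every row is read again.
events.append(("alice", 2))
second_build = batch_build(events)
```

Because each build starts from scratch, the logic stays simple, but the work done per build grows with the total size of the input rather than with the size of the change.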
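Examining batch pipelines according to the criteria above: latency is high, complexity is low, compute cost is medium, and resilience to change in data scale is low.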
In most cases, you should begin pipeline development in Foundry by creating a batch pipeline, then extend it to support incremental computation once the use case for the pipeline is validated. In many cases, you can keep using a batch pipeline indefinitely, especially if your data scale is low (e.g., fewer than tens of millions of rows).
If you expect that you will need to make your pipeline incremental in the future, we recommend using either Python or Java for batch pipeline development, as these languages support incremental computation.
Get started by learning how to create a batch pipeline in Pipeline Builder, or by following the tutorials for other languages:
In an incremental pipeline, only the rows or files of data that have changed since the last build are processed. This is suitable for event data and other datasets where large amounts of data change over time. In addition to reducing the overall amount of computation, the end-to-end latency of the pipeline can be reduced significantly compared to batch pipelines. Only Python and Java APIs are available for incremental computation.
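As a rough illustration of the idea, the sketch below folds only newly appended rows into the previous output. It is plain Python, not Foundry's incremental API, and it assumes an append-only input; `incremental_build`, `rows_already_processed`, and the `(user, amount)` row shape are hypothetical.

```python
def incremental_build(previous_totals, all_events, rows_already_processed):
    """Fold only rows appended since the last build into the prior output.

    Assumes the input is append-only (an assumption of this sketch).
    Returns the new totals, the new processed-row count, and the number
    of rows actually touched in this build.
    """
    new_rows = all_events[rows_already_processed:]
    totals = dict(previous_totals)
    for user, amount in new_rows:
        totals[user] = totals.get(user, 0) + amount
    return totals, len(all_events), len(new_rows)


events = [("alice", 5), ("bob", 3)]
totals, processed, touched = incremental_build({}, events, 0)

# One new row arrives; only that row is processed in the next build.
events.append(("alice", 2))
totals, processed, touched = incremental_build(totals, events, processed)
```

The extra bookkeeping (the prior output and the processed-row count) is where the added complexity of incremental pipelines comes from, in exchange for per-build work proportional to the change rather than to the whole dataset.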
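Examining incremental pipelines according to the criteria above: latency is low, complexity is medium, compute cost is low, and resilience to change in data scale is high.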
To learn more about incremental pipelines, refer to these resources:
In a streaming pipeline, your code runs continuously to process any new data that streams into Foundry, enabling the lowest latency but incurring the highest complexity and compute cost. In general, it is helpful to think of running a streaming pipeline as closer to managing a microservice than managing a compute job: you need to be very thoughtful about uptime, resiliency, and stateful operations in order to run a streaming pipeline successfully.
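The continuously running, stateful nature of streaming can be sketched in plain Python with a worker that consumes records from a queue as they arrive. This is only a conceptual illustration, not Foundry's streaming API; `run_stream`, the in-memory `sink` list, and the `None` stop marker are assumptions made so the example terminates.

```python
import queue
import threading


def run_stream(source, sink, stop_marker=None):
    """Process records one at a time as they arrive, keeping running state.

    A real streaming job would run indefinitely; the stop marker exists
    only so this sketch can finish.
    """
    running_totals = {}  # in-memory state that must survive between records
    while True:
        record = source.get()  # blocks until a record is available
        if record is stop_marker:
            break
        user, amount = record
        running_totals[user] = running_totals.get(user, 0) + amount
        sink.append((user, running_totals[user]))


source = queue.Queue()
sink = []
worker = threading.Thread(target=run_stream, args=(source, sink))
worker.start()

for record in [("alice", 5), ("bob", 3), ("alice", 2)]:
    source.put(record)
source.put(None)  # signal the worker to stop
worker.join()
```

Each record produces an updated result immediately, but the process must stay up and its state is lost if the worker dies, which is why uptime and resiliency matter far more here than in a batch or incremental build.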
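Examining streaming pipelines according to the criteria above: latency is very low, complexity is high, compute cost is high, and resilience to change in data scale is high.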
In most cases, it is best to avoid creating a streaming pipeline unless your use case has very low latency requirements. Incremental pipelines can often be made performant down to minute-level end-to-end latencies to meet most needs without incurring the added complexity and compute costs of streaming pipelines.
To learn more about streaming pipelines, refer to the following resources:
Additional documentation for streaming pipelines will be available soon. If you are interested in building a streaming pipeline, contact your Palantir representative.