Types of pipelines

There are three main types of pipelines you can create in Foundry, and each provides different tradeoffs according to a few criteria:

  • Latency. How long does the pipeline take to run end-to-end?
  • Complexity. How difficult is it to author the pipeline and maintain it over time?
  • Compute cost. How much does the pipeline actually cost in terms of compute hours?
  • Resilience to change in data scale. How well does the pipeline cope as additional data flows into it over time?

The three types of pipelines are:

  • Batch pipelines fully recompute each dataset whose inputs have changed on every run.
  • Incremental pipelines only process new data that has changed since the last run.
  • Streaming pipelines run continuously and process new data as it arrives in the platform.

Below, we discuss each type of pipeline, its tradeoffs, and how to get started authoring it. For convenience, here is a summary table of the types of pipelines according to the tradeoffs mentioned above.

Criterion                            Batch     Incremental   Streaming
Latency                              High      Low           Very low
Complexity                           Low       Medium        High
Compute cost                         Medium    Low           High
Resilience to change in data scale   Low       High          High

Batch

In a batch pipeline, all datasets in the pipeline are fully recomputed whenever upstream data changes. Because everything is recomputed, the end-to-end performance of the pipeline is very consistent over time, and the code and maintenance complexity of the pipeline is minimal. To enable more users to contribute to batch pipelines, a broad set of languages and tools are available for batch pipeline authoring, including SQL.

Examining batch pipelines according to the criteria above:

  • Latency of batch pipelines can be very high: whenever new data lands in the platform, all data must be reprocessed, even data that has not changed since the last sync. However, latency may be low if the overall data scale is small.
  • Complexity of batch pipelines is very low. Although it is usually still necessary to understand table manipulation concepts such as filters, joins, and aggregations, minimal further knowledge is necessary.
  • Compute costs for batch pipelines can be high, as a large amount of repeated computation occurs on every build of the pipeline. Again, this factor can be ignored if overall data scale is small.
  • Resilience to change in data scale is low. If large amounts of new data flow into the pipeline, such as when input datasets represent high-volume events, batch pipelines will become unmanageably expensive and take too long to run.

In most cases, you should begin pipeline development in Foundry by creating a batch pipeline and extending it to support incremental computation as the use case for the pipeline is validated. In many cases, you can keep using a batch pipeline indefinitely, especially if your data scale is low (e.g., less than tens of millions of rows).

If you expect that you will need to make your pipeline incremental in the future, we recommend using either Python or Java for batch pipeline development, as these languages support incremental computation.
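The batch model can be sketched in a few lines of plain Python (a simplified illustration, not Foundry's transforms API; all function and variable names here are hypothetical): every run reads the full input and recomputes the full output, regardless of what changed.

```python
def run_batch_build(read_full_input, write_full_output, transform):
    """One batch build: read everything, recompute everything, replace the output."""
    rows = read_full_input()                # the entire input dataset, changed or not
    result = [transform(row) for row in rows]  # full recomputation on every run
    write_full_output(result)               # the output dataset is overwritten wholesale
    return result

# Toy usage: doubling a column in a small in-memory "dataset".
dataset = [{"value": 1}, {"value": 2}, {"value": 3}]
output = run_batch_build(
    read_full_input=lambda: dataset,
    write_full_output=lambda rows: None,    # stand-in for writing the output dataset
    transform=lambda row: {"value": row["value"] * 2},
)
```

Note that the cost of each run scales with the total size of the input, which is why batch latency and compute cost grow as data accumulates.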

Get started by learning how to create a batch pipeline in Pipeline Builder, or by following the tutorials for other languages:

Incremental

In an incremental pipeline, only the rows or files of data that have changed since the last build are processed. This is suitable for event data and other datasets where large volumes of new data arrive over time. In addition to reducing the overall amount of computation, the end-to-end latency of the pipeline can be reduced significantly as compared to batch pipelines. Only Python and Java APIs are available for incremental computation.

Examining incremental pipelines according to the criteria above:

  • Latency of incremental pipelines can be made very low—on the order of minutes. In our experience, this is sufficiently low latency to meet the vast majority of organizational requirements.
  • Complexity of incremental pipelines is medium-to-high. Rather than just operating on the high-level concept of a tabular dataframe, writing and maintaining an incremental pipeline requires understanding how data flows into Foundry dataset transactions, how to handle cases when input datasets are updated non-incrementally, and how to maintain high performance over time.
  • Compute costs for incremental pipelines can be lower than for batch pipelines on high-scale datasets, as the amount of actual computation occurring can be minimized.
  • Resilience to change in data scale is high. Because only new data is processed, incremental pipelines avoid needing to redo computation on large datasets where most of the data is not changing.
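The incremental pattern can be sketched with a simple high-water mark (a toy illustration only, not Foundry's incremental transforms API; the names and the dictionary-based state are hypothetical): each build processes only the rows appended since the previous build and appends its results to the output, rather than rewriting it.

```python
def run_incremental_build(all_rows, state, transform):
    """One incremental build: process only rows added since the last build.

    `state` carries a high-water mark (count of rows already processed)
    between builds, playing the role that transaction metadata plays in
    a real incremental pipeline.
    """
    start = state.get("processed", 0)
    new_rows = all_rows[start:]             # only the unseen tail of the input
    results = [transform(row) for row in new_rows]
    state["processed"] = len(all_rows)      # advance the watermark for the next build
    return results                          # appended to the output, not a full rewrite

state = {}
rows = [1, 2, 3]
first = run_incremental_build(rows, state, lambda x: x * 10)   # processes 1, 2, 3
rows += [4, 5]
second = run_incremental_build(rows, state, lambda x: x * 10)  # processes only 4, 5
```

The second build touches only the two new rows, which is why compute cost stays roughly proportional to the amount of new data rather than the total data scale. A real incremental pipeline must also handle the case where an input is rewritten non-incrementally, typically by falling back to a full recomputation.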

To learn more about incremental pipelines, refer to these resources:

Streaming

In a streaming pipeline, your code runs continuously to process any new data that streams into Foundry, enabling the lowest levels of latency but incurring the highest amounts of complexity and compute costs. In general, it is helpful to think about streaming pipelines as closer to managing a microservice than managing a compute job—you need to be very thoughtful about uptime, resiliency, and stateful operations in order to run a streaming pipeline successfully.

Examining streaming pipelines according to the criteria above:

  • Latency of streaming pipelines can be made very low. If you have a strong requirement for data to be available for a use case in less than one minute, then streaming pipelines may be a good fit.
  • Complexity of streaming pipelines is very high. Writing these pipelines requires avoiding a broad set of patterns, such as stateful operations, that can unexpectedly cause instability in the future. In addition, streaming pipelines tend to have a lower tolerance for failure than batch or incremental pipelines: use cases that depend on consistently fresh data may treat any interruption as an outage.
  • Compute costs for streaming pipelines can be very high, as the nature of streaming pipelines requires compute resources to always be available to process new input data.
  • Resilience to change in data scale is high. Streaming pipelines are designed to support high throughput and can typically process higher changes in data scale than batch or even incremental pipelines.
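Conceptually, a streaming pipeline is a long-lived consumer loop rather than a scheduled job. A toy sketch using a plain Python queue (a real stream processor additionally handles checkpointing, backpressure, and restarts, which is where much of the operational complexity lives):

```python
import queue

def run_stream(source, transform, sink):
    """Continuously consume records as they arrive, until a shutdown sentinel is seen."""
    while True:
        record = source.get()           # blocks until new data arrives
        if record is None:              # sentinel value: shut the consumer down
            break
        sink.append(transform(record))  # emit each result immediately (low latency)

# Toy usage: a queue standing in for an input stream.
events = queue.Queue()
for event in [1, 2, 3, None]:
    events.put(event)

outputs = []
run_stream(events, lambda x: x + 100, outputs)
```

Because the loop must always be running and waiting for input, compute resources are held continuously even when no data is arriving, which is the source of the high compute cost noted above.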

In most cases, it is best to avoid creating a streaming pipeline unless your use case has very low latency requirements. Incremental pipelines can often be made performant down to minute-level end-to-end latencies to meet most needs without incurring the added complexity and compute costs of streaming pipelines.

To learn more about streaming pipelines, refer to the following resources:

Additional documentation for streaming pipelines will be available soon. If you are interested in building a streaming pipeline, contact your Palantir representative.