What is a data pipeline?

The overall goal of data integration in Foundry is to provide a digital view of the objective reality within your Organization. Achieving this goal typically requires syncing data from many source systems, imposing a common schema, combining datasets, and enabling teams to build use cases on top of a common data foundation.

Within this context, the term "data pipeline" is widely used to refer to the flow of data from a source system through intermediate datasets to ultimately produce high-quality, curated datasets that can be structured into the Ontology or serve as the foundation of machine learning and analytical workflows.

Although any two datasets in Foundry connected via transformation logic could be considered a pipeline, in practice what we refer to as a "data pipeline" is more constrained. Typically, a pipeline has a notion of ownership: a person or group of people who oversees the pipeline to ensure that data flows through it regularly and reliably to power business processes.

Beyond a notion of ownership, there are several other characteristics associated with a high-quality, production-ready data pipeline. We explore these ideas throughout the rest of this document and provide links to additional resources to learn more:

  • Pipeline setup
  • Build scheduling
  • Data quality
  • Security and governance
  • Support processes and documentation

In addition to features common to all pipelines, consider which type of pipeline should be created for your data foundation based on factors such as data scale, latency requirements, and maintenance complexity. There are three primary types of pipelines available in Foundry: batch, incremental, and streaming. Learn more about the types of pipelines.

Pipeline setup

Foundry's Pipeline Builder enables users to quickly and easily set up a pipeline with a streamlined point-and-click interface. With Pipeline Builder, users gain the benefits of Git-style change management, data health checks, multi-modal security, and fine-grained data auditing.

Technical users can build and maintain pipelines more rapidly than ever before, focusing on declarative descriptions of their end-to-end pipelines and desired outputs. Additionally, Pipeline Builder's point-and-click, form-based interface enables less technical users to create pipelines through a simplified approach.
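Pipelines can also be authored in code through Code Repositories, which is referenced again in the data quality section below. As a rough illustration only (the dataset paths and column names are hypothetical), a single declarative pipeline step written with Foundry's Python transforms API might look like the following sketch:

    from transforms.api import transform_df, Input, Output

    # One pipeline step: join two synced datasets into an intermediate
    # dataset that downstream transforms can build on. The paths and
    # column names here are illustrative placeholders.
    @transform_df(
        Output("/Org/Project/datasets/intermediate/orders_enriched"),
        orders=Input("/Org/Project/datasets/raw/orders"),
        customers=Input("/Org/Project/datasets/raw/customers"),
    )
    def enrich_orders(orders, customers):
        return orders.join(customers, on="customer_id", how="left")

Pipeline Builder expresses the same kind of step through its visual interface, so the choice between the two approaches is largely a question of which authoring experience best suits the team maintaining the pipeline.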

Build scheduling

Simply put, a series of data transformations must run regularly in order to be considered a data pipeline. Defining build schedules in Foundry is a basic step in building a pipeline, as downstream data consumers expect data to update regularly. The frequency with which data flows through a pipeline is subject to organizational requirements: some pipelines may run only weekly or daily, while others run on an hourly or even more frequent basis.

The following resources help you get started with scheduling builds in Foundry:

Data quality

In the initial stages of defining a pipeline, we recommend frequently checking the quality of inputs and outputs at every step. Data synced from source systems often contains null or undefined values, poorly formatted fields, and inconsistent records. Cleaning and normalizing data is a core part of the pipeline building process.
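
For example, a cleaning step might trim whitespace, standardize casing and date formats, and convert empty strings to nulls before building downstream datasets. The snippet below is a minimal PySpark sketch; the column names are hypothetical:

    from pyspark.sql import functions as F

    # Normalize a raw synced dataset. Column names are illustrative.
    def clean(raw_df):
        return (
            raw_df
            # Trim whitespace and standardize casing for the join key.
            .withColumn("customer_id", F.upper(F.trim(F.col("customer_id"))))
            # Parse date strings into a proper date type.
            .withColumn("order_date", F.to_date(F.col("order_date"), "yyyy-MM-dd"))
            # Treat empty strings as missing values so checks can catch them.
            .withColumn(
                "region",
                F.when(F.trim(F.col("region")) == "", F.lit(None)).otherwise(F.col("region")),
            )
            # Drop rows that are missing the key entirely.
            .filter(F.col("customer_id").isNotNull())
        )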

Tools for checking assumptions about datasets are available throughout Foundry:

  • Dataset Previews support computing statistics on any column of a dataset, as well as filtering to a subset of rows to quickly check expectations.
  • Code Repositories' support for debugging transforms can be used to check that input datasets are structured as expected while authoring transformation logic.
  • Applications in Foundry's analytical suite, especially Contour, can be very helpful for validating assumptions about datasets in a point-and-click fashion.
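
Assumptions like these can also be encoded directly in transformation logic so they are evaluated on every build. The sketch below assumes the data expectations support in Foundry's Python transforms (the Check class and expectations module); the dataset paths, column names, and exact expectation names are illustrative and may differ from what your repository exposes:

    from transforms.api import transform_df, Input, Output, Check
    from transforms import expectations as E

    # Checks run on every build of this transform: a FAIL check stops the
    # build, while a WARN check surfaces a warning without blocking it.
    # Paths, column names, and expectation names are illustrative.
    @transform_df(
        Output(
            "/Org/Project/datasets/clean/orders_clean",
            checks=[
                Check(E.primary_key("order_id"), "order_id is unique and non-null", on_error="FAIL"),
                Check(E.col("order_date").non_null(), "order_date is populated", on_error="WARN"),
            ],
        ),
        orders=Input("/Org/Project/datasets/intermediate/orders_enriched"),
    )
    def validate_orders(orders):
        return orders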

After your pipeline has been established, health checks are the recommended way to validate that data remains high-quality over time. Here are some resources to get started with health checks:

Security and governance

Foundry's platform security primitives provide best-in-class capabilities for securing a data foundation and ensuring that sensitive data is handled appropriately. The cross-cutting concepts of Projects and Markings provide support for discretionary and mandatory controls, respectively, which can be used to comply with the full range of governance requirements.

To learn more about how to handle data securely in your pipeline, refer to these sections:

Support processes and documentation

Once a pipeline has been published to production by following the above guidelines, it is important to think through the longevity of the pipeline from an organizational perspective. Support processes for pipeline maintenance should be established, expectations should be clearly defined, and documentation should be kept available so that pipelines remain high-quality even as they are handed off from one team to another.

Learn more about these best practices: