Define data expectations

Data expectations are a set of requirements defined in code on dataset inputs or outputs. These requirements, or "expectations," can be used to create checks that improve data pipeline stability. If a Data Expectations check fails as part of a dataset build, the build can be automatically aborted to save time and resources and to avoid propagating issues downstream. Data Expectations are integrated with Data Health for monitoring.

Get started by viewing the guide in Python Transforms documentation, or see the reference of all available expectations.

Benefits of using Data Expectations

  • Pipeline protection - Because Expectations run as part of the build, you can abort builds on failed expectations, preventing failures and bad data from propagating to downstream resources.
  • Change management - Since Expectations are defined in Code Repositories, changing them will require the same pull-request review process that is set on your protected branches.
  • Proactive testing - Anticipate check failures before merging changes to protected branches by building on a development branch first.

Terminology

  • Expectation - A strongly typed requirement on the data structure or content (e.g. column is not null)
  • Check - A meaningful expectation (can be a composite of multiple expectations) that is connected to a single dataset (output or input) in a transform. A check has a name that will be used when identifying and monitoring it (e.g. “object schema validation”).
    • Pre-condition - A check that is assigned to an input of a transform, typically to validate essential assumptions on the input's structure or content before proceeding with the build.
    • Post-condition - A check that is assigned to the output of the transform, typically to guarantee dataset SLAs are maintained and downstream dependencies are protected.
  • Check result - Produced when the check runs (during build) and contains information on the expectation results and their breakdown. The check result can be monitored in Data Health.
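The relationships between these terms can be sketched in plain Python. This is an illustrative model only, not the Foundry API; all class and function names here are hypothetical:

```python
# Illustrative model of the terminology above -- not the Foundry API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Expectation:
    description: str
    predicate: Callable[[list], bool]  # strongly typed requirement on rows

@dataclass
class Check:
    name: str                        # unique per transform; used for monitoring
    expectations: List[Expectation]  # a check may be a composite of expectations

@dataclass
class CheckResult:
    check_name: str
    passed: bool
    breakdown: Dict[str, bool]  # per-expectation result, reported to Data Health

def run_check(check: Check, rows: list) -> CheckResult:
    breakdown = {e.description: e.predicate(rows) for e in check.expectations}
    return CheckResult(check.name, all(breakdown.values()), breakdown)

# Example: a composite check on a hypothetical "orders" dataset.
not_null = Expectation("id is not null",
                       lambda rows: all(r.get("id") is not None for r in rows))
positive = Expectation("amount > 0",
                       lambda rows: all(r["amount"] > 0 for r in rows))
check = Check("order schema validation", [not_null, positive])

result = run_check(check, [{"id": 1, "amount": 5}, {"id": None, "amount": 3}])
print(result.passed)     # False: one of the two expectations failed
print(result.breakdown)
```

The check passes only when every expectation in the composite passes, and the breakdown preserves the per-expectation results for monitoring.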

How does it work?

Define

Data Expectations are defined on the dataset transform in the relevant Code Repository. Checks can be applied on the transform inputs and outputs (see the guide for details). The check name must be unique in a single transform.

Alongside its expectation, a check defines how failures are handled at build time. When a check fails, the build can either be aborted or allowed to continue with a warning.
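The two failure modes can be sketched as follows. This is a hypothetical illustration of the behavior, not the Foundry API; the function name and mode strings are assumptions for the example:

```python
# Hypothetical sketch of build-time failure handling -- not the Foundry API.
# "FAIL" aborts the build; "WARN" lets it continue with a warning.

class BuildAborted(Exception):
    pass

def handle_check_result(check_name: str, passed: bool, on_error: str) -> list:
    """Return a list of warnings; raise BuildAborted for a failed FAIL-mode check."""
    if passed:
        return []
    if on_error == "FAIL":
        raise BuildAborted(f"Check '{check_name}' failed; aborting build")
    return [f"Check '{check_name}' failed (continuing with a warning)"]

print(handle_check_result("schema validation", True, "FAIL"))  # []
print(handle_check_result("row count", False, "WARN"))
try:
    handle_check_result("primary key", False, "FAIL")
except BuildAborted as e:
    print(e)
```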

The check is registered during CI on the relevant branch. Changing the expectations on protected branches will require a pull-request just like any other code change.

Before merging changes into a protected branch, it is recommended to build the dataset on a development branch to ensure your Data Expectations are met.

Run

The registered checks will run as part of the build job. Failure to meet data expectations will be highlighted in the Builds application and in the dataset History tab. If the check definition indicates FAIL on error, the job status will change to “Aborted” with an appropriate error. In the Job timeline you can find the “Expectations” indicator; clicking on the indicator will show the check results and breakdown of the different expectations.

When a pre-condition fails, the build of the transform's output is aborted (not a build of the input on which the pre-condition was defined). To abort builds of an input dataset, the Data Expectation must be defined as a post-condition on that input dataset's own transform.
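The timing of pre- and post-conditions within a single build can be sketched like this. Again, this is a hypothetical illustration, not the Foundry API:

```python
# Hypothetical sketch of when pre- and post-conditions run during a build.
# Not the Foundry API; names are illustrative.

class BuildAborted(Exception):
    pass

def run_build(compute, input_rows, pre_conditions, post_conditions):
    # Pre-conditions validate the input, but a failure aborts THIS build
    # (the output); the input dataset itself is untouched.
    for name, check in pre_conditions:
        if not check(input_rows):
            raise BuildAborted(f"pre-condition '{name}' failed")
    output_rows = compute(input_rows)
    # Post-conditions guard the output before downstream consumers see it.
    for name, check in post_conditions:
        if not check(output_rows):
            raise BuildAborted(f"post-condition '{name}' failed")
    return output_rows

doubled = run_build(
    lambda rows: [r * 2 for r in rows],
    [1, 2, 3],
    pre_conditions=[("input not empty", lambda rows: len(rows) > 0)],
    post_conditions=[("all positive", lambda rows: all(r > 0 for r in rows))],
)
print(doubled)  # [2, 4, 6]
```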

Monitor

Each check run produces a result that is reported to Data Health. The most recent Data Expectations results will be presented in the Dataset Preview application Health tab where notifications and issue triggers can be set (similar to other Data Health checks).

Remember that checks on a dataset are uniquely identified by their name. The history of a check, as well as its individual monitoring settings, will remain only as long as its name doesn’t change. Changing the name of a check is equivalent to removing the old check and creating a new one in its place.

Incremental

All checks run on full datasets, regardless of the incremental nature of the transform.

For example, let's assume we have a primary key check on the output of a transform running incrementally. Because checks always run on the full dataset, the check will fail if the new transaction (about to be written incrementally) contains a primary key that was already written in a previous transaction.
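The scenario above can be simulated in a few lines. This is a hypothetical illustration of the full-dataset semantics, not the Foundry API; the function and variable names are assumptions:

```python
# Hypothetical sketch of why a primary-key check fails for an incremental
# transform: the check runs over previous transactions PLUS the new one.
# Not the Foundry API; names are illustrative.

def primary_key_check(rows, key="id"):
    """Pass only when every value of `key` is unique across the given rows."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

previously_written = [{"id": 1}, {"id": 2}]   # earlier transactions
new_transaction = [{"id": 2}, {"id": 3}]      # incremental append with a duplicate

# The new transaction is valid in isolation ...
print(primary_key_check(new_transaction))                       # True
# ... but the check runs on the full dataset, so duplicate id=2 fails it.
print(primary_key_check(previously_written + new_transaction))  # False
```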