Data Expectations are a set of requirements, defined in code, on dataset inputs or outputs. These requirements, or "expectations," can be used to create checks that improve data pipeline stability. If a Data Expectations check fails as part of a dataset build, the build can be aborted automatically to save time and resources and to avoid issues in downstream data. Data Expectations are integrated with Data Health for monitoring.
Get started by viewing the guide in the Python Transforms documentation, or see the reference of all available expectations.
Data Expectations are defined on the dataset transform in the relevant Code Repository. Checks can be applied to the transform's inputs and outputs (see the guide for details). Each check name must be unique within a single transform.
Alongside its expectation, a check defines how failures are handled at build time. When a check fails, the build can either be aborted or continued with a warning.
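As a minimal sketch of the pattern described above (the dataset paths and column names are hypothetical), a check in Python Transforms pairs an expectation with a unique name and an on-error policy:

```python
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E

@transform_df(
    Output(
        "/Project/datasets/clean_users",  # hypothetical output path
        checks=Check(
            E.primary_key("user_id"),     # the expectation itself
            "Unique user_id",             # check name: must be unique within the transform
            on_error="FAIL",              # abort the build on failure; "WARN" continues with a warning
        ),
    ),
    users=Input("/Project/datasets/raw_users"),  # hypothetical input path
)
def clean_users(users):
    return users.dropDuplicates(["user_id"])
```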
The check is registered during CI on the relevant branch. Changing the expectations on a protected branch requires a pull request, just like any other code change.
When making changes to a protected branch, it is recommended to build the dataset on a development branch to verify that your Data Expectations are met before merging the changes into the default branch.
The registered checks run as part of the build job. Failures to meet Data Expectations are highlighted in the Builds application and in the dataset History tab. If the check is defined with on_error set to FAIL, the job status changes to "Aborted" with an appropriate error. In the Job timeline you can find the "Expectations" indicator; clicking the indicator shows the check results and a breakdown of the individual expectations.
When a pre-condition fails, the build of the transform's output is aborted (rather than the build of the input on which the pre-condition was defined). To abort builds of an input dataset itself, the Data Expectation must be defined as a post-condition on the transform that produces that input dataset.
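For illustration (paths and column names are hypothetical), a pre-condition is a check attached to an Input. It runs when this transform builds, and on failure it aborts this transform's output, not the upstream dataset:

```python
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E

@transform_df(
    Output("/Project/datasets/enriched_orders"),
    orders=Input(
        "/Project/datasets/orders",        # upstream dataset, built elsewhere
        checks=Check(
            E.col("order_id").non_null(),  # pre-condition on this transform's input
            "order_id is never null",
            on_error="FAIL",               # aborts *this* build, not the upstream one
        ),
    ),
)
def enriched_orders(orders):
    return orders
```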
Each check run produces a result that is reported to Data Health. The most recent Data Expectations results are presented in the Health tab of the Dataset Preview application, where notifications and issue triggers can be set (as with other Data Health checks).
Remember that checks on a dataset are uniquely identified by their names. The history of a check, as well as its individual monitoring settings, is retained only as long as its name does not change; renaming a check is equivalent to removing the old check and creating a new one in its place.
All checks run on full datasets, regardless of the incremental nature of the transform.
For example, suppose a transform runs incrementally and has a primary key check on its output. Since Data Expectations checks always run on the full dataset, the check will fail if the new transaction (about to be written incrementally) contains a primary key that has already been written in a previous transaction.
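As a sketch of this scenario (paths and column names are hypothetical), the primary key check below is evaluated over the entire output dataset, so a key arriving in a new incremental transaction that already exists in a prior transaction will fail the check:

```python
from transforms.api import transform_df, incremental, Input, Output, Check
from transforms import expectations as E

@incremental()
@transform_df(
    Output(
        "/Project/datasets/events_clean",
        checks=Check(
            E.primary_key("event_id"),  # evaluated against the FULL output dataset,
            "Unique event_id",          # not just the rows in this transaction
            on_error="FAIL",
        ),
    ),
    events=Input("/Project/datasets/events_raw"),
)
def events_clean(events):
    # In incremental mode, `events` contains only the unprocessed rows, but the
    # check still compares the appended keys against all previous transactions.
    return events
```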