Types of checks

This page outlines various types of checks available in Data Health, including job-level checks, build-level checks, and freshness checks.

Job-Level Checks vs Build-Level Checks

Understanding Job Status, Build Status & Build Duration

The definitions below clarify what a Job and a Build are in Foundry:

  • Job: a Spark computation defined by the logic in a single transform. In other words, a job is a single transform that produces a single dataset (or several if a multi-output transform is used). Jobs are broken down into a set of stages.
  • Build: a collection of jobs with defined target datasets (as defined in the schedule, or as you can see from the datasets listed on the main Builds application page).

We use the following Data Health Checks to ensure Jobs and Builds are running successfully:

  • Job status: this check is triggered whenever the dataset on which it is installed is refreshed or created as part of any build. A job status check succeeds if the target dataset builds successfully, even if the build it belongs to fails downstream. Note, however, that if the build fails upstream of the target dataset, the target dataset registers a "Cancelled" build and its job status is not evaluated.
  • Build duration & build status: these allow you to monitor the status of a build (including all intermediates). However, they are only triggered when installed on outputs (target datasets of a build).
    • These should only be installed on outputs. It does not make sense to install these checks on intermediates in a schedule, as they will never be triggered when the schedule builds.
    • In general, it is recommended that all schedules have a build status check installed on their outputs. If you have a build status check, it is neither necessary nor recommended to install job status checks on other datasets built by the same schedule: any job failure will already cause the build status check to fail.
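The difference between the two check types can be sketched in plain Python. Everything below (the `JobResult` class and both functions) is illustrative only and is not part of any Foundry API; it just models the semantics described above.

```python
from dataclasses import dataclass

@dataclass
class JobResult:
    dataset: str
    status: str  # "SUCCEEDED", "FAILED", or "CANCELLED" (cancelled when an upstream job failed)

def job_status_check(build: list, target: str) -> str:
    """A job status check looks only at the job for the dataset it is installed on."""
    job = next(j for j in build if j.dataset == target)
    if job.status == "CANCELLED":
        return "NOT_EVALUATED"  # upstream failure: the check is never evaluated
    return "PASS" if job.status == "SUCCEEDED" else "FAIL"

def build_status_check(build: list) -> str:
    """A build status check fails if *any* job in the build fails, including intermediates."""
    return "PASS" if all(j.status == "SUCCEEDED" for j in build) else "FAIL"

# A build where an intermediate job fails, cancelling the downstream target:
build = [
    JobResult("raw", "SUCCEEDED"),
    JobResult("clean", "FAILED"),
    JobResult("output", "CANCELLED"),
]
```

Here `job_status_check(build, "raw")` passes even though the build as a whole fails, `job_status_check(build, "output")` is not evaluated because of the upstream failure, and `build_status_check(build)` fails because one job in the build failed.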

When trying to determine when and where to place job status or build status checks, see our guide on what health checks to apply.

For more details and further clarification on the checks themselves, see the checks reference for build status and job status.

Freshness Checks

Understanding Sync Freshness, Data Freshness & Time Since Last Updated

All three of these checks are concerned with “freshness” (i.e. how up-to-date some aspect of your data is), but they all use different methods to evaluate freshness:

  • Time since last updated: this evaluates the freshness of the dataset. It calculates how much time has elapsed between the current time and the last committed transaction (even if that transaction was empty; an empty transaction does not change the data in the dataset).
  • Data freshness: this evaluates freshness of the data in the dataset. It calculates how much time has elapsed between the last transaction committed and the maximum value of a timestamp column. This check is only run when a transaction is committed.
  • Sync freshness: this evaluates the freshness of the data in the synced dataset (e.g. a Phonograph table). It calculates how much time has elapsed between the time of the latest sync of a dataset and the maximum value of a datetime column.

For both data and sync freshness, it is ideal if the timestamp in the column represents the time when the row was added in the source system.
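As a rough sketch of how the three metrics differ, the following plain-Python calculation uses hypothetical timestamps; none of these variable names come from Foundry.

```python
from datetime import datetime, timedelta

now = datetime(2024, 1, 10, 12, 0)
last_transaction = datetime(2024, 1, 10, 9, 0)  # last transaction committed to the dataset
max_event_time = datetime(2024, 1, 9, 18, 0)    # maximum value of the timestamp column
last_sync = datetime(2024, 1, 10, 11, 0)        # latest sync of the dataset (e.g. to a Phonograph table)

# Time since last updated: current time vs. last committed transaction.
time_since_last_updated = now - last_transaction    # 3 hours

# Data freshness: last committed transaction vs. max timestamp in the data.
data_freshness = last_transaction - max_event_time  # 15 hours

# Sync freshness: latest sync time vs. max timestamp in the data.
sync_freshness = last_sync - max_event_time         # 17 hours
```

Note that the same dataset can look "fresh" by one metric and stale by another: here the dataset was updated only 3 hours ago, but the data inside it lags its source by 15 hours.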

When trying to determine when and where to place freshness checks, see our guide on what health checks to apply.

For more details on the checks themselves, see the checks reference for time since last updated, data freshness, and sync freshness.

Can I abort builds when Health Checks fail?

Most standard health checks can only be computed after a job finishes. If your dataset is created in a Code Repository, you can use Data Expectations to define checks that run during build time. This allows you to abort the build on error while still monitoring the checks in Data Health.
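The key distinction is that a build-time check can stop bad data from landing, whereas a post-build check only reports on it. The sketch below illustrates that abort-on-failure behavior in plain Python; it is not the Data Expectations API (see the Data Expectations documentation for the real syntax), and all names here are made up for illustration.

```python
class CheckFailure(Exception):
    """Raised to abort a build when a build-time check fails."""

def non_null_check(rows: list, column: str) -> None:
    """A build-time expectation: raise (aborting the build) instead of
    merely recording a failure after the fact."""
    if any(row.get(column) is None for row in rows):
        raise CheckFailure(f"column {column!r} contains nulls")

def build_dataset(rows: list) -> list:
    # The check runs as part of the build itself; if it raises,
    # the output is never written.
    non_null_check(rows, "id")
    return rows
```

With this pattern, `build_dataset([{"id": 1}])` succeeds, while a row with a null `id` raises `CheckFailure` and the build produces no output at all.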