Checks reference

Status checks

Schedule status

Checks whether the most recent build of the schedule succeeded or failed.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

A schedule status check reflects the status of a pipeline, or of a set of datasets that always build together. As a result, it reports a single status across the various steps leading to the creation or update of the final dataset.

Build status

Checks whether the most recent build of the dataset succeeded or failed.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

A build status check reflects the status of the whole process that leads to a final dataset being built. As a result, it reports a status across the various steps leading to the creation or update of this final dataset. Note that if intermediate datasets that are created or updated during the process also have a build status check, those checks will not be updated. However, the job status will be updated for all of these intermediate datasets.

Job status

Checks whether the most recent job run on a dataset succeeded or failed.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

A job status check triggers independently of the build that causes the dataset to be refreshed or created. In other words, whether or not the dataset is the final output of a given build, the job status check runs for every build of that dataset.

When to use job status or build status checks

Use a build status check when the dataset is an output of a build and you want to check that the whole build on all datasets, including this dataset, succeeded. Use a job status check when the dataset is an intermediate dataset of the build and you want to check whether the dataset got updated, regardless of whether other datasets in the build were successfully updated.

Build status and job status will be equivalent if the dataset is the only output of a build. They may differ if the dataset is an intermediate dataset or if the build has multiple outputs, and the job on the dataset succeeds (or does not run), but other jobs in the build fail and cause the build to fail.
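
The difference between the two checks can be illustrated with a small model. This is only a sketch, not a platform API: the `Job` dataclass and the `build_status` and `job_status` helpers are hypothetical names, and the model assumes a build fails as soon as any of its jobs fails.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Job:
    """Hypothetical record of one job in a build (not a platform API)."""
    dataset: str
    succeeded: bool


def build_status(jobs: Dict[str, Job]) -> bool:
    """Assumption: a build succeeds only if every job in it succeeds."""
    return all(job.succeeded for job in jobs.values())


def job_status(jobs: Dict[str, Job], dataset: str) -> Optional[bool]:
    """Status of the job on one dataset, or None if no job ran on it."""
    job = jobs.get(dataset)
    return None if job is None else job.succeeded


# One build with an intermediate dataset and two outputs.
build = {
    "clean": Job("clean", succeeded=True),         # intermediate dataset
    "report_a": Job("report_a", succeeded=True),   # output
    "report_b": Job("report_b", succeeded=False),  # output whose job failed
}

print(build_status(build))           # False: one job in the build failed
print(job_status(build, "clean"))    # True: the job on this dataset succeeded
```

In this model, a job status check on `clean` or `report_a` passes while a build status check on either output fails, which is the case where the two checks differ.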

Sync status

Checks whether the most recent sync of the dataset to another database succeeded or failed.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Sync destination | Which sync of the dataset to monitor, relevant especially when the dataset syncs to multiple destinations. | phonograph2-cache-worker, jdbc-worker | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Time checks

Build duration

Checks whether the total time a build takes to complete meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Build duration | Total time a build takes to complete (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Median deviation | Difference (in approximate standard deviations) from the median time to complete recent builds | 1 Standard deviations, 10 Recent builds | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

As with the build status check, the build duration check is only updated for the terminal output of the build. Build duration checks attached to intermediate datasets that are part of a larger build will not be updated.

Data freshness

Checks the time of the latest transaction on a dataset against the maximum value of a timestamp column. If the timestamp in the column represents when the row was added, this can be used to measure exact data freshness.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Column name of the column containing the time of the last update. | LAST_UPDATED | Y |
| Freshness range | Time range during which to consider the column's latest data as "fresh" (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |
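
One way to read this check is as the gap between the time of the latest transaction and the newest value in the timestamp column. The sketch below illustrates that reading only; the `data_freshness` function, the sample values, and the one-hour threshold are all assumptions made for the example and do not describe the platform's implementation.

```python
from datetime import datetime, timedelta
from typing import Iterable


def data_freshness(transaction_time: datetime,
                   column_values: Iterable[datetime]) -> timedelta:
    """Gap between the latest transaction and the newest timestamp in the column."""
    return transaction_time - max(column_values)


# Hypothetical values for a LAST_UPDATED column and the latest transaction time.
last_updated = [
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 9, 30),
    datetime(2024, 1, 1, 11, 15),
]
latest_transaction = datetime(2024, 1, 1, 12, 0)

freshness = data_freshness(latest_transaction, last_updated)

# Example rule: freshness range "Less than or equal to 1" (hour).
passes = freshness <= timedelta(hours=1)
print(freshness, passes)
```

Here the newest row is 45 minutes older than the latest transaction, so a rule of "Less than or equal to 1" hour would pass.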

Sync duration

Checks whether the total time a sync takes to complete meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Sync destination | Which sync of the dataset to monitor, relevant especially when the dataset syncs to multiple destinations. | phonograph2-cache-worker, jdbc-worker | Y |
| Sync duration | Total time a sync takes to complete (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Median deviation | Difference (in approximate standard deviations) from the median time to complete recent syncs | 1 Standard deviations, 10 Recent builds | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Sync freshness

Checks the time of the latest sync of a dataset against the maximum value of a datetime column. If the timestamp in the column represents when the row was added, this can be used to measure exact data freshness.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Column name of the column containing the time of the last update. | LAST_UPDATED | Y |
| Freshness range | Time range during which to consider the column's latest data as "fresh" (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Time since last updated

Checks whether the total time since the dataset was last updated (had a new transaction) meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Last updated | Total time since the dataset was last updated (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Median deviation | Difference (in approximate standard deviations) from the median update time of recent builds | 1 Standard deviations, 10 Recent builds | N |
| Ignore empty transactions | Whether to exclude empty transactions when checking time since updated/median deviation. Transactions with no files will be ignored, as if they had not existed | Y, N | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Schedule | Schedule check to run automatically or manually | Automatic, Custom Schedule | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Time since sync last updated

Checks whether the total time since the dataset last synced to some destination meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Last Sync | Total time since the dataset last synced to some destination (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Median deviation | Difference (in approximate standard deviations) from the median update time of recent builds | 1 Standard deviations, 10 Recent builds | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Size checks

Dataset file count

Checks the total number of files in the latest view of the dataset.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| File count | Total number of files in the most recent view of a dataset | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Median deviation | Difference (in approximate standard deviations) from the median number of files in recent builds | 1 Standard deviations, 10 Recent builds | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Dataset partition

Checks if the partitioning of the dataset is performant.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Notes | The partitioning check works as follows: if there are fewer than 50 files in total, the check always passes; if there are 50 or more files in total, the check passes if at least 90% of the files are more than 96 MB in size. If the check fails, the partitioning of the data across files is sub-optimal for performance and the data needs to be partitioned better. | No options to configure | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |
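
The partitioning rule can be sketched in a few lines. The function below is an illustration of the rule as described in this section, not the platform's implementation; the name `partition_check_passes` is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
from typing import Sequence

MB = 1024 * 1024  # assumption: 1 MB is taken as 1024 * 1024 bytes


def partition_check_passes(file_sizes_bytes: Sequence[int],
                           min_files: int = 50,
                           min_size_bytes: int = 96 * MB,
                           min_fraction: float = 0.9) -> bool:
    """Apply the partitioning rule described above to a list of file sizes."""
    if len(file_sizes_bytes) < min_files:
        return True  # fewer than 50 files: the check always passes
    large = sum(1 for size in file_sizes_bytes if size > min_size_bytes)
    # 50 or more files: at least 90% must be larger than 96 MB
    return large / len(file_sizes_bytes) >= min_fraction


few_small_files = [1 * MB] * 10
well_partitioned = [128 * MB] * 100
many_small_files = [128 * MB] * 80 + [1 * MB] * 20

print(partition_check_passes(few_small_files))
print(partition_check_passes(well_partitioned))
print(partition_check_passes(many_small_files))
```

The first two inputs pass (too few files to evaluate, and all files above the size threshold), while the third fails because only 80% of its 100 files exceed 96 MB.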

Row count

Checks the total number of rows in the dataset.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Row count | Total number of rows in a dataset | Between 500 and 1000, Greater than or equal to 100, Less than or equal to 1000, Equal to 10 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Median deviation | Difference (in approximate standard deviations) from the median row count in recent builds | 1 Standard deviations, 10 Recent builds | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

If the row count check is set against the last successful check result, the check will evaluate the criteria according to the row count recorded in the previous passing check, and will not consider the results in failed checks.

Transaction file count

Checks the total number of files committed in one transaction, excluding log files.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| File count | Total number of files committed in a transaction | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Median deviation | Difference (in approximate standard deviations) from the median number of files in recent builds | 1 Standard deviations, 10 Recent builds | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Transaction file size

Checks the total size of the files committed in one transaction, excluding log files.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| File size | Total size of all files committed in a transaction (in MB or KB) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Median deviation | Difference (in approximate standard deviations) from the median file size in recent builds | 1 Standard deviations, 10 Recent builds | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Content checks

Allowed column values

Checks if the values in a column match a list of allowed values.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Column name to check against | FIRST_NAME | Y |
| Allowed values | Allowed possible values for above column | John, Jane | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Approximate unique percentage

Checks what percentage of values in a column are unique. Because the percentage is approximate, this check is not suitable for verifying that a column is a primary key (100% unique values); use the primary key check instead.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Column name to check against | FIRST_NAME | Y |
| Unique percentage | Values that are unique in the column (in %) | Between 10 and 20, Greater than or equal to 50, Less than or equal to 50, Equal to 1 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Column regex

Checks if the values in a column match a certain regular expression.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Column name to check | FIRST_NAME | Y |
| Regex | Regular expression the column should match | ^Pre, post$, .*any.* | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |
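
The example patterns can be tried out locally before configuring a check. The sketch below uses Python's `re` module, which is an assumption: the regex engine used by the check, and whether a pattern must match the whole value or only part of it, are not stated here. The helper `non_matching_values` is a hypothetical name and uses search semantics (a match anywhere in the value counts).

```python
import re
from typing import Iterable, List


def non_matching_values(values: Iterable[str], pattern: str) -> List[str]:
    """Return the values in which the pattern is not found (re.search semantics)."""
    compiled = re.compile(pattern)
    return [value for value in values if compiled.search(value) is None]


# "^Pre" requires the value to start with "Pre".
print(non_matching_values(["Prefix", "Preston", "Apres"], r"^Pre"))   # ['Apres']

# "post$" requires the value to end with "post".
print(non_matching_values(["outpost", "postal"], r"post$"))           # ['postal']

# ".*any.*" accepts any value containing "any".
print(non_matching_values(["company", "none"], r".*any.*"))           # ['none']
```

If the check requires the whole value to match, replacing `compiled.search` with `compiled.fullmatch` would model that behavior instead.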

Approximate column relation (deprecated)

This check provides an estimate of similarity between two columns as a percentage. For an exact check, use data expectations instead.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Other dataset | Dataset to check against | /Users/John Appleseed/Stock_Prices_Latest | Y |
| Column 1 name | Column name of the dataset on which the check is set | FIRST_NAME | Y |
| Column 2 name | Column name of the other dataset | f_name | Y |
| Percentage match | To what extent the two columns must match (in %) | 85% of values are equal | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Date range

Checks for the range of values in a date column.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the column to check | LAST_UPDATED | Y |
| Allowed date range | Allowed date range for the column | 2017-01-01 – 2018-01-01 | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Null percentage

Checks what percentage of values in a column are null.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the column to check | CUSTOMER_ID | Y |
| Null percentage | Percentage of values that are null in the column (in %) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Median deviation | Difference (in approximate standard deviations) from the median null percentage of recent builds | 1 Standard deviations, 10 Recent builds | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Numeric mean

Checks whether the average of a numeric column meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the numeric column to check | NUM_FAILURES | Y |
| Mean | Desired mean of the column | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Difference from last check | Compare the current mean of the column to the mean of the column at the last check run, ± an optional constant | Greater than the last check + 5 | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Numeric median

Checks whether the median of a numeric column meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the numeric column to check | NUM_FAILURES | Y |
| Median | Desired median of the column | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Difference from last check | Compare the current median of the column to the median of the column at the last check run, ± an optional constant | Greater than the last check + 5 | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Numeric range

Checks the range of values in a numeric column.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the numeric column to check | NUM_FAILURES | Y |
| Allowed range | Allowed range for the column | 3-5 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Primary key

Checks that the values in a column are 100% unique and non-null.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the column to check | PART_ID | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |
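
The condition this check verifies (every value present and no value repeated) can be expressed as a short function. This is a sketch of the condition only, with the hypothetical name `is_primary_key`; it says nothing about how the platform evaluates the check on large datasets.

```python
from typing import Hashable, Iterable, Optional


def is_primary_key(values: Iterable[Optional[Hashable]]) -> bool:
    """True if every value is non-null and no value appears more than once."""
    seen = set()
    for value in values:
        if value is None or value in seen:
            return False  # a null or a duplicate breaks the primary key property
        seen.add(value)
    return True


print(is_primary_key(["P-001", "P-002", "P-003"]))   # True
print(is_primary_key(["P-001", "P-002", "P-001"]))   # False: duplicate value
print(is_primary_key(["P-001", None]))               # False: null value
```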

Schema checks

Column

Checks for the existence and type of a column.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column Name | Name of the column to check for | PART_ID | Y |
| Is Present | Check existence of column | Y | Y |
| Type | Type of the column | Integer | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Column count

Checks for the total number of columns in the dataset.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column count | Total number of columns in the dataset | 50 | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Schema

Checks that the dataset schema satisfies the chosen comparison type. The available comparison types are described below.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Columns | Enumerates the dataset columns and types; can choose full type match or column existence only | Type: String | Y |
| Comparison type | Specify which comparison policy will be used | Text | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Available schema check types are the following:

| Value | Comparison allowance |
| --- | --- |
| EXACT_MATCH_ORDERED_COLUMNS | Checks column order, names and types, and number of columns. |
| EXACT_MATCH_UNORDERED_COLUMNS | Checks column names and types, and number of columns. Order does not matter. |
| COLUMN_ADDITIONS_ALLOWED | Checks column names and types. Extra columns are allowed, but columns cannot be missing. |
| COLUMN_ADDITIONS_ALLOWED_STRICT | Like COLUMN_ADDITIONS_ALLOWED; however, whenever a new column is added to the dataset, that column is added to the check. Added columns cannot be missing thereafter. |
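
The comparison types can be modeled as follows. This is a sketch under stated assumptions rather than the platform's implementation: a schema is represented as an ordered list of (column name, column type) pairs, and `schema_matches` and `extend_expected_strict` are hypothetical helper names. In particular, how and when the strict variant records new columns is an assumption.

```python
from typing import List, Tuple

Schema = List[Tuple[str, str]]  # ordered (column name, column type) pairs


def schema_matches(expected: Schema, actual: Schema, comparison: str) -> bool:
    """Compare an actual schema to the expected one under a comparison type."""
    if comparison == "EXACT_MATCH_ORDERED_COLUMNS":
        return actual == expected
    if comparison == "EXACT_MATCH_UNORDERED_COLUMNS":
        return sorted(actual) == sorted(expected)
    if comparison in ("COLUMN_ADDITIONS_ALLOWED", "COLUMN_ADDITIONS_ALLOWED_STRICT"):
        # Extra columns are fine; every expected column must be present.
        return set(expected).issubset(actual)
    raise ValueError(f"Unknown comparison type: {comparison}")


def extend_expected_strict(expected: Schema, actual: Schema) -> Schema:
    """Strict variant (assumed behavior): newly seen columns join the check."""
    return expected + [column for column in actual if column not in expected]


expected = [("id", "Integer"), ("name", "String")]
actual = [("name", "String"), ("id", "Integer"), ("age", "Integer")]

print(schema_matches(expected, actual, "EXACT_MATCH_UNORDERED_COLUMNS"))  # False
print(schema_matches(expected, actual, "COLUMN_ADDITIONS_ALLOWED"))       # True

# Under the strict variant, "age" becomes required once it has been seen.
strict_expected = extend_expected_strict(expected, actual)
print(schema_matches(strict_expected, expected, "COLUMN_ADDITIONS_ALLOWED_STRICT"))  # False
```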

Approximate standard deviation

Since dataset builds can easily have outliers, we do not use the true standard deviation. Instead, we use the median absolute deviation (MAD), which is a more robust measure of variability.

The MAD is defined as the median of the absolute deviations from the median of the data. For values x_1, ..., x_n with median X this means MAD = median(|x_i - X|).

The median absolute deviation can be used to approximate the standard deviation by multiplying it by a constant.

Our calculation is σ = MAD * 1.4826.

For detailed information, see the Wikipedia article on median absolute deviation.
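
The calculation above can be reproduced with Python's standard library. The sketch below follows the MAD definition and the 1.4826 constant given in this section; the function names are hypothetical, and `deviations_from_median` reflects an assumed reading of how a "median deviation" threshold compares a new value to recent ones, not a confirmed description of the check.

```python
from statistics import median
from typing import Sequence

MAD_TO_SIGMA = 1.4826  # scaling constant quoted in the text above


def median_absolute_deviation(values: Sequence[float]) -> float:
    """MAD = median(|x_i - X|), where X is the median of the values."""
    center = median(values)
    return median([abs(x - center) for x in values])


def approximate_std(values: Sequence[float]) -> float:
    """Approximate standard deviation: sigma = MAD * 1.4826."""
    return MAD_TO_SIGMA * median_absolute_deviation(values)


def deviations_from_median(value: float, history: Sequence[float]) -> float:
    """Assumed reading: distance from the median, in approximate standard deviations."""
    sigma = approximate_std(history)
    distance = abs(value - median(history))
    if sigma == 0:
        return 0.0 if distance == 0 else float("inf")
    return distance / sigma


# Hypothetical durations (in minutes) of recent builds, with one outlier.
durations = [10, 12, 11, 13, 60]

print(median_absolute_deviation(durations))   # 1
print(approximate_std(durations))
print(deviations_from_median(15, durations))
```

With these sample values the single 60-minute outlier does not move the median (12) or the MAD (1), which is the robustness property that motivates using MAD instead of the true standard deviation.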