Checks whether the most recent build of the schedule succeeded or failed.
Rule component | Description | Example options | Required? |
---|---|---|---|
Severity | Severity of check failure | Moderate, Critical | Y |
Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
A schedule status check is representative of the status of the pipeline or set of datasets that always build together. As a result, it will give a status across the various steps leading to the creation or update of this final dataset.
Checks whether the most recent build of the dataset succeeded or failed.
Rule component | Description | Example options | Required? |
---|---|---|---|
Severity | Severity of check failure | Moderate, Critical | Y |
Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
A build status check is representative of the status of the whole process leading to a final dataset to be built. As a result, it will give a status across the various steps leading to the creation or update of this final dataset. Note that if the intermediate datasets that are updated or created during the process also have a build status health check, these will not be updated. However, the job status will be updated for all these intermediate datasets.
Checks whether the most recent job run on a dataset succeeded or failed.
Rule component | Description | Example options | Required? |
---|---|---|---|
Severity | Severity of check failure | Moderate, Critical | Y |
Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
A job status check triggers independently from the build that causes the dataset to be refreshed or created. In other words, should the concerned dataset be the ultimate output of a given build or not, the job status check will run for each and every build of a particular dataset.
Use a build status check when the dataset is an output of a build and you want to check that the whole build on all datasets, including this dataset, succeeded. Use a job status check when the dataset is an intermediate dataset of the build and you want to check whether the dataset got updated, regardless of whether other datasets in the build were successfully updated.
Build status and job status will be equivalent if the dataset is the only output of a build. They may differ if the dataset is an intermediate dataset or if the build has multiple outputs, and the job on the dataset succeeds (or does not run), but other jobs in the build fail and cause the build to fail.
Checks whether the most recent sync of the dataset to another database succeeded or failed.
Rule component | Description | Example options | Required? |
---|---|---|---|
Sync destination | Which sync of the dataset to monitor, relevant especially when the dataset syncs to multiple destinations. | phonograph2-cache-worker , jdbc-worker | Y |
Severity | Severity of check failure | Moderate, Critical | Y |
Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks whether the total time a build takes to complete meets some threshold.
Rule component | Description | Example options | Required? |
---|---|---|---|
Build duration | Total time a build takes to complete (in days, minutes, or hours) | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | N |
Median deviation | Difference (in approximate standard deviations) from the median time to complete recent builds | 1 Standard deviations, 10 Recent builds | N |
Severity | Severity of check failure | Moderate, Critical | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
As for the build status check, the build duration check will only be updated for the terminal output of the build. The intermediate datasets that are part of a larger build and have a build duration check attached to them will not be updated.
Checks the time of the latest transaction on a dataset against the maximum value of a timestamp column. If the timestamp in the column represents when the row was added, this can be used to measure exact data freshness.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column name | Column name of the column containing the time of the last update. | LAST_UPDATED | Y |
Freshness range | Time range during which to consider the column's latest data as "fresh" (in days, minutes, or hours) | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | Y |
Severity | Severity of check failure | Moderate, Critical | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks whether the total time a sync takes to complete meets some threshold.
Rule component | Description | Example options | Required? |
---|---|---|---|
Sync destination | Which sync of the dataset to monitor, relevant especially when the dataset syncs to multiple destinations. | phonograph2-cache-worker , jdbc-worker | Y |
Sync duration | Total time a sync takes to complete (in days, minutes, or hours) | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | N |
Median deviation | Difference (in approximate standard deviations) from the median time to complete recent syncs | 1 Standard deviations, 10 Recent builds | N |
Severity | Severity of check failure | Moderate, Critical | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks the time of the latest sync of a dataset against the maximum value of a datetime column. If the timestamp in the column represents when the row was added, this can be used to measure exact data freshness.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column name | Column name of the column containing the time of the last update. | LAST_UPDATED | Y |
Freshness range | Time range during which to consider the column's latest data as "fresh" (in days, minutes, or hours) | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | Y |
Severity | Severity of check failure | Moderate, Critical | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks whether the total time since the dataset has updated (had a new transaction) meets some threshold.
Rule component | Description | Example options | Required? |
---|---|---|---|
Last updated | Total time since the dataset has updated (in days, minutes, or hours) | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | N |
Median deviation | Difference (in approximate standard deviations) from the median update time of recent builds | 1 Standard deviations, 10 Recent builds | N |
Ignore empty transactions | Whether to exclude empty transactions when checking time since updated/median deviation. Transactions with no files will be ignored, as if they had not existed | Y, N | Y |
Severity | Severity of check failure | Moderate, Critical | Y |
Schedule | Schedule check to run automatically or manually | Automatic, Custom Schedule | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks whether the total time since the dataset last synced to some destination meets some threshold.
Rule component | Description | Example options | Required? |
---|---|---|---|
Last Sync | Total time since the dataset last synced to some destination (in days, minutes, or hours) | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | N |
Median deviation | Difference (in approximate standard deviations) from the median update time of recent builds | 1 Standard deviations, 10 Recent builds | N |
Severity | Severity of check failure | Moderate, Critical | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks the total number of files in the latest view of the dataset.
Rule component | Description | Example options | Required? |
---|---|---|---|
File count | Total number of files in the most recent view of a dataset | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | Y |
Severity | Severity of check failure | Moderate, Critical | Y |
Median deviation | Difference (in approximate standard deviations) from the median number of files in recent builds | 1 Standard deviations, 10 Recent builds | N |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks if the partitioning of the dataset is performant.
Rule component | Description | Example options | Required? |
---|---|---|---|
Notes | The partitioning check works as follows: - If there are less than 50 files in total, the check always passes. - If there are 50 or more files in total, the check passes if at least 90% of the files are more than 96MB in size. If the check fails, it means that the partitioning of the data across files is sub-optimal for performance and the data needs to be partitioned better. | No options to configure | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks the total number of rows in the dataset.
Rule component | Description | Example options | Required? |
---|---|---|---|
Row count | Total number of rows in a dataset | Between 500 and 1000 , Greater than or equal to 100 , Less than or equal to 1000 , Equal to 10 | Y |
Severity | Severity of check failure | Moderate, Critical | Y |
Median deviation | Difference (in approximate standard deviations) from the median row count in recent builds | 1 Standard deviations, 10 Recent builds | N |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
If the row count check is set against the last successful check result, the check will evaluate the criteria according to the row count recorded in the previous passing check, and will not consider the results in failed checks.
Checks the total number of files committed in one transaction, excluding log files.
Rule component | Description | Example options | Required? |
---|---|---|---|
File size | Total number of files committed in a transaction | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | N |
Severity | Severity of check failure | Moderate, Critical | Y |
Median deviation | Difference (in approximate standard deviations) from the median number of files in recent builds | 1 Standard deviations, 10 Recent builds | N |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks the total size of the files committed in one transaction, excluding log files.
Rule component | Description | Example options | Required? |
---|---|---|---|
File size | Total size of all files committed in a transaction (in MB or KB ) | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | N |
Severity | Severity of check failure | Moderate, Critical | Y |
Median deviation | Difference (in approximate standard deviations) from the median file size in recent builds | 1 Standard deviations, 10 Recent builds | N |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks if the values in a column match a list of allowed values.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column name | Column name to check against | FIRST_NAME | Y |
Allowed values | Allowed possible values for above column | John , Jane | Y |
Severity | Severity of check failure | Moderate, Critical | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks what percentage of values in a column are unique. The percentage is approximate. Note this means this check is not suitable for checking if a column is a primary key (100% unique values), use the primary key check instead.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column name | Column name to check against | FIRST_NAME | Y |
Unique percentage | Values that are unique in the column (in % ) | Between 10 and 20 , Greater than or equal to 50 , Less than or equal to 50 , Equal to 1 | Y |
Severity | Severity of check failure | Moderate, Critical | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks if the values in a column match a certain regular expression.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column name | Column name to check | FIRST_NAME | Y |
Regex | Regular expression the column should match | ^Pre , post$ , .*any.* | Y |
Severity | Severity of check failure | Moderate, Critical | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
This check provides an estimate of similarity between two columns as a percentage. For an exact check, use data expectations instead.
Rule component | Description | Example options | Required? |
---|---|---|---|
Other dataset | Dataset to check against | /Users/John Appleseed/Stock_Prices_Latest | Y |
Column 1 name | Column name of the dataset on which the check is set | FIRST_NAME | Y |
Column 2 name | Column name of the other dataset | f_name | Y |
Percentage match | To what extent the two columns must match (in % ) | 85% of values are equal | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks for the range of values in a date column.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column name | Name of the column to check | LAST_UPDATED | Y |
Allowed date range | Allowed date range for the column | 2017-01-01 – 2018-01-01 | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks what percentage of values in a column are null.
Rule Component | Description | Example options | Required? |
---|---|---|---|
Column name | Name of the column to check | CUSTOMER_ID | Y |
Null percentage | Percentage of values that are null in the column (in % ) | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | N |
Severity | Severity of check failure | Moderate, Critical | Y |
Median deviation | Difference (in approximate standard deviations) from the median null percentage of recent builds | 1 Standard deviations, 10 Recent builds | N |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks whether the average of a numeric column meets some threshold.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column name | Name of the numeric column to check | NUM_FAILURES | Y |
Mean | Desired mean of the column | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | N |
Severity | Severity of check failure | Moderate, Critical | Y |
Difference from last check | Compare the current mean of the column to the mean of the column at the last check run, ± an optional constant | Greater than the last check + 5 | N |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks whether the median of a numeric column meets some threshold.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column name | Name of the numeric column to check | NUM_FAILURES | Y |
Median | Desired median of the column | Between 1 and 2 , Greater than or equal to 1 , Less than or equal to 1 , Equal to 1 | N |
Severity | Severity of check failure | Moderate, Critical | Y |
Difference from last check | Compare the current mean of the column to the mean of the column at the last check run, ± an optional constant | Greater than the last check + 5 | N |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks the range of values in a numeric column.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column name | Name of the numeric column to check | NUM_FAILURES | Y |
Allowed range | Allowed range for the column | 3-5 | Y |
Severity | Severity of check failure | Moderate, Critical | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks that the values in a column are 100% unique and non-null.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column name | Name of the column to check | PART_ID | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks for the existence and type of a column.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column Name | Name of the column to check for | PART_ID | Y |
Is Present | Check existence of column | Y | Y |
Type | Type of the column | Integer | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks for the total number of columns in the dataset.
Rule component | Description | Example options | Required? |
---|---|---|---|
Column count | Total number of columns in the dataset | 50 | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Checks the dataset schema, verifying that the schema is respecting the chosen comparison type (see below for more details on the available ones).
Rule component | Description | Example options | Required? |
---|---|---|---|
Columns | Enumerating the dataset columns and types - can choose full type match or column existence only | Type: String | Y |
Comparison type | Specify which comparison policy will be used | Text | Y |
Notes | Add a note to provide additional context | Text | N |
Issues | Automatically create an issue when this check fails | Y, N | N |
Available schema check types are the following:
Value | Comparison allowance |
---|---|
EXACT_MATCH_ORDERED_COLUMNS | Checks column order, names and types, and number of columns. |
EXACT_MATCH_UNORDERED_COLUMNS | Checks column names and types, and number of columns. Order does not matter. |
COLUMN_ADDITIONS_ALLOWED | Checks column names and types. Extra columns are allowed, but columns cannot be missing. |
COLUMN_ADDITIONS_ALLOWED_STRICT | Like COLUMN_ADDITIONS_ALLOWED ; however, whenever a new column is added to the dataset, that column is added to the check. Added columns cannot be missing thereafter. |
Since dataset builds can easily have outliers, we do not use the true standard deviation. Instead, we use the median absolute deviation (MAD) which is a more robust measure of variability.
The MAD is defined as the median of the absolute deviations from the median of the data. For values x_1, ..., x_n
with median X
this means MAD = median(|x_i - X|)
.
The median absolute deviation can be used to approximate standard deviation by multiplying with a constant.
Our calculation is σ = MAD * 1.4826
.
For detailed information see Median Absolute Deviation - Wikipedia ↗.