This document provides best practices for setting up health checks to monitor your pipelines. Following these guidelines should give you a robust and effective level of monitoring that ensures data gets in, data gets built, and data gets out.
These best practices do not cover validating the quality, accuracy, or validity of the content within the datasets. Content validation requires more granular, functional knowledge of the pipeline in order to determine the correct validations to perform within it.
These guidelines should also help you avoid these common pitfalls with health check set-ups:
These guidelines rely on an understanding of:
In this document, references to a schedule's inputs, intermediates, and outputs refer to the resolved schedule, which is not the same as the schedule configuration in the Data Lineage application.
A resolved schedule is a mental model for assigning roles to the different datasets involved in a schedule. Some datasets are involved because they can be built by the schedule, meaning they are part of the schedule's dataset selection. Other datasets are involved as required inputs for a build. Different health checks are recommended depending on a dataset's role.
Datasets can have one of the following roles in a schedule:
As a concrete example, imagine that a schedule builds the following datasets:
In this case, you can split the schedule as such:
The easiest way to determine a schedule's inputs, intermediates, and outputs is to open the schedule in Data Lineage. Once there, select the schedule from the sidebar to apply schedule coloring, which helps you understand what the schedule will attempt to build.
"Target" and "will attempt building" datasets are usually built by the schedule. One exception is Data Connection synced datasets, which will only build if "force build" is set on the schedule.
"Excludes" are never built by the schedule.
"Inputs (connecting build only)" are not built by the schedule unless they have another input upstream of them.
A short aside on staleness: in practice, the schedule rarely builds everything in this graph, since some datasets might already be up to date, and re-computing them would just waste resources. However, it's still important to understand that resolving a schedule means figuring out everything that the schedule can touch.
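To make the resolution concrete, the classification can be expressed as a small graph computation. The following is a minimal sketch, not a Foundry API: it assumes a simple adjacency-list representation of the dependency graph and a set of datasets the schedule can build, and all names are hypothetical.

```python
from typing import Dict, List, Set


def resolve_schedule(
    graph: Dict[str, List[str]],   # dataset -> list of its direct upstream dependencies
    buildable: Set[str],           # datasets the schedule can build (its dataset selection)
) -> Dict[str, Set[str]]:
    """Assign a role to every dataset the schedule touches.

    - input: required by a buildable dataset but not buildable itself
    - intermediate: buildable and consumed by another buildable dataset
    - output: buildable and not consumed by any other buildable dataset
    """
    consumed_by_buildable = {
        dep
        for dataset in buildable
        for dep in graph.get(dataset, [])
    }
    return {
        "inputs": consumed_by_buildable - buildable,
        "intermediates": buildable & consumed_by_buildable,
        "outputs": buildable - consumed_by_buildable,
    }


# Example: A -> B, B -> C, B -> D; the schedule's selection is {B, C, D}.
graph = {"B": ["A"], "C": ["B"], "D": ["B"]}
roles = resolve_schedule(graph, buildable={"B", "C", "D"})
# roles == {"inputs": {"A"}, "intermediates": {"B"}, "outputs": {"C", "D"}}
```

The point of the sketch is that roles fall out of the graph structure alone: anything buildable that nothing else in the selection consumes is an output.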
Schedules are defined on "targets", and those are usually the same as "outputs". However, there are cases where targets and outputs can be different:
(1) A dataset can be an "output" without being explicitly defined as a "target":
A schedule that builds output_c will always have to build output_d as well, since the transform between B, C, and D is a multi-output transform. Therefore, a schedule that targets output_c will have both output_c and output_d as outputs, since output_d is a dataset built by the schedule that is not used by any other dataset in the schedule.
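For illustration, a multi-output transform written with Foundry's Python transforms API might look like the sketch below; the dataset paths and the `kind` column are hypothetical. Because a single compute function writes both outputs, output_c and output_d can only ever build together.

```python
from transforms.api import transform, Input, Output


# One compute function writes two outputs: any build that produces
# output_c must produce output_d in the same job, and vice versa.
@transform(
    source=Input("/Project/data/dataset_b"),   # hypothetical path
    out_c=Output("/Project/data/output_c"),    # hypothetical path
    out_d=Output("/Project/data/output_d"),    # hypothetical path
)
def split(source, out_c, out_d):
    df = source.dataframe()
    out_c.write_dataframe(df.filter(df["kind"] == "c"))  # "kind" is a made-up column
    out_d.write_dataframe(df.filter(df["kind"] == "d"))
```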
(2) A dataset can be defined as a "target" and not be an "output":
Even if a dataset is defined as a schedule "target", if it is used by other datasets in the schedule, it is considered an "intermediate" dataset instead of an "output" dataset.
In this example, dataset_c is a schedule "target", but is not considered an "output":
The following step-by-step guide relies on an understanding of Job vs Build Status checks, and Sync vs Data Freshness vs Time Since Last Updated checks. If you are not sure about the difference between these checks, see the health check types here.
Schedules give us a sensible representation of a pipeline. As they are the recommended unit of monitoring, your monitoring will only be as good as the schedules you set up. Take some time before you start setting up your health checks to make sure your schedules adhere to the best practices outlined here.
Install checks on all resolved inputs of your pipelines. If your pipeline fails, it's important to be able to trace the failure back to its root cause. Input staleness and schema breaks happen; installing checks on your inputs will help you detect them. Note: for the time being, only one check of a given type can exist on a particular dataset. If a check you want to install already exists, simply subscribe to it.
Install checks on all resolved outputs of your pipeline (recall that these are built by the schedule, but are not used by any other datasets in your schedule).
Optionally, install checks on important intermediate datasets that are consumed by users directly in another application or via syncs:
The best practices explained above are summarized in this table for quick reference:
| | Build Status | Schema | Build Duration | TSLU | Data Freshness | Sync Freshness | Sync Status |
|---|---|---|---|---|---|---|---|
| Input | ✓ | ✓ (allow additions) | | | | | |
| Intermediate | | | | | | | |
| Output | ✓ | ✓ (exact match) | ✓ | ✓ | | | |
| User-facing datasets* | | ✓ (exact match) | | | ✓ | | |
| Synced datasets* | | ✓ (exact match) | | | ✓ | ✓ | ✓ |
[*] Can be an input, intermediate, or output dataset. User-facing datasets are datasets consumed by users directly in applications such as Contour.
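If you track your monitoring plan as code, the table above can be expressed as a simple lookup. This sketch is purely illustrative (the checks themselves are installed through the application, not through this snippet), and all identifiers are hypothetical:

```python
# Recommended check types per dataset role, mirroring the summary table above.
RECOMMENDED_CHECKS = {
    "input":        {"build_status", "schema_allow_additions"},
    "intermediate": set(),
    "output":       {"build_status", "schema_exact_match",
                     "build_duration", "time_since_last_updated"},
    "user_facing":  {"schema_exact_match", "data_freshness"},
    "synced":       {"schema_exact_match", "data_freshness",
                     "sync_freshness", "sync_status"},
}


def checks_for(role: str, user_facing: bool = False, synced: bool = False) -> set:
    """Return the recommended check types for a dataset.

    A dataset's role is "input", "intermediate", or "output"; user-facing
    and synced datasets pick up additional checks regardless of role.
    """
    checks = set(RECOMMENDED_CHECKS[role])
    if user_facing:
        checks |= RECOMMENDED_CHECKS["user_facing"]
    if synced:
        checks |= RECOMMENDED_CHECKS["synced"]
    return checks


# Example: a resolved output that is also synced to an external system.
print(sorted(checks_for("output", synced=True)))
```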