Monitoring rules reference

Monitoring rules are configured on a per-resource basis. This page lists the rules available for each resource type.

All monitoring rules contain a configurable field called Alert severity, which is the severity assigned to an alert when its condition is triggered. Monitoring views can be configured to send out only alerts that meet or exceed a certain severity.

| Rule component | Description | Example options |
| --- | --- | --- |
| Alert severity | Severity of monitoring report condition | Low, Medium, High |
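The "meet or exceed" filtering described above can be illustrated with a minimal sketch. The `Severity` enum, the ordering Low < Medium < High, and the `filter_alerts` helper are assumptions made for illustration; they are not part of the product's API.

```python
from enum import IntEnum
from typing import List, Tuple


class Severity(IntEnum):
    # Assumed ordering for illustration: Low < Medium < High.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def filter_alerts(
    alerts: List[Tuple[str, Severity]], minimum: Severity
) -> List[Tuple[str, Severity]]:
    """Keep only alerts whose severity meets or exceeds the minimum."""
    return [alert for alert in alerts if alert[1] >= minimum]


# Hypothetical alerts, each a (rule name, severity) pair.
alerts = [
    ("Low disk space", Severity.LOW),
    ("High CPU utilization", Severity.HIGH),
    ("Queue size", Severity.MEDIUM),
]

# A view configured with a minimum severity of Medium drops the Low alert.
sent = filter_alerts(alerts, Severity.MEDIUM)
```

Using an integer-backed enum keeps the comparison to a single `>=` check, which mirrors how the "meet or exceed" wording reads.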

Agent rules

Agent last heartbeat time

Alerts when the agent bootstrapper's last heartbeat is older than a set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than | Amount of time elapsed since the last heartbeat received from the agent bootstrapper | 10 minutes |

We recommend setting this monitor value to 10 minutes.

Agent manager version stale time

Alerts when the agent bootstrapper version has not been upgraded for longer than a set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than | Amount of time elapsed since the agent manager has been on an old version | 10 minutes |

We recommend setting this monitor value to 10 days.

Agent version stale time

Alerts when the agent version has not been upgraded for longer than a set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than | Amount of time elapsed since the agent has been on an older version | 10 days |

We recommend setting this monitor value to 10 days.

High CPU utilization

Alerts when the agent CPU utilization exceeds a set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than | Percentage of CPU utilization | 80 |

We recommend setting this monitor value to 80 (%).

JVM heap usage is close to the limit

Alerts when the JVM heap usage exceeds a set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than | Percentage of JVM heap used / JVM heap available | 70 |

We recommend setting this monitor value to 70 (%).

Low disk space

Alerts when the available disk space drops below a set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is less than | Available disk space | 10 GB |

We recommend setting this monitor value to 10 GB.

Time until earliest keystore certificate expires

Alerts when a certificate in the agent's keystore will expire within a set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is less than | Amount of time until a certificate expires | 10 days |

We recommend setting this monitor value to medium severity at less than 30 days and high severity at less than 10 days.

Time until earliest truststore certificate expires

Alerts when a certificate in the agent's truststore will expire within a set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is less than | Amount of time until a certificate expires | 10 days |

We recommend setting this monitor value to medium severity at less than 30 days and high severity at less than 10 days.
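The tiered recommendation for the keystore and truststore certificate rules can be sketched as follows. This is an illustration only: the function name and the string severity labels are hypothetical, and the 30-day and 10-day defaults simply restate the recommendation above. Because both rules use "less than", exactly 30 days remaining does not trigger an alert.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def earliest_certificate_severity(
    expiry_times: Iterable[datetime],
    now: datetime,
    medium_below: timedelta = timedelta(days=30),
    high_below: timedelta = timedelta(days=10),
) -> Optional[str]:
    """Return the severity for the certificate that expires first.

    Hypothetical helper: mirrors two "If value is less than" rules,
    one at medium severity and one at high severity.
    """
    remaining = min(expiry_times) - now  # time until the earliest expiry
    if remaining < high_below:
        return "High"
    if remaining < medium_below:
        return "Medium"
    return None  # no alert


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
certs = [now + timedelta(days=200), now + timedelta(days=20)]
severity = earliest_certificate_severity(certs, now)  # 20 days left: "Medium"
```

Checking the stricter threshold first ensures a certificate that is 5 days from expiry is reported once at high severity rather than at medium.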

Queue size

Alerts when the number of jobs queued on an agent exceeds a set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than | The number of jobs in the agent's job queue | 70 |

We recommend setting this monitor value to 70 (jobs).

Schedule rules

Consecutive schedule failures

Alerts when the number of consecutive schedule failures meets or exceeds a set threshold. This does not count schedule runs that result in a cancelled build.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than or equal to | Threshold of consecutive schedule failures | 1 |

The default behavior for this monitor is to alert with medium severity at one failure and high severity at three failures, though these thresholds are highly dependent on the frequency and stability of the schedules that are included in the monitoring rule's scope.
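As a rough sketch of how such a counter could behave, the example below counts consecutive failures from the most recent run backwards and maps the count to the default severities. The result labels (`"succeeded"`, `"failed"`, `"cancelled"`) and both function names are hypothetical. It also assumes that a cancelled run is skipped entirely, so it neither adds to nor resets the streak; the source only states that cancelled runs are not counted.

```python
from typing import List, Optional


def consecutive_failures(run_results: List[str]) -> int:
    """Count the trailing failure streak; run_results is ordered oldest to newest."""
    count = 0
    for result in reversed(run_results):
        if result == "cancelled":
            continue  # assumption: cancelled runs are ignored, not counted
        if result != "failed":
            break  # a successful run ends the streak
        count += 1
    return count


def schedule_failure_severity(
    failures: int, medium_at: int = 1, high_at: int = 3
) -> Optional[str]:
    """Map a failure count to the default severities (medium at 1, high at 3)."""
    if failures >= high_at:
        return "High"
    if failures >= medium_at:
        return "Medium"
    return None


history = ["succeeded", "failed", "cancelled", "failed"]
print(schedule_failure_severity(consecutive_failures(history)))  # prints Medium
```

For a schedule that runs every few minutes, raising `medium_at` and `high_at` would reduce noise; for a daily schedule the defaults may already be appropriate.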

Schedule duration

Alerts when a schedule is running longer than a set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than or equal to | The duration of the schedule | 2 hours |

This monitor is typically used on highly critical schedules to quickly inform whether or not the schedule will complete in the expected time. Due to the variable nature of schedules, this monitor is often schedule-scoped.

Changelog jobs failing on active pipeline

Alerts when the "changelog" job for the object or link's active pipeline is failing. This rule is non-configurable, alerting with medium severity at one changelog job failure and high severity at three changelog job failures.

Merge changes job failing on active pipeline

Alerts when the "merge changes" job for the object or link's active pipeline is failing.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than or equal to | Threshold of consecutive merge job failures | 3 |

The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.

Sync jobs failing on active pipeline

Alerts when the "sync" job for the object or link's active pipeline is failing.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than or equal to | Threshold of consecutive sync job failures | 3 |

The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.

Changelog jobs failing on replacement pipeline

Alerts when the "changelog" job for the object or link's replacement pipeline is failing. This rule is non-configurable, alerting with medium severity at one changelog job failure and high severity at three changelog job failures.

Merge changes job failing on replacement pipeline

Alerts when the "merge changes" job for the object or link's replacement pipeline is failing.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than or equal to | Threshold of consecutive merge job failures | 3 |

The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.

Sync jobs failing on replacement pipeline

Alerts when the "sync" job for the object or link's replacement pipeline is failing.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than or equal to | Threshold of consecutive sync job failures | 3 |

The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.

Scroll job failing on pipeline

Alerts when the "scroll" job for the object or link's active or replacement pipeline is failing. Scroll jobs are responsible for streaming data from the backing datasource to the object databases.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than or equal to | Threshold of consecutive scroll job failures | 3 |

The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures. These values are configurable.

Sync propagation delay

Alerts when a dataset backing the object has a transaction with a sync time that exceeds a set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than or equal to | Threshold of time taken to sync a transaction | 1 day |

Invalid stream records detected

Alerts when records in an input stream contain format violations. The scroll job ignores these records. This rule is non-configurable, alerting with critical severity when the number of ignored rows is greater than or equal to one.

Streaming dataset rules

Derived stream monitors

Last checkpoint duration

Alerts if the last checkpoint took more time than the configured threshold to complete.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than | Threshold of time taken to checkpoint | 10 minutes |

Liveness: time since last successful checkpoint

Alerts if the stream has not completed a checkpoint within the configured threshold. The default threshold is 2 minutes. This monitor covers streams that are not running as well as streams that are failing to checkpoint.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than or equal to | Threshold of time elapsed since last checkpoint | 2 minutes |
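The liveness condition reduces to a single comparison on elapsed time, which is why it catches both stopped streams and streams whose checkpoints keep failing: in either case the last successful checkpoint keeps getting older. The sketch below is illustrative only; `liveness_alert` is a hypothetical name, and the 2-minute default restates the default threshold mentioned above.

```python
from datetime import datetime, timedelta, timezone


def liveness_alert(
    last_successful_checkpoint: datetime,
    now: datetime,
    threshold: timedelta = timedelta(minutes=2),
) -> bool:
    """Alert when the time since the last successful checkpoint meets or exceeds the threshold."""
    return now - last_successful_checkpoint >= threshold


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
healthy = liveness_alert(now - timedelta(seconds=90), now)  # False: checkpointed 90 s ago
stalled = liveness_alert(now - timedelta(minutes=2), now)   # True: exactly at the threshold
```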

Total lag

Alerts if a stream's lag (total unprocessed upstream records) exceeds the set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than | Threshold of unprocessed upstream records | 1000 |

An alert from this monitor indicates that streaming transforms are taking too long to run, or that there is a problem with the streaming transforms infrastructure.

Total throughput

Alerts if a stream's throughput (records processed per checkpoint) falls below the set threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is less than | Threshold of records processed per checkpoint | 100 |

An alert from this monitor indicates that streaming transforms are taking too long to run, or that there is a problem with the streaming transforms infrastructure.

Ingest stream monitors

Records ingested over last 5 minutes / 30 minutes / 1 hour / 4 hours / 1 day

Alerts if the number of records ingested into the raw stream's live view over the selected time window was less than or equal to the configured threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is less than or equal to | Threshold of ingested records per unit time | 100 |
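The windowed check can be sketched as a sum over recent ingestion events compared against the threshold. The event representation (a list of timestamp and record-count pairs) and the function names are assumptions for illustration, not how the platform stores or evaluates ingestion metrics.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def records_in_window(
    events: List[Tuple[datetime, int]], now: datetime, window: timedelta
) -> int:
    """Sum the record counts of events that fall inside the trailing window."""
    start = now - window
    return sum(count for timestamp, count in events if start < timestamp <= now)


def low_ingest_alert(
    events: List[Tuple[datetime, int]],
    now: datetime,
    window: timedelta,
    threshold: int,
) -> bool:
    """Alert when the windowed total is less than or equal to the threshold."""
    return records_in_window(events, now, window) <= threshold


now = datetime(2024, 1, 1, 12, 0)
events = [
    (now - timedelta(minutes=20), 500),
    (now - timedelta(minutes=4), 30),
    (now - timedelta(minutes=2), 60),
]

# 90 records in the last 5 minutes is at or below 100, so this alerts.
alert_5m = low_ingest_alert(events, now, timedelta(minutes=5), threshold=100)
# 590 records in the last 30 minutes is above 100, so this does not.
alert_30m = low_ingest_alert(events, now, timedelta(minutes=30), threshold=100)
```

The same threshold can therefore alert on a short window and stay quiet on a long one, so the threshold should be chosen with the selected window in mind.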

Live deployment rules

Live deployment heartbeat

Alerts when the deployment has not emitted a heartbeat for more than one minute.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than or equal to | Threshold of time elapsed since last heartbeat | 1 minute |

Time series sync rules

Points written by the time series sync over last 5 or 30 minutes

Alerts if the number of points written by the time series sync over the last 5- or 30-minute window was less than or equal to the configured threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is less than or equal to | Threshold of points written per unit time | 100 |

Geotemporal observation rules

Geotemporal observations sent over last 5 or 30 minutes

Alerts if the number of geotemporal observations sent over the last 5- or 30-minute window was less than or equal to the configured threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is less than or equal to | Threshold of geotemporal observations sent per unit time | 100 |

Automation rules

Automation has no new events

Alerts if there has been no new evaluation within the configured threshold.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than or equal to | Threshold of time elapsed since last automation evaluation | 1 hour |

The latest event exceeded the automation's failure threshold

Alerts if the most recent execution had at least the configured number of failures.

| Rule component | Description | Example options |
| --- | --- | --- |
| If value is greater than | Threshold of number of failures in most recent automation execution | 10 |