Monitoring rules are configured on a per-resource basis, with rules for the following resources:
All monitoring rules contain a configurable field called Alert severity, which sets the severity assigned to an alert when the rule's condition is triggered. Monitoring views can be configured to send out only alerts that meet or exceed a certain severity.
Rule component | Description | Example options |
---|---|---|
Alert severity | Severity of monitoring report condition | Low, Medium, High |
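
The "meet or exceed a certain severity" behavior can be pictured as a filter over ordered severity levels. The following is a minimal sketch, not the product's implementation: the `Severity` enum, the `alerts_to_send` function, and the tuple-based alert format are all hypothetical names chosen for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical ordering of the example options Low, Medium, High.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def alerts_to_send(alerts, minimum):
    """Keep only alerts whose severity meets or exceeds the minimum."""
    return [alert for alert in alerts if alert[1] >= minimum]


# Each alert is a (rule name, severity) pair; this shape is an assumption.
alerts = [
    ("disk space", Severity.LOW),
    ("heartbeat", Severity.HIGH),
    ("cpu", Severity.MEDIUM),
]
filtered = alerts_to_send(alerts, Severity.MEDIUM)
```

With a minimum of `Severity.MEDIUM`, the low-severity disk space alert is dropped and the other two are kept.
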
Alerts when the agent bootstrapper's last heartbeat is older than a set threshold.
Rule component | Description | Example options |
---|---|---|
If value is greater than | Amount of time elapsed since the last heartbeat received from the agent bootstrapper | 10 minutes |
We recommend setting this monitor value to 10 minutes.
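
As an illustration of the "greater than" comparison on elapsed time, here is a small sketch. The function name `heartbeat_is_stale` and its parameters are hypothetical; only the 10-minute default comes from the recommendation above.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(last_heartbeat, now, threshold=timedelta(minutes=10)):
    """Return True when the time since the last heartbeat is greater than the threshold."""
    return (now - last_heartbeat) > threshold


# Fixed timestamps keep the example deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(minutes=4)   # would not alert
old = now - timedelta(minutes=11)     # would alert
```

Because the comparison is strict, a heartbeat that is exactly 10 minutes old does not trigger the condition in this sketch.
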
Alerts when the agent bootstrapper has remained on an old version for longer than a set threshold.
Rule component | Description | Example options |
---|---|---|
If value is greater than | Amount of time elapsed since the agent bootstrapper has been on an old version | 10 days |
We recommend setting this monitor value to 10 days.
Alerts when the agent has remained on an old version for longer than a set threshold.
Rule component | Description | Example options |
---|---|---|
If value is greater than | Amount of time elapsed since the agent has been on an older version | 10 days |
We recommend setting this monitor value to 10 days.
Alerts when the agent CPU utilization exceeds a set threshold.
Rule component | Description | Example options |
---|---|---|
If value is greater than | Percentage of CPU utilization | 80 |
We recommend setting this monitor value to 80 (%).
Alerts when the JVM heap usage exceeds a set threshold.
Rule component | Description | Example options |
---|---|---|
If value is greater than | Percentage of JVM heap used / JVM heap available | 70 |
We recommend setting this monitor value to 70 (%).
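
The heap percentage described above is the ratio of heap used to heap available. A minimal sketch of that arithmetic follows, assuming both values are reported in bytes; the function names are hypothetical.

```python
def heap_usage_percent(used_bytes, available_bytes):
    """Percentage of JVM heap used relative to JVM heap available."""
    return 100.0 * used_bytes / available_bytes


def heap_alert(used_bytes, available_bytes, threshold_percent=70):
    """Return True when heap usage is greater than the threshold."""
    return heap_usage_percent(used_bytes, available_bytes) > threshold_percent
```

For example, 750 MB used out of 1000 MB available is 75%, which exceeds the recommended threshold of 70.
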
Alerts when the available disk space drops below a set threshold.
Rule component | Description | Example options |
---|---|---|
If value is less than | Available disk space | 10 GB |
We recommend setting this monitor value to 10 GB.
Alerts when a certificate in the agent's keystore will expire within a set threshold.
Rule component | Description | Example options |
---|---|---|
If value is less than | Amount of time until a certificate expires | 10 days |
We recommend setting this monitor value to medium severity at less than 30 days and high severity at less than 10 days.
Alerts when a certificate in the agent's truststore will expire within a set threshold.
Rule component | Description | Example options |
---|---|---|
If value is less than | Amount of time until a certificate expires | 10 days |
We recommend setting this monitor value to medium severity at less than 30 days and high severity at less than 10 days.
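
The two-tier recommendation for keystore and truststore certificates amounts to two "less than" rules with different severities, where the more severe matching rule wins. The sketch below is illustrative only; `certificate_severity` and its string return values are assumptions, not part of the product.

```python
from datetime import datetime, timedelta, timezone


def certificate_severity(expires_at, now):
    """Map the time remaining until expiry to a severity, or None if no rule matches."""
    remaining = expires_at - now
    if remaining < timedelta(days=10):
        return "High"
    if remaining < timedelta(days=30):
        return "Medium"
    return None


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
soon = certificate_severity(now + timedelta(days=5), now)
later = certificate_severity(now + timedelta(days=20), now)
```

Checking the stricter threshold first ensures that a certificate five days from expiry is reported once, at high severity, rather than matching the 30-day rule.
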
Alerts when the number of jobs queued on an agent exceeds a set threshold.
Rule component | Description | Example options |
---|---|---|
If value is greater than | The number of jobs in the agent's job queue | 70 |
We recommend setting this monitor value to 70 (jobs).
Alerts when the number of consecutive schedule failures meets or exceeds a set threshold. This does not count schedule runs that result in a cancelled build.
Rule component | Description | Example options |
---|---|---|
If value is greater than or equal to | Threshold of consecutive schedule failures | 1 |
The default behavior for this monitor is to alert with medium severity at one failure and high severity at three failures, though these thresholds are highly dependent on the frequency and stability of the schedules that are included in the monitoring rule's scope.
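
One way to read "consecutive schedule failures" is as the length of the most recent unbroken run of failed schedule runs. The sketch below assumes that cancelled runs are skipped entirely, neither extending nor resetting the streak; the source only states that they are not counted, so this interpretation, along with the outcome strings and the function name, is an assumption.

```python
def consecutive_failures(run_outcomes):
    """Count the trailing streak of failed runs.

    run_outcomes is ordered oldest to newest and contains the hypothetical
    values "succeeded", "failed", or "cancelled".
    """
    count = 0
    for outcome in reversed(run_outcomes):
        if outcome == "cancelled":
            continue  # assumption: cancelled runs neither count nor reset the streak
        if outcome != "failed":
            break  # a successful run ends the streak
        count += 1
    return count


history = ["succeeded", "failed", "cancelled", "failed"]
streak = consecutive_failures(history)
```

Here `streak` is 2, which would meet a threshold of 1 but not a threshold of 3.
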
Alerts when a schedule is running longer than a set threshold.
Rule component | Description | Example options |
---|---|---|
If value is greater than or equal to | The duration of the schedule | 2 hours |
This monitor is typically used on highly critical schedules to quickly indicate whether the schedule will complete in the expected time. Due to the variable nature of schedules, this monitor is often schedule-scoped.
Alerts when the "changelog" job for the object or link's active pipeline is failing. This rule is non-configurable, alerting with medium severity at one changelog job failure and high severity at three changelog job failures.
Alerts when the "merge changes" job for the object or link's active pipeline is failing.
Rule component | Description | Example options |
---|---|---|
If value is greater than or equal to | Threshold of consecutive merge job failures | 3 |
The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.
Alerts when the "sync" job for the object or link's active pipeline is failing.
Rule component | Description | Example options |
---|---|---|
If value is greater than or equal to | Threshold of consecutive sync job failures | 3 |
The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.
Alerts when the "changelog" job for the object or link's replacement pipeline is failing. This rule is non-configurable, alerting with medium severity at one changelog job failure and high severity at three changelog job failures.
Alerts when the "merge changes" job for the object or link's replacement pipeline is failing.
Rule component | Description | Example options |
---|---|---|
If value is greater than or equal to | Threshold of consecutive merge job failures | 3 |
The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.
Alerts when the "sync" job for the object or link's replacement pipeline is failing.
Rule component | Description | Example options |
---|---|---|
If value is greater than or equal to | Threshold of consecutive sync job failures | 3 |
The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.
Alerts when the "scroll" job for the object or link's active or replacement pipeline is failing. Scroll jobs are responsible for streaming data from the backing datasource to the object databases.
Rule component | Description | Example options |
---|---|---|
If value is greater than or equal to | Threshold of consecutive scroll job failures | 3 |
The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures. These values are configurable.
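
The default low, medium, and high tiers used by the job failure monitors can be thought of as a list of "greater than or equal to" thresholds, with the highest satisfied threshold determining the severity. This is a hedged sketch: `DEFAULT_TIERS` and `failure_severity` are hypothetical names, and the real rule configuration may be represented differently.

```python
# Default tiers from the text: (consecutive failures threshold, severity).
DEFAULT_TIERS = [(1, "Low"), (3, "Medium"), (7, "High")]


def failure_severity(consecutive_failures, tiers=DEFAULT_TIERS):
    """Return the severity of the highest threshold that is met, or None."""
    for threshold, severity in sorted(tiers, reverse=True):
        if consecutive_failures >= threshold:
            return severity
    return None


# Custom tiers illustrate that the thresholds are configurable.
custom = [(2, "Medium"), (5, "High")]
```

Passing a different `tiers` list models a reconfigured rule, for example `failure_severity(4, custom)` evaluates against thresholds of two and five failures instead of the defaults.
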
Alerts when a dataset backing the object has a transaction with a sync time that exceeds a set threshold.
Rule component | Description | Example options |
---|---|---|
If value is greater than or equal to | Threshold of time taken to sync a transaction | 1 day |
Alerts when records in an input stream contain format violations. The scroll job ignores these records. This rule is non-configurable, alerting with critical severity when the number of ignored rows is greater than or equal to one.
Alerts if the last checkpoint took more time than the configured threshold to complete.
Rule component | Description | Example options |
---|---|---|
If value is greater than | Threshold of time taken to checkpoint | 10 minutes |
Alerts if the stream has not completed a checkpoint within the configured threshold. The default threshold configuration is 2 minutes. This monitor encompasses streams that are not running as well as streams failing a checkpoint.
Rule component | Description | Example options |
---|---|---|
If value is greater than or equal to | Threshold of time elapsed since last checkpoint | 2 minutes |
Alerts if a stream's lag (total unprocessed upstream records) exceeds the set threshold.
Rule component | Description | Example options |
---|---|---|
If value is greater than | Threshold of unprocessed upstream records | 1000 |
This monitor indicates that streaming transforms are taking too long to run, or there is a problem with the streaming transforms infrastructure.
Alerts if a stream's throughput (records processed per checkpoint) falls below the set threshold.
Rule component | Description | Example options |
---|---|---|
If value is less than | Threshold of records processed per checkpoint | 100 |
This monitor indicates that streaming transforms are taking too long to run, or there is a problem with the streaming transforms infrastructure.
Alerts if the number of records ingested into the raw stream's live view over the selected time window was less than or equal to the configured threshold.
Rule component | Description | Example options |
---|---|---|
If value is less than or equal to | Threshold of ingested records per unit time | 100 |
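
A windowed "less than or equal to" rule like this one counts events inside a trailing time window and compares the count to the threshold. The sketch below assumes record ingestion times are available as a list of timestamps; `ingestion_alert` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def ingestion_alert(record_timestamps, now, window, threshold):
    """Return True when the records ingested in the trailing window are <= threshold."""
    window_start = now - window
    ingested = sum(1 for ts in record_timestamps if window_start < ts <= now)
    return ingested <= threshold


now = datetime(2024, 1, 1, 12, 0)
# Three records fall inside a 5-minute window; the fourth is 40 minutes old.
timestamps = [now - timedelta(minutes=m) for m in (1, 2, 3, 40)]
low_volume = ingestion_alert(timestamps, now, timedelta(minutes=5), threshold=100)
```

With only three records in the last five minutes and a threshold of 100, `low_volume` is `True`. Note that a threshold of 0 in this sketch alerts only when nothing at all was ingested.
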
Alerts when a deployment has not emitted a heartbeat for more than one minute.
Rule component | Description | Example options |
---|---|---|
If value is greater than or equal to | Threshold of time elapsed since last heartbeat | 1 minute |
Alerts if the number of points written by the time series sync over the last 5 or 30 minute window was less than or equal to the configured threshold.
Rule component | Description | Example options |
---|---|---|
If value is less than or equal to | Threshold of points written per unit time | 100 |
Alerts if the number of geotemporal observations sent over the last 5 or 30 minute window was less than or equal to the configured threshold.
Rule component | Description | Example options |
---|---|---|
If value is less than or equal to | Threshold of geotemporal observations sent per unit time | 100 |
Alerts if there has been no new evaluation within the configured threshold.
Rule component | Description | Example options |
---|---|---|
If value is greater than or equal to | Threshold of time elapsed since last automation evaluation | 1 hour |
Alerts if the most recent execution had at least the configured number of failures.
Rule component | Description | Example options |
---|---|---|
If value is greater than | Threshold of number of failures in most recent automation execution | 10 |