Monitoring at scale

Monitoring at scale introduces new capabilities that make monitoring Foundry resources less time-intensive.

If you are already using check groups, think of this as an additional option for monitoring your resources. It will not replace any workflows or check groups you have already set up.

Terms and definitions

  • Metric: A measurement (or log) emitted by a resource. Monitors are created on top of these metrics to define a user’s performance standards for a given resource.
  • Resource: A “thing” in Foundry that can be monitored, including datasets, agents, schedules, objects, and link types.
  • Scope: A scope is the boundary around the set of resources on which your thresholds are set. A resource can be monitored on different scope types:
    • Single: The monitor is only applied to that specific resource.
    • Project: The monitor is applied to all resources of the specified type in one or more Projects.
  • Monitoring rule: A threshold or set of thresholds placed on the metrics of a resource within a given scope. A monitoring rule contains:
    • Resource type
    • Metric threshold tolerances
    • Severity level assignment
  • Monitoring view: A collection of monitoring rules that a group of subscribers care about.
  • Subscriber: A user subscribed to a monitoring view.
  • Alerts: Notifications that are assigned a low, medium, or high severity and are sent to subscribers.
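
To make the relationships between these terms concrete, the sketch below models them as plain Python data classes. This is only an illustrative mental model: the class names, field names, and example identifiers (such as `consecutive_failures` and `project-x`) are assumptions and do not correspond to any Foundry API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class ScopeType(Enum):
    SINGLE = "single"    # one specific resource
    PROJECT = "project"  # all resources of a type in one or more Projects


@dataclass(frozen=True)
class Scope:
    scope_type: ScopeType
    # SINGLE: one resource identifier. PROJECT: one or more Project identifiers.
    targets: tuple


@dataclass(frozen=True)
class MonitoringRule:
    resource_type: str   # for example "agent" or "schedule"
    scope: Scope
    metric: str          # hypothetical metric name
    threshold: float     # metric threshold tolerance
    severity: Severity   # severity level assignment


@dataclass
class MonitoringView:
    name: str
    rules: list = field(default_factory=list)
    subscribers: set = field(default_factory=set)


# A view with one Project-scoped rule and one subscriber.
view = MonitoringView(name="platform-oncall")
view.rules.append(
    MonitoringRule(
        resource_type="schedule",
        scope=Scope(ScopeType.PROJECT, ("project-x", "project-y")),
        metric="consecutive_failures",
        threshold=3,
        severity=Severity.HIGH,
    )
)
view.subscribers.add("user-a")
```

In this model, a monitoring view is simply a container that ties a set of rules to the people who care about them.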

Start monitoring resources

You can monitor resources in two ways:

  • Upgrade an existing check group to a monitoring view
  • Create a new monitoring view

Upgrade an existing check group to a monitoring view

To upgrade an existing check group, open your check group in the Data Health application. In the top banner, select Upgrade to Monitoring View.

You can create a new monitoring view or move all the checks to an existing monitoring view.

  • Monitoring views are filesystem resources. If you are creating a new monitoring view, be sure to store it in a Project accessible to potential subscribers.
  • After upgrading your check group, checks will continue to be supported exactly as they are now. There are no changes to email digest, alerting, subscriptions, or any other workflow related to health checks.
  • Each check group can be linked to only one monitoring view, and each monitoring view to only one check group. You can therefore upgrade a check group into an existing monitoring view only if that view is not already linked to another check group; otherwise, create a new monitoring view.

Create a new monitoring view

To create a new monitoring view, go to the Monitoring View tab in the top right corner of the Data Health app and create a new monitoring view.

Create monitoring rules

To create a monitoring rule, navigate to the Manage monitors tab. First, select the resource type you want to monitor. Depending on the resource type, you can either monitor one specific resource (single scope) or monitor all resources of that type across one or more Projects (Project scope).

You must have Viewer permission on the resources to monitor them. To receive alerts triggered by monitoring rules, you must have Viewer permission on the resources and the monitoring view.

Configure monitors

Monitors are set on the metrics a resource emits. As you set up your monitors, we suggest certain configurations based on Foundry’s standards for health. However, you can change the values or choose to only monitor certain metrics. You can also set the severity of the alert that is raised when a monitor fails. Currently there are three severity levels: low, medium, and high.

Edit monitors

You can edit your monitors by selecting from the list of monitors and choosing Edit on the side panel that appears.

Subscribe to alerts

To subscribe to alerts, navigate to the Manage subscriptions tab, where all subscribed users are listed. You can add users and user groups, and configure their alerts based on severity. When a monitoring rule triggers an alert, users subscribed to the monitoring view containing that rule are notified via email and Foundry notifications. Note that you must have Viewer permission on both the resources and the monitoring view to receive alerts.
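
The delivery conditions described above can be sketched as a small filter. This is a simplified illustration, not Foundry's implementation: the `Subscription` class, the `alert_recipients` function, and the representation of Viewer permission as `(user, item)` pairs are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    user: str
    severities: frozenset  # severities this subscriber chose to receive


def alert_recipients(alert_severity, resource, view, subscriptions, viewer_grants):
    """Return subscribers who should be notified about an alert.

    viewer_grants is a set of (user, item) pairs meaning the user has
    Viewer permission on that item (hypothetical representation).
    """
    return sorted(
        s.user
        for s in subscriptions
        if alert_severity in s.severities
        and (s.user, resource) in viewer_grants  # Viewer on the resource
        and (s.user, view) in viewer_grants      # Viewer on the monitoring view
    )


subs = [
    Subscription("alice", frozenset({"medium", "high"})),
    Subscription("bob", frozenset({"high"})),
    Subscription("carol", frozenset({"low", "medium", "high"})),
]
grants = {
    ("alice", "view-1"), ("alice", "agent-1"),
    ("bob", "view-1"),  # bob cannot view agent-1
    ("carol", "view-1"), ("carol", "agent-1"),
}

high_recipients = alert_recipients("high", "agent-1", "view-1", subs, grants)
low_recipients = alert_recipients("low", "agent-1", "view-1", subs, grants)
```

Here `bob` is subscribed to high-severity alerts but is skipped because he lacks Viewer permission on the resource.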

Integrate with PagerDuty

Monitoring views can trigger and resolve alerts in PagerDuty that correspond to the alerts produced within Foundry. This integration uses the PagerDuty Events API V2 and does not require a service user, emails, or custom allowlisting or egress configuration on most stacks. A single integration maps all alerts of a given severity within a monitoring view to an Events API V2 integration defined within a PagerDuty service. Note that multiple integrations defined within a monitoring view can map to the same PagerDuty integration key.
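
For orientation, the sketch below shows the general shape of trigger and resolve events in the PagerDuty Events API V2, which accepts JSON posted to `https://events.pagerduty.com/v2/enqueue`. The fields `routing_key` (the integration key), `event_action`, `dedup_key`, and `payload` are part of that API. Everything else is an assumption for illustration: the source does not describe the exact events Foundry sends, and the `ASSUMED_SEVERITY_MAP`, the `build_event` helper, and the example values are hypothetical. The code only builds the events and does not send them.

```python
import json

# Assumed mapping: the source does not state how Foundry severities translate
# to the PagerDuty `severity` field (allowed: critical, error, warning, info).
ASSUMED_SEVERITY_MAP = {"high": "critical", "medium": "warning", "low": "info"}


def build_event(integration_key, action, alert_id, summary="", source="", severity="high"):
    """Build an Events API V2 body for a trigger or resolve action."""
    if action not in ("trigger", "resolve"):
        raise ValueError(f"unsupported action: {action}")
    event = {
        "routing_key": integration_key,  # the Integration Key of the service
        "event_action": action,
        "dedup_key": alert_id,           # ties a later resolve to its trigger
    }
    if action == "trigger":
        # A payload with summary, source, and severity is required for triggers.
        event["payload"] = {
            "summary": summary,
            "source": source,
            "severity": ASSUMED_SEVERITY_MAP[severity],
        }
    return event


trigger = build_event(
    "HYPOTHETICAL_INTEGRATION_KEY",
    "trigger",
    "alert-123",
    summary="Agent offline",
    source="monitoring-view",
    severity="high",
)
resolve = build_event("HYPOTHETICAL_INTEGRATION_KEY", "resolve", "alert-123")
print(json.dumps(trigger, indent=2))
```

Reusing the same `dedup_key` for the resolve event is what lets PagerDuty close the incident opened by the matching trigger.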

Create an Events V2 API integration for your PagerDuty service

Configure a PagerDuty service with your desired escalation policy, urgency settings, and support hours. On the Integrations tab for the service, add a new integration. Select Events API V2 as the integration type and click Add. (This should appear in the Most popular integrations section.) Once the integration is added, clicking on the gear symbol will show its details, including the Integration Key that you'll need in the next step.

Create a new PagerDuty integration for your monitoring view

Navigate to the Manage subscriptions tab for your monitoring view. In the PagerDuty Notifications section, click the plus sign (+) to create a new PagerDuty integration. You will need to specify a name for the integration, the integration key from the previous step, and the severity level. Repeat as needed for each applicable severity level.

Enable PagerDuty for health checks

By default, the monitoring view will produce PagerDuty alerts for monitoring rule alerts and for legacy health checks belonging to the check group that was upgraded or linked to the monitoring view. However, monitoring views created prior to the v1.860.0 release (February 2024) will not produce PagerDuty alerts for health checks by default; this must be enabled manually. To enable this feature, select the Enable PagerDuty for health checks checkbox. Note that moderate severity health checks will use the MEDIUM severity integrations, and critical severity health checks will use the HIGH severity integrations.
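
The severity routing for health checks can be summarized in a short sketch. Only the two mappings (moderate to MEDIUM, critical to HIGH) come from the text above; the function name, the tuple layout for integrations, and the example keys are hypothetical.

```python
# Mapping stated in the documentation: health check severity -> integration severity.
HEALTH_CHECK_TO_INTEGRATION_SEVERITY = {"moderate": "MEDIUM", "critical": "HIGH"}


def integration_keys_for_health_check(check_severity, integrations, enabled=True):
    """Return the integration keys a failing health check would be routed to.

    integrations is a list of (name, integration_key, severity) tuples
    (hypothetical layout). When "Enable PagerDuty for health checks" is off,
    nothing is routed.
    """
    if not enabled:
        return []
    target = HEALTH_CHECK_TO_INTEGRATION_SEVERITY.get(check_severity)
    return [key for _name, key, severity in integrations if severity == target]


integrations = [
    ("oncall-high", "KEY_A", "HIGH"),
    ("oncall-medium", "KEY_B", "MEDIUM"),
    ("backup-high", "KEY_A", "HIGH"),  # several integrations may share one key
]

critical_keys = integration_keys_for_health_check("critical", integrations)
moderate_keys = integration_keys_for_health_check("moderate", integrations)
disabled_keys = integration_keys_for_health_check("critical", integrations, enabled=False)
```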

FAQ

What resources can be monitored?

You can monitor the following:

Resource type            Supported scope
Agent                    Single, Project
Object type              Single
Link type                Single
Schedule                 Single, Project
Streaming datasets       Single, Project
Live deployments         Project
Dataset (coming soon)    Project

Do all health checks now exist as monitoring rules?

Not all health checks exist as monitoring rules, but the most important health checks have analogous monitoring rules. We recommend using a combination of monitoring rules and health checks in a linked check group. To summarize coverage from monitoring views and health checks:

  1. Resources that can only be monitored with monitoring views: Data connection agents, objects and links in Object Storage V2 (OSv2), streaming datasets, and live deployments of models
  2. Dataset-level checks that only exist as health checks: Content, freshness, and schema checks; data expectations; OSv1 (Phonograph) and foundry-sync checks
  3. Monitoring rules that replace functionality from health checks: Consecutive schedule failures (replacing schedule status checks) and Schedule duration monitors

For the most comprehensive coverage, we suggest linking your monitoring view to a check group that consists of health checks not currently available in monitoring views.

Why use monitors over health checks?

Monitors cover an entire scope rather than a single resource. This means that when an additional resource is added to that scope, it is automatically covered by the rule. For example, a monitoring rule that is set up to monitor all agents in a Project will also monitor any further agents added into that Project at a later time.
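
The following sketch illustrates why a Project-scoped rule picks up new resources without any change to the rule itself. The dictionary layout, the `covered_resources` function, and the identifiers are assumptions for illustration only.

```python
def covered_resources(rule, inventory):
    """Return the IDs of resources in the inventory that the rule applies to."""
    if rule["scope_type"] == "single":
        # A single-scope rule only ever matches its one target resource.
        return [r["id"] for r in inventory if r["id"] == rule["target"]]
    # A Project-scope rule matches every resource of the type in its Projects.
    return [
        r["id"]
        for r in inventory
        if r["type"] == rule["resource_type"] and r["project"] in rule["projects"]
    ]


project_rule = {"scope_type": "project", "resource_type": "agent", "projects": {"project-x"}}
single_rule = {"scope_type": "single", "resource_type": "agent", "target": "agent-1"}

inventory = [{"id": "agent-1", "type": "agent", "project": "project-x"}]
before = covered_resources(project_rule, inventory)

# A new agent is added to the Project later; the rule is not edited.
inventory.append({"id": "agent-2", "type": "agent", "project": "project-x"})
after = covered_resources(project_rule, inventory)
single_after = covered_resources(single_rule, inventory)
```

The Project-scoped rule covers `agent-2` as soon as it appears, while the single-scope rule still covers only `agent-1`.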

When should I create a new monitoring view instead of adding new rules to an existing one?

A good practice is to think of a single monitoring view the same way you would think of a check group. One monitoring view should relate to a set of users who care about the monitors in that view. If a specific set of users [a, b, c] cares about specific Projects [x, y, z], create a single monitoring view with all the resources in those Projects. If a specific set of users only cares about monitoring agents, create a single monitoring view to monitor all agents in all Projects.

What permissions are required for a monitoring view?

Since a monitoring view is a filesystem resource, a user will need permission to the Project or folder in which the view is saved. To receive alerts or set up monitoring rules on a resource, the user will need access to the Project resources they wish to monitor. Even if a user with all necessary permissions subscribes a user or group to a monitoring view, those new subscribers will NOT receive alerts on any resources if they do not have explicit access permissions to that monitoring view.