9 - Using Metrics to Determine Alerting Thresholds

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

📖 Task Introduction

In the previous tasks, we set alerting thresholds based on a basic understanding of what the underlying code does and how we expect the datasets in our build schedule to be used in the organization. At each build execution, Foundry updates a set of pipeline metrics that we can reference to refine those thresholds. We previously set our TSLU check to a floating value of 1 standard deviation above the median (of the most recent 10 executions), but a floating threshold moves with the data, so it invites a gradual increase over time. Suppose instead that we had a reliable statistical basis for fixing the threshold at a specific time value.

In this task, you’ll practice looking at pipeline metrics to update alerting thresholds.
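To see why a floating threshold can creep upward, consider a minimal Python sketch. It does not call any Foundry API; the TSLU values are made up, and the use of the sample standard deviation is an assumption about how "1 standard deviation above the median" might be computed.

```python
from statistics import median, stdev


def floating_threshold(history, window=10):
    """Median plus one sample standard deviation of the last `window` values."""
    recent = history[-window:]
    return median(recent) + stdev(recent)


# Hypothetical TSLU values (minutes) that grow by one minute per execution.
tslu_minutes = [60 + i for i in range(30)]

# The floating threshold as it would be computed after 10, 20, and 30 runs.
thresholds = [floating_threshold(tslu_minutes[:n]) for n in (10, 20, 30)]

# A fixed threshold chosen once, from the first 10 runs only.
fixed_threshold = thresholds[0]

# Runs whose TSLU exceeds the fixed threshold.
runs_over_fixed = [v for v in tslu_minutes if v > fixed_threshold]

print([round(t, 2) for t in thresholds])
print(len(runs_over_fixed), "of", len(tslu_minutes), "runs exceed the fixed threshold")
```

In this sketch the floating threshold rises by about 10 minutes for every 10 executions because it is always recomputed from the latest window, while the fixed threshold stays where it was set and keeps flagging the slower runs.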

🔨 Task Instructions

  1. Open your Flight Alerts Schedule in the Data Lineage application.
  2. Select your schedule in the Manage schedules window to the right.
  3. Near the top of your schedule, locate and select Metrics, which opens the Scheduler metrics UI.
  4. Review the high-level metrics visualizations in the Overview tab.
  5. Expand the ▸ Advanced statistics section to view the percentile values (“p-values”, such as P50 and P75) for some time-based stats, including TSLU and build duration. Once your pipeline has run several times, reference these metrics to help determine your alerting thresholds.
    • For example, we might set our build duration check threshold at the P75 value. Or, taking the P50 value (the median) as a baseline, we might alert at 1 standard deviation above it.
  6. Make any desired changes to the thresholds of your time-based checks based on what you see in your scheduler metrics (there may be nothing you want to change).
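As a rough illustration of how percentile statistics can translate into candidate thresholds, here is a minimal Python sketch. The build durations are hypothetical, nothing here reads from Foundry, and the interpolation method used by `statistics.quantiles` is an assumption that may not match how the Scheduler metrics UI computes P50 and P75.

```python
from statistics import quantiles, stdev

# Hypothetical build durations in minutes for the last 10 executions.
build_minutes = [12, 14, 13, 15, 18, 14, 13, 16, 25, 14]

# Quartiles of the observed durations: [P25, P50, P75].
quartiles = quantiles(build_minutes, n=4, method="inclusive")
p50 = quartiles[1]
p75 = quartiles[2]

# Candidate 1: alert when a build runs longer than the P75 value.
threshold_p75 = p75

# Candidate 2: alert at one sample standard deviation above the median (P50).
threshold_median_plus_sd = p50 + stdev(build_minutes)

# How many past builds would each candidate have flagged?
runs_over_p75 = [d for d in build_minutes if d > threshold_p75]
runs_over_median_plus_sd = [d for d in build_minutes if d > threshold_median_plus_sd]

print(f"P50={p50}, P75={p75}")
print(f"P75 threshold flags {len(runs_over_p75)} runs")
print(f"Median + 1 SD threshold flags {len(runs_over_median_plus_sd)} runs")
```

Checking how many past executions each candidate would have flagged is one way to judge whether a threshold is too noisy or too lenient before you apply it to a check.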