The ability to debug and resolve pipeline problems quickly is a core part of pipeline maintenance work. It ensures production pipelines feeding important organizational workflows remain reliable and meaningful.
This page provides a framework that you can use as the basis of a standard operating procedure (SOP) for responding to health check failure notifications during an on-call rotation as a pipeline maintainer.
This page assumes you are familiar with a variety of Foundry tools and workflows; links are provided in the relevant sections.
It also assumes that your pipeline maintainer team keeps an incident log or similar documentation of recurring pipeline issues. This is a best practice; if no such document currently exists, create one.
Always begin by asking the following three questions, in order:
- Has this problem occurred before, and is a fix documented?
- Can the problem be mitigated right away?
- What category of problem is this?
Read your pipeline’s documentation! The problem may have been solved before, or there may be warnings about what not to do during mitigation. For example, some builds are very expensive and may degrade the performance of your environment during peak usage times. These kinds of details should be well documented for your whole team.
After attempting to mitigate the problem, you will need to dig deeper as a pipeline maintainer to understand and remedy the root cause. Classifying the problem is helpful during debugging because it narrows down the root cause and, most importantly, quickly tells you whether you can fix the issue yourself or need to contact another team.
There are three categories of problems:
- Changes within your own pipeline: code, schedule, or data changes introduced by you or your team.
- Upstream issues: the data feeding your pipeline is broken, late, or changed by another team.
- Platform issues: the underlying infrastructure is degraded or unavailable.
In detail, the steps highlighted above are:
Identification: When going through the steps above, it is important to identify very precisely what is broken. Answer questions like: which dataset, job, or schedule is failing? Which health check fired? When did it last pass, and what has changed since then?
Precise identification lets you communicate effectively with other teams when escalating upstream or platform issues, which reduces resolution time. It also builds your debugging skills in the platform.
Action: Once the problem is classified, act accordingly: fix issues within your own pipeline directly, escalate upstream issues to the team that owns the source data, and raise platform issues with your Palantir representative.
[Optional] Downstream user communications: A step not shown in the diagram above is notifying downstream consumers of the pipeline once an issue has been classified and root-caused. Whether to do so depends on the problem’s impact, scope, and duration, as well as on the pipeline’s use case.
Workarounds: If a fix from another team or from a user will take some time, it may be useful to put a medium-term workaround in place so that the healthy part of your pipeline keeps running for downstream consumers. The right temporary fix depends on the issue and on the needs of your users. Examples include excluding the failing dataset from its schedule, serving the last known-good version of the data, or applying a temporary code fix such as the sketch below.
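For instance, here is a minimal sketch of a temporary code fix in a Python transform, assuming the upstream problem is malformed rows arriving with a null timestamp; the dataset paths and the `event_ts` column are hypothetical:

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/pipeline/clean_events"),      # hypothetical paths
    source=Input("/Project/pipeline/raw_events"),
)
def compute(source):
    # TEMPORARY WORKAROUND -- remove once the upstream fix lands.
    # Drop the malformed rows introduced upstream so that downstream
    # consumers keep receiving the healthy records.
    return source.filter(F.col("event_ts").isNotNull())
```

However you implement it, record the workaround in your incident log so it is easy to find and revert once the real fix ships.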
The most common issues a pipeline maintainer encounters arise as unintended consequences of something changing within a pipeline you monitor. This is also the category where you have the most control: you can usually fix the problem directly without relying on another team.
In more detail, the steps to take are:
Track down as precisely as you can where in your pipeline the issue originates. For example, try to identify the schedule, the dataset, the transaction, or the code change responsible.
Compare a healthy previous run to the current broken state to identify what changed. It can be useful to have a mental checklist of questions. Below is an example set of questions, along with some example tools that could help you find an answer:
If you are not familiar with the tooling in Foundry used to answer some of the above questions, the list below provides examples of the most common patterns to use during your investigation. This list does not cover all possibilities but rather serves as a starting guide:
Is my job/build slower than usual?
The Builds application lets you compare jobs for a given dataset. The progress details toggle in the top-right of a build overview breaks a build’s progress down into queuing time versus compute time.
Schedule metrics, which show historical runs of a schedule along with metrics and graphs for comparing runs.
Are there any changes to the size of my dataset? Is my transform running with more data?
Dataset Preview: the History and Compare tabs of any Foundry dataset give an overview of the dataset’s history and let you compare against previous transactions to see what changed.
Contour gives you a historical view for comparing row counts using the summary board; alternatively, if you have a column representing the date data was added or created, you can build charts comparing the number of rows against that date.
Spark details: Clicking the Spark details button on any job shows information that can help indicate whether there is more data in your pipeline, such as the count of tasks metric.
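To check data volume programmatically rather than through the UI, a minimal sketch inside a Python transform might look like the following; it assumes a hypothetical `ingest_date` column and logs both the total row count and a per-date breakdown so that volume jumps show up directly in the build logs:

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/pipeline/events_checked"),   # hypothetical paths
    source=Input("/Project/pipeline/events"),
)
def compute(source):
    # Log the total row count on every build so that sudden growth is
    # visible in the job logs. Note that count() triggers an extra Spark job.
    print(f"total rows: {source.count()}")

    # Per-date breakdown (assumes an `ingest_date` column): a spike on a
    # single date usually points at the offending upstream delivery.
    source.groupBy("ingest_date").count().orderBy("ingest_date").show(30)

    return source
```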
Has the code changed in my pipeline?
Code changes may also occur in libraries imported by a transform, in languages that support this such as Python or Java. If you don’t see a change in the transform itself, check whether the logic of an imported library function changed, as in the sketch below.
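As an illustration of how this happens, consider the following sketch, in which the transform file itself shows no diff but its output changes because a shared helper was modified (the module, function, and dataset names are all hypothetical):

```python
# shared/cleaning.py -- a hypothetical shared library module
from pyspark.sql import Column, functions as F


def normalize_amount(col: Column) -> Column:
    # If a teammate changes this rounding rule (say, from 2 decimal
    # places to 0), every transform importing the function changes
    # behavior even though the transform files show no diff.
    return F.round(col, 2)


# my_transform.py -- unchanged in source control, yet its output changed
from transforms.api import transform_df, Input, Output
from shared.cleaning import normalize_amount


@transform_df(
    Output("/Project/pipeline/amounts_clean"),    # hypothetical paths
    source=Input("/Project/pipeline/amounts"),
)
def compute(source):
    return source.withColumn("amount", normalize_amount(source["amount"]))
```

When diffing a broken run against a healthy one, include library repositories (and their version bumps) in the comparison, not just the transform repository.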
Has my schedule been altered?
Checking for similar symptoms in other jobs, builds or related platform components can be a useful investigation path if you’re not sure what the problem is based on the symptoms you see.
In particular, you should look for answers to questions such as:
- Are other builds on the same schedule, or in the same project, failing in the same way?
- Are unrelated jobs across the platform unusually slow or failing at the same time?
- Did the other failures start at the same time as yours?
Using the Builds application to filter job history across the platform can help you answer the above questions.
The ability to debug and resolve pipeline problems quickly is a core part of a pipeline maintainer’s work; it keeps the production pipelines feeding important organizational workflows reliable and meaningful. If you have followed the guidelines on this page and are still unable to identify the issue at hand, contact your Palantir representative.