The following guide provides methods and best practices for optimizing Foundry usage. This documentation first covers how usage in Foundry is determined, and then how to identify usage waste and optimize pipelines. The general recommendations may also be of interest to project managers or platform administrators, as they focus on monitoring and optimizing an organization’s usage consumption.
In addition to the best practices listed here, Linter checks the state of Foundry for anti-patterns and provides opinionated recommendations to improve the state of resources. You can evaluate and act on these recommendations to reduce cost, optimize your Ontology, and increase pipeline stability and resilience.
As you think about how to implement these best practices for your workflows, avoid the well-known pitfall of prematurely optimizing a pipeline or workflow, and do not expect a one-size-fits-all strategy for optimization.
We advise following the mental steps below to check the validity of your approach:
If some of these questions remain unanswered, more pre-work may be necessary for a successful optimization effort.
With this in mind, below are good and bad examples of optimization efforts:
Some general practices for optimizing your Foundry usage include:
Foundry usage is made up of three components: Foundry compute, Ontology volume, and Foundry storage.
The majority of accounts are on this three-component model; however, usage criteria may vary for some accounts. Review your terms with your Palantir representative to confirm.
Foundry compute is driven by tools for data integration and analysis. There are three main types of Foundry compute: batch, interactive, and streaming.
Batch compute represents queries or jobs that run in a "batch" capacity, meaning they are triggered to run in the background on a scheduled cadence or on an ad-hoc basis. Batch compute jobs do not consume any compute when they are not running. A few examples of batch compute include all transform jobs, builds of datasets from Contour, Code Workbooks, data health checks, and syncs to Ontology/indexed storage.
Interactive compute represents queries that are evaluated in real time, usually as part of an interactive user session. To provide fast responses to users, interactive compute systems maintain always-on idle compute, which means interactive queries tend to take up more compute than batch evaluation. The main form of interactive compute is Contour: Contour dashboards, analyses, and embedded Contour charts are all examples of interactive compute.
Streaming compute represents always-on processing jobs that continuously receive messages and process them using arbitrary logic. Streaming compute is measured for the length of time that the stream is ready to receive messages; streaming compute has the highest cost compared to batch and interactive compute. Examples of streaming compute include streaming transformations and Pipeline Builder streams.
The amount of compute usage for batch, interactive, and streaming are driven by the following factors:
The second component of Foundry usage is Ontology volume. One of Foundry’s most distinctive capabilities is the Ontology layer. An Ontology is a translation layer between enterprise data and the objects your Organization cares about: a categorization of your data world that allows an Organization to think of its data in more tangible terms, such as an “aircraft” or a “car”, rather than as aggregations of the many rows and columns that describe them. If you are not familiar with the Ontology, you can learn more from the documentation.
Ontology volume is driven by the following factors:
Foundry storage measures the general purpose data stored in the non-Ontology transformation layers in Foundry, sometimes referred to as "cold storage".
Dataset branches and previous transactions (and views) impact how much disk space a single dataset consumes. Foundry comes with a variety of retention rules to help you keep your Foundry instance lean. When files are removed from a view with a `DELETE` transaction, the files are not removed from the underlying storage and thus continue to accrue storage costs. The only way to reduce size is to use Retention to clean up unnecessary transactions; committing a `DELETE` transaction or updating branches does not reduce the storage used.
Having a clear understanding of what makes up Foundry usage and what impacts it can provide you insight into optimization opportunities.
| Foundry application | Usage impact type |
|---|---|
| Code Repositories | Foundry compute |
| Pipeline Builder | Foundry compute |
| Code Workbooks | Foundry compute |
| Contour | Foundry compute |
| Live models | Foundry compute |
| Ontology | Ontology volume |
| Dataset | Foundry storage |
The Resource Management application provides visibility and transparency for an Organization to understand their Foundry usage consumption. The application enables users to see Foundry usage consumption broken out by each Foundry usage type (Foundry compute, Ontology volume, and Foundry storage). A user can look at usage by resource (Project), source (application), and user.
When trying to identify where Foundry usage can be optimized, the first place to check is the Resource Management application. This allows you to see which resources are taking up the most compute and identify where you have bottlenecks. From here, you can leverage Foundry usage optimization best practices to identify ways to potentially reduce usage, but always remember: focus on the bottlenecks.
As mentioned above, compute resources are managed at the Project level by default within Resource Management (RMA); within RMA, Foundry compute, Ontology volume, and Foundry storage metrics are measured per Project. Proper Project set-up is therefore crucial to effectively tracking usage metrics across a data pipeline. A proper set-up enables data engineers or platform administrators to monitor these usage metrics at key phases of the pipeline to identify areas to optimize; an improper set-up will result in a failure to identify resource-heavy and computationally expensive pieces of a data pipeline.
Foundry projects should be used to enable a properly structured data pipeline. The best practices for project set-up along with pipeline stages are covered in-depth in the recommended Project and team structure documentation. Ensuring that projects follow the recommended structure, from importing the raw data from datasources to an actual workflow, will enable users to analyze compute and storage metrics along key phases of a pipeline.
When looking for usage reduction strategies, administrators should consider who on their team should have access to create Projects and resources. Restricting this access to the smallest possible number of individuals who are educated on set-up best practices limits the sprawl of Projects and resources, cutting down on unnecessary storage and compute. Allowing any user to create Projects on the platform will likely result in Projects that work against the recommended structure for tracking usage, leading to unnecessary and expensive pipelines that ultimately drive up usage. Organizations may manage their Project creation access differently based on the number of users on the platform and data access restrictions; developing a process for determining this access, and educating those with access on proper Project structure, is recommended to enable proper usage monitoring for these resources.
A key feature within the Resource Management application that enables you to control your spending in Foundry is resource queues. To constrain the amount of compute power associated with one or more Projects, you can bundle Projects into queues. Each queue is assigned a specific resource limit that defines the maximum number of vCPUs used at once. For example, you can assign XXX vCPUs to a given queue, which will be the maximum number of vCPUs running at any given time for the assigned Projects. This ensures that you have visibility into the amount of usage each Project will consume.
Incremental pipelines are often used to process input datasets that change significantly over time. By avoiding unnecessary computation on all the rows or files of data that have not changed, incremental pipelines enable lower end-to-end latency while minimizing compute costs. The way to execute this is by understanding the difference between a `SNAPSHOT` and an `APPEND` transaction.
The default transaction type is `SNAPSHOT`. Snapshot transactions are treated as a replacement of all data in the dataset. That means when you open a dataset where the latest transaction type is `SNAPSHOT`, the preview will contain only data received in that latest snapshot transaction. The same happens when you read that dataset in a data transform or Contour analysis: you will only see data from that latest transaction.
Snapshot is the default transaction type because it’s the easiest one to use - each time your sync runs, it will download all data returned from the database query, and create a snapshot transaction that effectively replaces all data that was in the dataset before. Files present in a previous transaction are of course still available in the historical versions of the dataset, but the preview and the downstream transformation using the data will now access the new transaction by default.
Snapshot transactions are simple to use correctly, but copying all data every time can be very inefficient. One potential efficiency improvement is to use the `APPEND` transaction type instead.
When a dataset consists of append transactions, its default view is the sum of all transactions. This means you do not have to re-sync files you already uploaded when you use the `APPEND` transaction type; only the new data is synced into Foundry. This results in a reduction of Foundry storage because each transaction contains only the added files, and NOT a snapshot of everything available in the source system.
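To make this concrete, below is a minimal sketch of an incremental Python transform using the `transforms.api` decorators; the dataset paths are hypothetical. When its inputs only gain appended data, the `@incremental` decorator lets the transform read just the unprocessed rows and append its output, rather than recomputing and rewriting the full dataset as a snapshot:

```python
from transforms.api import transform, incremental, Input, Output

@incremental()  # run incrementally when input changes are append-only
@transform(
    out=Output("/Project/datasets/clean_events"),  # hypothetical path
    source=Input("/Project/datasets/raw_events"),  # hypothetical path
)
def compute(source, out):
    # In incremental mode, dataframe() returns only the rows added
    # since the last build instead of the entire dataset.
    new_rows = source.dataframe()
    # The output is written as an APPEND transaction containing
    # only these newly processed rows.
    out.write_dataframe(new_rows)
```

If the transform cannot run incrementally (for example, after a snapshot transaction on an input), it falls back to recomputing the full dataset.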
Another way to optimize Foundry usage is via schedules. Schedules, configured in the Scheduler tool, are used to run builds on a recurring basis to keep data flowing through Foundry consistently.
Schedules should be set up to meet your Organization's requirements, but to optimize Foundry usage it is imperative that schedules are set up efficiently and do not run more than necessary. For example, if you set up a schedule for a dataset to refresh at 8 AM every day, but do not actually need updated data at 8 AM every day, your Organization is wasting Foundry usage. Instead, set your schedule to run only as frequently as you need updated data; for example, every other day at 8 AM. Making this adjustment would halve the amount of Foundry usage.
The two biggest themes to keep in mind when thinking about best practices for optimizing schedules are 1) eliminating duplicate schedules and 2) eliminating unnecessary schedules.
To identify redundant schedules, start by going into the Data Lineage application and changing the node color to `Schedule Count`. If select nodes have more than one schedule associated with them, select the node and view the Manage schedule tool. There, you will be able to view the associated schedules, determine who owns them, and decide whether they can be consolidated.
The best practice is to ensure each dataset in your pipeline has only one scheduled build associated with it. Having a dataset built by two different schedules can lead to queuing and a slowdown of both schedules, as well as wasteful batch compute.
Another best practice to reduce redundant schedules is to avoid full builds and use connecting builds instead. Consider, for example, an Ontology pipeline that includes a raw dataset, a cleaned dataset, data transforms, and ultimately an Ontology. Instead of setting up three schedules, one to run on the raw dataset, a second to run on the cleaned dataset, and a third to run on the transformed dataset, you only need one schedule where the raw dataset is the trigger and the Ontology dataset is the target.
To identify unnecessary schedules, go into the Data Lineage application and color the nodes by `Time since last built`. This allows you to view which data is being updated most frequently and determine whether this is the optimal frequency for your Organization.
The frequency and timing of schedules is a critical factor in optimizing usage. How frequently does your Organization need data updated?
Additionally, it is best not to schedule builds all at the same time; staggering them makes debugging more efficient and uses less compute. When thinking about frequency and timing, orient back to your Organization's requirements and ensure the refresh rate you set meets what your Organization needs without exceeding it.
Lastly, it is important to look at the Advanced options when setting up a schedule. Consider enforcing Abort build on failure to reduce wasted batch compute. You can also lower the number of allowed attempts for failed jobs to three or fewer, compared to the maximum of five attempts. It is also recommended to set the minutes between rebuilds to at least one to three minutes, giving transient glitches that caused the failure time to resolve themselves.
Spark is an open-source, distributed cluster-computing framework for quick, large-scale data processing and analytics. Spark makes it easier and quicker to process large volumes of data by splitting the work across different systems and tackling it in parallel, instead of waiting for everything to be completed linearly.
As an initial disclaimer, and as a general best practice when it comes to optimization, we recommend that you manually tweak Spark configurations only if a particular bottleneck is identified, and not as a general practice on the entire platform. This is because new optimization features are progressively added and enabled on Foundry for transforms which do not have manual overrides. A simple example is the introduction of dynamic allocation on Spark 3. While it was previously very important to manually set the number of executors for a transform, nowadays this number is automatically adjusted at execution time to avoid waste.
Optimizing Spark should only be done by users who are familiar with Spark concepts and have a strong understanding of how it is used within a pipeline.
A first step to optimizing Spark is reviewing and understanding Spark profiles. A “Spark profile” is the configuration that Foundry uses to configure distributed compute resources (such as drivers and executors) with the appropriate amount of CPU cores and memory. Most of the time, we recommend using automatic profiles rather than attempting to tweak manually; however, sometimes you may be able to identify the use of a large driver that is not necessary.
You can use the Spark usage coloring on Data Lineage to identify datasets that may be using higher profiles.
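For instance, once a genuine bottleneck (or an unnecessarily large driver) has been confirmed, a transform's Spark profile can be set explicitly with the `configure` decorator from the Python transforms API. The sketch below assumes a profile named `EXECUTOR_MEMORY_MEDIUM` is available to your repository; the dataset paths are hypothetical:

```python
from transforms.api import configure, transform_df, Input, Output

# Request a specific Spark profile only after identifying a bottleneck;
# otherwise, prefer the automatically assigned defaults.
@configure(profile=["EXECUTOR_MEMORY_MEDIUM"])  # example profile name
@transform_df(
    Output("/Project/datasets/aggregated"),         # hypothetical path
    source=Input("/Project/datasets/large_input"),  # hypothetical path
)
def compute(source):
    # A simple aggregation; the profile controls driver/executor sizing.
    return source.groupBy("region").count()
```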
Below are some core Spark concepts and terminology to understand before starting to optimize.
Beyond Spark profiles, the following best practices provide a starting point when looking to optimize Spark for the purpose of reducing Foundry usage.
- Avoid unnecessary Spark actions (such as `count`, `collect`, `take`). Unlike transformations (such as `filter`, `select`), which are lazily executed, actions put constraints on the computation graph and prevent potential optimizations; see the sketch below.
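As a short illustration of this difference, the PySpark snippet below (with hypothetical input and output paths) chains lazy transformations and triggers computation only once at the end; inserting extra actions such as `count` between steps would force additional, avoidable evaluations of the plan:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical input

# Transformations are lazy: Spark only records the plan here.
active = (
    df.filter(F.col("status") == "active")
      .select("id", "event_ts")
)

# The final write triggers a single execution of the whole
# optimized plan; no intermediate actions were needed.
active.write.parquet("/data/active_events")  # hypothetical output
```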