LLM capacity management for AIP

LLM capacity is a limited resource at the industry level: all providers (Azure, OpenAI, AWS Bedrock, Google Cloud Vertex, and so on) cap the maximum capacity available per account. Palantir AIP is consequently subject to the same constraints that LLM providers impose across the market. The standard units of measure across the industry are tokens per minute (TPM) and requests per minute (RPM).
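
To build intuition for what these units measure, the short sketch below counts the tokens in a prompt using OpenAI's open-source tiktoken library. This is purely illustrative and not part of AIP; exact token counts vary by model and tokenizer, and both prompt and completion tokens count toward TPM.

```python
# Illustrative only: counting tokens with the open-source tiktoken library.
# Token accounting in AIP depends on each model provider's own tokenizer.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-era models
prompt = "Summarize the attached maintenance report in three bullet points."
print(len(encoding.encode(prompt)), "tokens")  # prompt tokens; completion tokens also count toward TPM
```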

Enrollment capacity and rate limits

Palantir sets a maximum capacity for each enrollment, referred to as “enrollment-level rate limits”. This capacity is measured per model in TPM and RPM and applies to every model from every provider enabled on your enrollment, including GPT, Claude, Gemini, Llama, Mixtral, and more. Each model therefore has a separate, independent capacity that is unaffected by the usage of other models.

By default, all customers are on the medium tier, which is large enough to build prototypes and scale to a few use cases, even with hundreds of users and large datasets (for example, millions of documents).

Additionally, AIP offers the option to upgrade the enrollment capacity from the medium tier to the large or XL tier if you require additional capacity. If you are consistently hitting enrollment rate limits that block you from expanding your AIP usage, or if you expect to increase the volume of your pipelines or your total number of users, contact Palantir Support.

Enrollment limits, along with the enrollment tier and total enrollment capacity, are displayed on the AIP rate limits tab in the Resource Management application.

The enrollment tiers, particularly the XL tier, offer enough capacity to build large-scale workflows. These tiers have provided sufficient capacity for hundreds of Palantir customers using LLMs at scale, and we continue to increase these limits.

The table below contains enrollment limits in tokens per minute (TPM) and requests per minute (RPM) for each enrollment tier. For enrollments with both Azure and OpenAI enabled, the limits for Azure / OpenAI models are double those shown below. Additionally, for enrollments geo-restricted to a single region, TPM and RPM in the Large and X-large tiers may be lower than the table indicates.

| Model provider | Model | Small tier | Medium tier | Large tier | X-large tier |
| --- | --- | --- | --- | --- | --- |
| Azure / OpenAI | GPT-4o | 60K TPM / 150 RPM | 800K TPM / 1K RPM | 1M TPM / 1K RPM | 1.5M TPM / 2K RPM |
| Anthropic | Claude 3.5 Sonnet | 50K TPM / 40 RPM | 1M TPM / 200 RPM | 1.5M TPM / 300 RPM | 2M TPM / 400 RPM |
| Google | Gemini 1.5 Pro | 60K TPM / 150 RPM | 2M TPM / 400 RPM | 3M TPM / 700 RPM | 4M TPM / 1K RPM |
| Azure / OpenAI | GPT-4o-mini | 60K TPM / 150 RPM | 800K TPM / 1K RPM | 1M TPM / 1.5K RPM | 1.5M TPM / 2K RPM |
| Anthropic | Claude 3 Haiku | 60K TPM / 250 RPM | 600K TPM / 1K RPM | 1.5M TPM / 1.5K RPM | 2M TPM / 2K RPM |
| Google | Gemini 1.5 Flash | 60K TPM / 150 RPM | 2M TPM / 600 RPM | 3M TPM / 1.2K RPM | 4M TPM / 2K RPM |
| Azure / OpenAI | o1-mini | 300K TPM / 30 RPM | 1M TPM / 1K RPM | 1M TPM / 1.5K RPM | 1M TPM / 2K RPM |
| Azure / OpenAI | o1-preview | 150K TPM / 15 RPM | 600K TPM / 1K RPM | 600K TPM / 1.5K RPM | 600K TPM / 2K RPM |
| Azure | GPT-4 | 60K TPM / 300 RPM | 675K TPM / 750 RPM | 675K TPM / 750 RPM | 675K TPM / 750 RPM |
| Azure | GPT-4 Turbo | 60K TPM / 120 RPM | 375K TPM / 450 RPM | 562.5K TPM / 675 RPM | 750K TPM / 900 RPM |
| Azure | GPT-4 32K | 75K TPM / 45 RPM | 300K TPM / 150 RPM | 450K TPM / 225 RPM | 600K TPM / 300 RPM |
| Azure | GPT-3.5 | 150K TPM / 450 RPM | 1.5M TPM / 900 RPM | 2.25M TPM / 1.35K RPM | 3M TPM / 1.8K RPM |
| Azure | GPT-3.5 16K | 300K TPM / 450 RPM | 750K TPM / 450 RPM | 1.125M TPM / 675 RPM | 1.5M TPM / 900 RPM |
| Azure | GPT-4 Vision | 60K TPM / 30 RPM | 375K TPM / 150 RPM | 562.5K TPM / 225 RPM | 750K TPM / 300 RPM |
| Anthropic | Claude 3 Sonnet | 50K TPM / 100 RPM | 450K TPM / 500 RPM | 675K TPM / 750 RPM | 900K TPM / 1K RPM |
| Azure | Text Embedding Ada-002 | 450K TPM / 3K RPM | 2.1M TPM / 4.5K RPM | 3.15M TPM / 6.75K RPM | 4.2M TPM / 9K RPM |
| Azure | Text Embedding 3 Small | 60K TPM / 400 RPM | 300K TPM / 2K RPM | 450K TPM / 3K RPM | 600K TPM / 6K RPM |
| Azure | Text Embedding 3 Large | 60K TPM / 400 RPM | 1M TPM / 2K RPM | 2M TPM / 3K RPM | 3M TPM / 6K RPM |
| | Document Information Extraction | 1M TPM / 40 RPM | 1M TPM / 200 RPM | | |
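
To make the doubling rule for enrollments with both Azure and OpenAI enabled concrete, here is a minimal sketch that applies it to the medium-tier GPT-4o row above. The function and constant are hypothetical illustrations, not AIP APIs.

```python
# Hypothetical illustration of reading the table above. The limits for
# Azure / OpenAI models double when both providers are enabled.
GPT_4O_MEDIUM = {"tpm": 800_000, "rpm": 1_000}  # medium-tier GPT-4o row

def effective_limit(base: dict, azure: bool, openai: bool) -> dict:
    """Double the Azure / OpenAI model limits when both providers are enabled."""
    factor = 2 if (azure and openai) else 1
    return {unit: value * factor for unit, value in base.items()}

print(effective_limit(GPT_4O_MEDIUM, azure=True, openai=True))
# {'tpm': 1600000, 'rpm': 2000}
```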

Project rate limits

Enrollment administrators can navigate to the AIP rate limits page in the Resource Management application to configure, per model, the maximum percentage of TPM and RPM that all resources within a given Project can consume combined in any given minute.

This gives you the flexibility to maximize LLM utilization for ambitious production use cases in AIP, while limiting or preventing experimental Projects from saturating the entire enrollment capacity.

By default, all Projects operate under a default limit. An administrator can create additional Project limits, define which Projects each limit includes, and set what percentage of enrollment capacity each may use.
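
As a rough illustration of how a percentage-based Project limit translates into absolute per-minute capacity, the sketch below applies a configured percentage to one model's enrollment limits. The function is hypothetical; actual enforcement happens inside AIP.

```python
# Hypothetical arithmetic: a Project's percentage limit applied to one
# model's enrollment-level capacity. Not an AIP API.
def project_capacity(enrollment_tpm: int, enrollment_rpm: int, percent: float) -> tuple[int, int]:
    """Return the TPM and RPM a Project may consume in any given minute."""
    return int(enrollment_tpm * percent / 100), int(enrollment_rpm * percent / 100)

# A Project capped at 30% of a medium-tier GPT-4o enrollment (800K TPM, 1K RPM):
print(project_capacity(800_000, 1_000, percent=30))  # (240000, 300)
```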

Check rate limits for your models on the AIP rate limits page in the Resource Management application.

Prioritizing interactive queries

Generally, AIP prioritizes interactive requests over pipelines issuing batch requests. Interactive queries are any real-time interaction a user has with an LLM, such as AIP Assist, Workshop, Agent Studio, previews in the AIP Logic LLM board, and previews in the Pipeline Builder LLM node. Batch queries are large sets of requests sent without a user expecting an immediate response, for example Transforms pipelines, Pipeline Builder, and Automate (for Logic).

This principle currently guarantees that 20% of capacity at both the enrollment and Project level is always reserved for interactive queries. For a model with 100,000 TPM of capacity, this means at most 80,000 TPM can be used by pipelines in any given minute, while at least 20,000 TPM (and up to the full 100,000 TPM) remains available for interactive queries.
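
The reservation arithmetic can be written out directly. The sketch below mirrors the 100,000 TPM example above; the function is illustrative, not part of AIP.

```python
# Illustrative arithmetic for the 20% interactive reservation described above.
INTERACTIVE_RESERVED_FRACTION = 0.20  # share always held back for interactive queries

def capacity_split(total_tpm: int) -> tuple[int, int]:
    """Return (maximum batch TPM, minimum TPM reserved for interactive queries)."""
    reserved = int(total_tpm * INTERACTIVE_RESERVED_FRACTION)
    return total_tpm - reserved, reserved

print(capacity_split(100_000))  # (80000, 20000); interactive may still use up to 100000
```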

FAQ

What is an example of how Project-level rate limits are expected to be used?

Consider the following example; a configuration sketch summarizing it follows the list:

  • An enrollment has only a single AIP use case in production, so the Project containing that use case is placed under a “Production” limit that can access up to 100% of the enrollment limit.
  • A second use case is in the testing stage. It should be able to run tests without taking over production capacity, so it is added to a “Testing” limit with up to 30% of capacity, and the “Production” limit is reduced to 90% to ensure there is always some capacity available for testing.
  • A second production use case is then added. Unlike the first, which uses GPT-4o, this one uses Claude 3.5 Sonnet; because limits are enforced per model, it can safely be added to the “Production” limit alongside the first production use case.
  • The same enrollment wants a set of users to be able to experiment with LLMs, so the enrollment administrator adds two Projects to an “Experimentation” limit with up to 20% of capacity.
  • The testing Project and the two experimental Projects could technically consume up to 70% of capacity combined, but historical data shows that actual usage typically falls well below this.
  • Lastly, this enrollment wants to let additional users experiment with LLMs. An enrollment admin can set the default limit to 10% of capacity and user folders to 0% of capacity, while granting those users LLM builder permissions in the Control Panel AIP settings.
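
The end state of this example can be summarized as a simple mapping from limit category to the percentage each member Project may use. This is a hypothetical summary, not an AIP configuration format, and the annotations refer to the made-up use cases above.

```python
# Hypothetical summary of the limit categories built up in the example above.
# Percentages are enforced per Project within each category.
limit_categories = {
    "Production":      90,  # both production use cases (GPT-4o and Claude 3.5 Sonnet)
    "Testing":         30,  # the testing-stage use case
    "Experimentation": 20,  # the two experimental Projects
    "Default":         10,  # all remaining Projects
    "User folders":     0,  # individual user folders are blocked from direct LLM usage
}
```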

Why is the percent enforced on each Project in a limit category and not shared across Projects?

  • Multiple Projects and resources can share the same 100% of capacity because, based on historical LLM usage patterns across hundreds of customers over more than a year, most Projects and resources are not making LLM calls at any given moment.
  • If all Projects within a limit category shared a single usage percentage, it would act as a hard cap on usage; based on existing usage patterns, this is unnecessary in 99% of cases. It is very rare for multiple resources to hit maximum capacity in the same minute, and even when that happens, requests retry until they succeed (see the sketch below).
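
“Requests retry until they succeed” typically means exponential backoff on rate-limit errors. The sketch below shows the general pattern with placeholder names (call_llm, RateLimitError); it is not AIP's internal retry logic.

```python
# General retry-with-backoff pattern for rate-limited LLM calls.
# call_llm and RateLimitError are placeholders, not AIP internals.
import random
import time

class RateLimitError(Exception):
    """Raised when the model returns a 429-style rate-limit response."""

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call; raises RateLimitError when throttled."""
    raise RateLimitError

def call_with_backoff(prompt: str, max_retries: int = 6) -> str:
    delay = 1.0
    for _ in range(max_retries):
        try:
            return call_llm(prompt)
        except RateLimitError:
            time.sleep(delay + random.uniform(0, delay))  # jittered exponential backoff
            delay *= 2
    raise RuntimeError("still rate limited after all retries")
```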

Why are there AIP usage limits?

  • First, there is significant variance across providers in TPM, RPM, and regional availability. While AIP leverages the capacity of all providers, Palantir cannot bypass the limitations imposed by the various cloud service providers.
  • On top of that, the LLM capacity Palantir provides to customers meets a higher bar of compliance requirements than the standard offerings of most providers: Palantir guarantees zero data retention (ZDR) and control over the routing of data to specific regions (geo-restriction).
  • Direct OpenAI does not yet support geo-restriction for AIP. This means, for example, that OpenAI cannot guarantee that requests are routed to the EU and stay in the EU; requests might be processed in data centers in America, Asia, Africa, or Europe, which gives OpenAI much more flexibility and a much larger pool of capacity to work with.
    • AIP customers with no geo-restriction can use this large pool of capacity. An upgrade to the XL tier is available for users with higher usage levels.
    • Certain capabilities remain unavailable, such as the batch API. The batch API supports processing billions of tokens within 24 hours but requires storing data for that period, which fails Palantir’s compliance requirements.
  • Other providers (Azure OpenAI, AWS Bedrock, GCP Vertex, and Palantir-hosted Llama and Mixtral models) all support geo-restriction but offer much smaller LLM capacity guarantees for geo-restricted requests.
    • Securing capacity in a specific region is harder and often requires provisioned throughput, a monthly prepaid capacity guarantee that Palantir manages on behalf of its customers. Even on the providers’ side, such capacity is often limited.
    • Certain models are not yet widely available in some regions, but Palantir has early access to them; this is the case with GPT models in the UK, for example.
  • As mentioned above, the medium through XL tiers are sufficient for large-scale production workflows. Contact Palantir Support to change your tier.

Why is it not possible to reserve minimum guaranteed capacity for production use cases?

  • This is currently in development and involves two separate obstacles: the first is solving for maximum capacity to unblock usage, and the second is solving for minimum reserved capacity to guarantee the success of use cases in production.
  • In addition, we are working on enabling a bring-your-own-model (and account) option as an alternative path to scale. Contact Palantir Support for details.

What are the biggest obstacles to solving the capacity problem?

  • Geo-restriction is the strongest cause of capacity issues. If your enrollment is geo-restricted and you are able to remove geo-restrictions from a legal perspective, you should work with your Palantir team to do so.
  • New models often have limited capacity in their early stages; this was true for GPT-4 Vision and o1, and later for Claude 3.5 Sonnet when it first launched.
  • The capacity problem is much harder for large pipelines that run over many millions of items.