LLM capacity is a limited resource at the industry level, and all providers (Azure, OpenAI, AWS Bedrock, Google Cloud Vertex, and others) limit the maximum capacity available per account. Palantir AIP consequently operates within the constraints set by LLM providers. The standard units of measure across the industry are tokens per minute (TPM) and requests per minute (RPM).
Palantir sets a maximum capacity for each enrollment, referred to as "enrollment-level rate limits". This capacity is measured per model in TPM and RPM and covers every model from every provider enabled on your enrollment, including GPT, Claude, Gemini, Llama, Mixtral, and more. Each model therefore has a separate, independent capacity that is unaffected by the usage of other models.
By default, all customers are on the medium tier, which is large enough to build prototypes and scale to a few use cases, even with hundreds of users and large datasets (millions of documents, for example).
Additionally, AIP offers the option to upgrade enrollment capacity from the medium tier to the large or XL tier if you require additional capacity. If you consistently hit enrollment rate limits that block you from expanding your AIP usage, or if you expect the volume of your pipelines or total number of users to grow, contact Palantir Support.
Enrollment limits, along with the enrollment tier, are displayed on the AIP rate limits tab in the Resource Management application.
The enrollment tiers, particularly the XL tier, offer enough capacity to build large-scale workflows. These tiers have provided sufficient capacity for hundreds of Palantir customers using LLMs at scale, and we continue to increase these limits.
The table below lists enrollment limits in tokens per minute (TPM) and requests per minute (RPM) for each enrollment tier. For enrollments with both Azure and OpenAI enabled, the limits for Azure / OpenAI models are double what is shown below. Additionally, for enrollments geo-restricted to a single region, TPM and RPM in the Large and X-large tiers may be lower than the table indicates.
| Model provider | Model | Small tier | Medium tier | Large tier | X-large tier |
| --- | --- | --- | --- | --- | --- |
| Anthropic | Claude 3.7 Sonnet | 20K TPM / 40 RPM | 300K TPM / 200 RPM | 400K TPM / 300 RPM | 500K TPM / 400 RPM |
| Anthropic | Claude 3.5 Haiku | 100K TPM / 400 RPM | 1.2M TPM / 1800 RPM | 1.8M TPM / 2700 RPM | 2.4M TPM / 3600 RPM |
| Google | Gemini 1.5 Pro | 60K TPM / 150 RPM | 2M TPM / 400 RPM | 3M TPM / 700 RPM | 4M TPM / 1K RPM |
| Google | Gemini 2.0 Flash | 60K TPM / 150 RPM | 2M TPM / 600 RPM | 3M TPM / 1.2K RPM | 4M TPM / 2K RPM |
| Azure / OpenAI | GPT-4o | 60K TPM / 150 RPM | 800K TPM / 1K RPM | 1M TPM / 1K RPM | 1.5M TPM / 2K RPM |
| Azure / OpenAI | GPT-4o-mini | 60K TPM / 150 RPM | 800K TPM / 1K RPM | 1M TPM / 1.5K RPM | 1.5M TPM / 2K RPM |
| Azure | Text Embedding Ada-002 | 450K TPM / 3K RPM | 2.1M TPM / 4.5K RPM | 3.15M TPM / 6.75K RPM | 4.2M TPM / 9K RPM |
| Azure | Text Embedding 3 Small | 60K TPM / 400 RPM | 300K TPM / 2K RPM | 450K TPM / 3K RPM | 600K TPM / 6K RPM |
| Azure | Text Embedding 3 Large | 60K TPM / 400 RPM | 1M TPM / 2K RPM | 2M TPM / 3K RPM | 3M TPM / 6K RPM |
| Azure / OpenAI | o1-mini | 100K TPM / 10 RPM | 250K TPM / 25 RPM | 400K TPM / 40 RPM | 750K TPM / 75 RPM |
| Azure / OpenAI | o1-preview | 100K TPM / 10 RPM | 250K TPM / 25 RPM | 400K TPM / 40 RPM | 750K TPM / 75 RPM |
| Anthropic | Claude 3.5 Sonnet v2 | 30K TPM / 20 RPM | 300K TPM / 100 RPM | 500K TPM / 200 RPM | 600K TPM / 300 RPM |
| Anthropic | Claude 3.5 Sonnet | 50K TPM / 40 RPM | 1M TPM / 200 RPM | 1.5M TPM / 300 RPM | 2M TPM / 400 RPM |
| Anthropic | Claude 3 Sonnet | 50K TPM / 100 RPM | 450K TPM / 500 RPM | 675K TPM / 750 RPM | 900K TPM / 1K RPM |
| Anthropic | Claude 3 Haiku | 60K TPM / 250 RPM | 600K TPM / 1K RPM | 1.5M TPM / 1.5K RPM | 2M TPM / 2K RPM |
| Google | Gemini 1.5 Flash | 60K TPM / 150 RPM | 2M TPM / 600 RPM | 3M TPM / 1.2K RPM | 4M TPM / 2K RPM |
| Azure | GPT-4 | 60K TPM / 300 RPM | 675K TPM / 750 RPM | 675K TPM / 750 RPM | 675K TPM / 750 RPM |
| Azure | GPT-4 Turbo | 60K TPM / 120 RPM | 375K TPM / 450 RPM | 562.5K TPM / 675 RPM | 750K TPM / 900 RPM |
| Azure | GPT-4 32K | 75K TPM / 45 RPM | 300K TPM / 150 RPM | 450K TPM / 225 RPM | 600K TPM / 300 RPM |
| Azure | GPT-3.5 | 150K TPM / 450 RPM | 1.5M TPM / 900 RPM | 2.25M TPM / 1.35K RPM | 3M TPM / 1.8K RPM |
| Azure | GPT-3.5 16K | 300K TPM / 450 RPM | 750K TPM / 450 RPM | 1.125M TPM / 675 RPM | 1.5M TPM / 900 RPM |
| Azure | GPT-4 Vision | 60K TPM / 30 RPM | 375K TPM / 150 RPM | 562.5K TPM / 225 RPM | 750K TPM / 300 RPM |
| Snowflake | Arctic Embed Medium | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| Palantir | Document Information Extraction | 1M TPM / 40 RPM | 1M TPM / 200 RPM | | |
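Because limits are enforced per model per minute, high-volume callers can smooth bursts by throttling on the client side instead of waiting for the enrollment limit to reject requests. Below is a minimal sketch, assuming the Medium-tier Claude 3.7 Sonnet limits from the table above; the token estimate and the commented-out `send_request` call are hypothetical placeholders, not part of AIP:

```python
import time
from collections import deque


class MinuteRateLimiter:
    """Client-side throttle for a per-model TPM/RPM budget (sliding one-minute window)."""

    def __init__(self, tpm: int, rpm: int):
        self.tpm, self.rpm = tpm, rpm
        self.events = deque()  # (timestamp, tokens) for the last 60 seconds

    def acquire(self, tokens: int) -> None:
        tokens = min(tokens, self.tpm)  # avoid waiting forever on oversized requests
        while True:
            now = time.monotonic()
            # Drop events that fell out of the one-minute window.
            while self.events and now - self.events[0][0] >= 60:
                self.events.popleft()
            used_tokens = sum(t for _, t in self.events)
            if len(self.events) < self.rpm and used_tokens + tokens <= self.tpm:
                self.events.append((now, tokens))
                return
            time.sleep(0.25)  # wait for capacity to free up


# Medium-tier Claude 3.7 Sonnet limits from the table above.
limiter = MinuteRateLimiter(tpm=300_000, rpm=200)


def call_model(prompt: str) -> None:
    est_tokens = len(prompt) // 4 + 512  # rough prompt + completion estimate
    limiter.acquire(est_tokens)
    # send_request(prompt)  # hypothetical call into your model client
```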
Enrollment administrators can navigate to the AIP usage & limits page in the Resource Management application to:

- View usage: View LLM token and request usage of all Palantir-provided models for all Projects and resources in your enrollment.
- Manage rate limits: Configure the maximum percentage of TPM and RPM that all resources within a given Project, combined, can use in any given minute, per model.
The View usage tab provides visibility into LLM token and request usage of all Palantir-provided models for all Projects and resources in your enrollment. Administrators can use this view to better manage LLM capacity and handle rate limits.
Note that this view is not optimized for cost management of LLM usage. Learn how to review LLM cost on AIP-enabled enrollments via the Analysis tab.
If you are hitting rate limits at the enrollment or Project level, consider any of the following actions:

- Contact Palantir Support to upgrade your enrollment to a higher tier.
- Configure Project-level rate limits so that lower-priority Projects cannot saturate the enrollment capacity.
- Spread workloads across multiple models, since each model has a separate, independent capacity.
The Manage rate limits tab gives you the flexibility to maximize LLM utilization for ambitious production use cases in AIP while limiting or disallowing experimental Projects from saturating the entire enrollment capacity. Enrollment administrators can configure the maximum percentage of TPM and RPM that all resources within a given Project, combined, can use in any given minute, per model.
By default, all Projects operate under a single default limit. An administrator can create additional Project limits, defining which Projects each limit includes and what percentage of enrollment capacity they can use.
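The effect of a Project limit is easiest to see as arithmetic against the per-model enrollment capacity. Here is a minimal sketch, assuming the Medium-tier GPT-4o limits from the table above; the Project names and percentages are hypothetical:

```python
# Per-model enrollment capacity (Medium-tier GPT-4o, from the table above).
ENROLLMENT = {"gpt-4o": {"tpm": 800_000, "rpm": 1_000}}

# Hypothetical Project limits, expressed as a percentage of enrollment capacity.
PROJECT_LIMITS = {"production-pipeline": 60, "experiments": 10}


def project_budget(project: str, model: str) -> dict:
    """Effective per-minute budget for one Project against one model."""
    pct = PROJECT_LIMITS[project] / 100
    cap = ENROLLMENT[model]
    return {"tpm": int(cap["tpm"] * pct), "rpm": int(cap["rpm"] * pct)}


print(project_budget("experiments", "gpt-4o"))
# {'tpm': 80000, 'rpm': 100} -- at most 80K tokens and 100 requests per minute
```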
Use the Analysis page to view the cost of LLM usage on your AIP-enabled enrollment. From the Analysis page, select Filter by source: All LLMs and Group by source. This will generate a chart of daily LLM cost, segmented by model.
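The resulting chart is effectively a day-by-model aggregation of cost. Below is a minimal sketch of the equivalent computation, assuming a hypothetical export of usage records with `date`, `model`, and `cost_usd` columns; the figures are illustrative only:

```python
import pandas as pd

# Hypothetical usage export; the real data comes from the Analysis page.
usage = pd.DataFrame([
    {"date": "2024-05-01", "model": "GPT-4o", "cost_usd": 42.10},
    {"date": "2024-05-01", "model": "Claude 3.5 Sonnet", "cost_usd": 17.55},
    {"date": "2024-05-02", "model": "GPT-4o", "cost_usd": 38.90},
])

# Daily LLM cost, segmented by model -- the same shape as the generated chart.
daily_cost = usage.groupby(["date", "model"])["cost_usd"].sum().unstack(fill_value=0)
print(daily_cost)
```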
Generally, AIP prioritizes interactive requests over pipelines with batch requests. Interactive queries are defined as any real-time interaction a user has with an LLM, such as AIP Assist, Workshop, Agent Studio, previews in the AIP Logic LLM board, and previews in the Pipeline Builder LLM node. Batch queries are defined as a large set of requests sent without a user expecting an immediate response, for example, Transforms pipelines, Pipeline Builder, and Automate (for Logic).
This principle currently guarantees that 20% of capacity at both the enrollment and Project level is always reserved for interactive queries. This means that for a 100,000 TPM capacity for a given model, at most 80,000 TPM can be used for pipelines in any given minute, while at least 20,000 TPM (and up to the full 100,000 TPM) remains available for interactive queries.
Consider the following example:
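For a model with a 100,000 TPM capacity, the reservation arithmetic works out as in the sketch below (the function and constant names are illustrative, not part of AIP):

```python
# Interactive-priority reservation: at least 20% of a model's capacity
# is always held back for interactive queries.
INTERACTIVE_RESERVE = 0.20


def capacity_split(model_tpm: int) -> dict:
    batch_ceiling = int(model_tpm * (1 - INTERACTIVE_RESERVE))
    return {
        "batch_max_tpm": batch_ceiling,                    # pipelines can never exceed this
        "interactive_min_tpm": model_tpm - batch_ceiling,  # always reserved
        "interactive_max_tpm": model_tpm,                  # interactive may use everything
    }


print(capacity_split(100_000))
# {'batch_max_tpm': 80000, 'interactive_min_tpm': 20000, 'interactive_max_tpm': 100000}
```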