LLM capacity is a limited resource at the industry level: all providers (Azure, OpenAI, AWS Bedrock, Google Cloud Vertex, and others) cap the maximum capacity available per account. Palantir AIP consequently operates within the market-level constraints set by LLM providers. The standard units of measure across the industry are tokens per minute (TPM) and requests per minute (RPM).
Palantir sets a maximum capacity for each enrollment, referred to as “enrollment-level rate limits”. This capacity is measured per model in TPM and RPM and covers every model from every provider enabled on your enrollment, including GPT, Claude, Gemini, Llama, Mixtral, and more. Each model therefore has a separate, independent capacity that is unaffected by the usage of other models.
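To make these two units concrete, the sketch below models TPM and RPM as independent per-model budgets over a sliding one-minute window. This is a minimal client-side illustration under assumed names and limit values; AIP enforces these limits server-side, and neither the class nor the model keys here are part of any AIP API.

```python
import time
from collections import deque

class PerModelThrottle:
    """Sliding one-minute window tracking TPM and RPM for a single model.

    Illustrative only: AIP enforces rate limits server-side. This sketch
    just shows that a model's TPM and RPM budgets are checked independently.
    """

    def __init__(self, tpm_limit: int, rpm_limit: int):
        self.tpm_limit = tpm_limit
        self.rpm_limit = rpm_limit
        self.events = deque()  # (timestamp, tokens) pairs from the last 60s

    def try_acquire(self, tokens: int) -> bool:
        """Return True if a request using `tokens` fits both budgets."""
        now = time.monotonic()
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()  # drop events older than one minute
        tokens_used = sum(t for _, t in self.events)
        if len(self.events) >= self.rpm_limit or tokens_used + tokens > self.tpm_limit:
            return False  # either the RPM or the TPM budget is exhausted
        self.events.append((now, tokens))
        return True

# Each model has its own budget: saturating GPT-4o leaves Claude unaffected.
# Limit values are the medium-tier figures from the table below.
throttles = {
    "gpt-4o": PerModelThrottle(tpm_limit=800_000, rpm_limit=1_000),
    "claude-3.5-sonnet": PerModelThrottle(tpm_limit=1_000_000, rpm_limit=200),
}
```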
By default, all customers are on the medium tier, which is large enough to build prototypes and scale to a few use cases, even with hundreds of users and large datasets containing, for example, millions of documents.
Additionally, AIP offers the option to upgrade enrollment capacity from the medium tier to the large or XL tier if you require more. If you consistently hit enrollment rate limits that block you from expanding your AIP usage, or if you expect the volume of your pipelines or your total number of users to grow, contact Palantir Support.
Enrollment limits are now displayed on the AIP rate limits tab in the Resource Management application, along with the enrollment tier.
The enrollment tiers, particularly the XL tier, provide enough capacity to build large-scale workflows; they have supported hundreds of Palantir customers using LLMs at scale, and we continue to raise these limits.
The table below lists enrollment limits in tokens per minute (TPM) and requests per minute (RPM) for each enrollment tier. For enrollments with both Azure and OpenAI enabled, the limits for Azure and OpenAI models are double what is shown below. Additionally, for enrollments geo-restricted to a single region, TPM and RPM in the large and X-large tiers may be lower than the table indicates.
| Model provider | Model | Small tier | Medium tier | Large tier | X-large tier |
|---|---|---|---|---|---|
| Azure / OpenAI | GPT-4o | 60K TPM / 150 RPM | 800K TPM / 1K RPM | 1M TPM / 1K RPM | 1.5M TPM / 2K RPM |
| Anthropic | Claude 3.5 Sonnet | 50K TPM / 40 RPM | 1M TPM / 200 RPM | 1.5M TPM / 300 RPM | 2M TPM / 400 RPM |
| Google | Gemini 1.5 Pro | 60K TPM / 150 RPM | 2M TPM / 400 RPM | 3M TPM / 700 RPM | 4M TPM / 1K RPM |
| Azure / OpenAI | GPT-4o-mini | 60K TPM / 150 RPM | 800K TPM / 1K RPM | 1M TPM / 1.5K RPM | 1.5M TPM / 2K RPM |
| Anthropic | Claude 3 Haiku | 60K TPM / 250 RPM | 600K TPM / 1K RPM | 1.5M TPM / 1.5K RPM | 2M TPM / 2K RPM |
| Google | Gemini 1.5 Flash | 60K TPM / 150 RPM | 2M TPM / 600 RPM | 3M TPM / 1.2K RPM | 4M TPM / 2K RPM |
| Azure / OpenAI | o1-mini | 300K TPM / 30 RPM | 1M TPM / 1K RPM | 1M TPM / 1.5K RPM | 1M TPM / 2K RPM |
| Azure / OpenAI | o1-preview | 150K TPM / 15 RPM | 600K TPM / 1K RPM | 600K TPM / 1.5K RPM | 600K TPM / 2K RPM |
| Azure | GPT-4 | 60K TPM / 300 RPM | 675K TPM / 750 RPM | 675K TPM / 750 RPM | 675K TPM / 750 RPM |
| Azure | GPT-4 Turbo | 60K TPM / 120 RPM | 375K TPM / 450 RPM | 562.5K TPM / 675 RPM | 750K TPM / 900 RPM |
| Azure | GPT-4 32K | 75K TPM / 45 RPM | 300K TPM / 150 RPM | 450K TPM / 225 RPM | 600K TPM / 300 RPM |
| Azure | GPT-3.5 | 150K TPM / 450 RPM | 1.5M TPM / 900 RPM | 2.25M TPM / 1.35K RPM | 3M TPM / 1.8K RPM |
| Azure | GPT-3.5 16K | 300K TPM / 450 RPM | 750K TPM / 450 RPM | 1.125M TPM / 675 RPM | 1.5M TPM / 900 RPM |
| Azure | GPT-4 Vision | 60K TPM / 30 RPM | 375K TPM / 150 RPM | 562.5K TPM / 225 RPM | 750K TPM / 300 RPM |
| Anthropic | Claude 3 Sonnet | 50K TPM / 100 RPM | 450K TPM / 500 RPM | 675K TPM / 750 RPM | 900K TPM / 1K RPM |
| Azure | Text Embedding Ada-002 | 450K TPM / 3K RPM | 2.1M TPM / 4.5K RPM | 3.15M TPM / 6.75K RPM | 4.2M TPM / 9K RPM |
| Azure | Text Embedding 3 Small | 60K TPM / 400 RPM | 300K TPM / 2K RPM | 450K TPM / 3K RPM | 600K TPM / 6K RPM |
| Azure | Text Embedding 3 Large | 60K TPM / 400 RPM | 1M TPM / 2K RPM | 2M TPM / 3K RPM | 3M TPM / 6K RPM |
| | Document Information Extraction | 1M TPM / 40 RPM | 1M TPM / 200 RPM | | |
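As a rough capacity check, for instance before requesting a tier upgrade, you can estimate how long a batch workload would take under a given tier's limits. The helper below is back-of-the-envelope arithmetic on assumed numbers, not an AIP feature; the 0.8 factor anticipates the interactive-query reservation described later on this page.

```python
def batch_runtime_minutes(num_requests: int, tokens_per_request: int,
                          tpm_limit: int, rpm_limit: int,
                          batch_fraction: float = 0.8) -> float:
    """Estimate minutes to finish a batch workload under per-minute limits.

    `batch_fraction` reflects that pipelines can use at most ~80% of
    capacity, since 20% is reserved for interactive queries (see below).
    """
    token_bound = (num_requests * tokens_per_request) / (tpm_limit * batch_fraction)
    request_bound = num_requests / (rpm_limit * batch_fraction)
    return max(token_bound, request_bound)  # the tighter budget dominates

# Example: 1M documents at ~500 tokens each through GPT-4o-mini on the
# medium tier (800K TPM, 1K RPM). Here the RPM budget is the bottleneck.
print(batch_runtime_minutes(1_000_000, 500, 800_000, 1_000))  # 1250.0 minutes
```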
Enrollment administrators can navigate to the AIP rate limits page in the Resource Management application to configure, per model, the maximum percentage of TPM and RPM that all resources within a given Project can use combined in any given minute.
This gives you the flexibility to maximize LLM utilization for ambitious production use cases in AIP, while limiting experimental projects or preventing them from saturating the entire enrollment capacity.
By default, all Projects operate under a single default limit. An admin can create additional Project limits, define which Projects are included in each limit, and set what percentage of enrollment capacity each can use.
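The arithmetic behind a Project limit is a simple percentage of the enrollment capacity. The helper below is a hypothetical illustration of that calculation, using the medium-tier GPT-4o figures from the table above as an example.

```python
def project_caps(enrollment_tpm: int, enrollment_rpm: int,
                 percent: float) -> tuple[int, int]:
    """Per-minute TPM/RPM ceiling shared by all resources in the Projects
    covered by a limit set to `percent` of enrollment capacity."""
    return int(enrollment_tpm * percent / 100), int(enrollment_rpm * percent / 100)

# A limit of 25% on a medium-tier GPT-4o budget (800K TPM, 1K RPM) caps the
# covered Projects at 200K TPM and 250 RPM combined in any given minute.
tpm_cap, rpm_cap = project_caps(800_000, 1_000, 25)
```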
Generally, AIP prioritizes interactive requests over batch requests from pipelines. Interactive queries are any real-time interactions a user has with an LLM, such as AIP Assist, Workshop, Agent Studio, preview in the AIP Logic LLM board, and preview in the Pipeline Builder LLM node. Batch queries are large sets of requests sent without a user expecting an immediate response, such as Transforms pipelines, Pipeline Builder, and Automate (for Logic).
This principle currently guarantees that 20% of capacity at both the enrollment and Project level is always reserved for interactive queries. For a model with a 100,000 TPM capacity, for example, at most 80,000 TPM can be used by pipelines in any given minute, while at least 20,000 TPM (and up to the full 100,000 TPM) remains available for interactive queries.
Consider the following example: