A Foundry Machine Learning live deployment is a persistent, scalable deployment of a model release that can be interacted with through an API endpoint. Live deployments continuously reserve dedicated compute resources so that the deployment can respond quickly to incoming traffic. As a result, hosting a live deployment uses Foundry compute-seconds while the deployment is active. Note that this documentation covers model-backed deployments only; JavaScript functions-backed deployments are not covered here.
While a live deployment is running, its compute usage is attributed to the Modeling Objective itself and aggregated at the level of the Project that contains the Modeling Objective. For a deep dive into the definition of a compute-second in Foundry and the origin of the formulas used to compute usage, review the Usage Types documentation.
Foundry Machine Learning Live hosts its infrastructure on dedicated “replicas” that run in Foundry’s pod-based computation cluster. Each replica is assigned a set of computational resources, measured in vCPUs and GiB of RAM. Each replica hosts the model locally and uses its computational resources to service incoming requests.
A Foundry Machine Learning Live deployment uses compute-seconds while it is active, regardless of the number of incoming requests it receives. A deployment is considered “active” as soon as it is started and remains active until it is shut down through the graphical interface or the API. If the Modeling Objective that the live deployment is associated with is sent to the Compass trash, the live deployment will also be shut down.
The number of compute-seconds that a live deployment uses depends on three main factors, which appear in the formulas below:

- The computational resources (vCPUs and GiB of RAM, or GPUs) assigned to each replica.
- The number of replicas in the deployment.
- The length of time the deployment is active.
For customers paying for Foundry usage, the default usage rates are as follows:
| vCPU / GPU | Usage Rate |
| --- | --- |
| vCPU | 0.2 |
| T4 GPU | 1.2 |
| A10G GPU | 1.5 |
| V100 GPU | 3 |
These are the rates at which live models use compute based on their compute profile under Foundry's parallel compute framework. If you have an enterprise contract with Palantir, contact your Palantir representative before proceeding with compute usage calculations.
The following formula measures vCPU compute-seconds:

```
live_deployment_vcpu_compute_seconds = max(vCPUs_per_replica, GiB_RAM_per_replica / 7.5) * num_replicas * live_model_vcpu_usage_rate * time_active_in_seconds
```
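To make the formula concrete, here is a minimal Python sketch of the same calculation. The function and parameter names simply mirror the variables above and are not part of any Foundry API; the 7.5 GiB-of-RAM-per-vCPU ratio and the default 0.2 usage rate are taken from the formula and rate table above.

```python
def vcpu_compute_seconds(
    vcpus_per_replica: float,
    gib_ram_per_replica: float,
    num_replicas: int,
    time_active_in_seconds: float,
    vcpu_usage_rate: float = 0.2,  # default vCPU usage rate from the table above
) -> float:
    """Illustrative sketch of the vCPU compute-second formula; not a Foundry API."""
    # A replica is billed on whichever is larger: its vCPU count or its RAM
    # expressed in vCPU-equivalents (7.5 GiB of RAM per vCPU).
    vcpu_equivalents = max(vcpus_per_replica, gib_ram_per_replica / 7.5)
    return vcpu_equivalents * num_replicas * vcpu_usage_rate * time_active_in_seconds
```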
The following formula measures GPU compute-seconds:

```
live_deployment_gpu_compute_seconds = GPUs_per_replica * num_replicas * live_model_gpu_usage_rate * time_active_in_seconds
```
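The GPU formula can be sketched the same way; again the names are illustrative, and the per-GPU usage rates are looked up from the table above.

```python
# Usage rates from the table above, keyed by GPU type (illustrative constant).
GPU_USAGE_RATES = {"T4": 1.2, "A10G": 1.5, "V100": 3.0}

def gpu_compute_seconds(
    gpus_per_replica: float,
    num_replicas: int,
    time_active_in_seconds: float,
    gpu_type: str = "V100",
) -> float:
    """Illustrative sketch of the GPU compute-second formula; not a Foundry API."""
    return (
        gpus_per_replica
        * num_replicas
        * GPU_USAGE_RATES[gpu_type]
        * time_active_in_seconds
    )
```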
All compute-second usage in the platform is available in the Resource Management App.
Compute usage for a deployment is attached to the Foundry Modeling Objective from which it is deployed. Note that multiple live deployments can be active for any given Objective. The live deployments of a Modeling Objective can be found under its Deployments section.
Live deployments use compute-seconds while they are active. There are a few strategies for controlling the overall usage of a deployment.
A live deployment’s resource usage is defined by its profile, which can be set when the live deployment is created. Profiles can also be changed while the deployment is active; the deployment will automatically receive the updated profile with no downtime.
The following example shows the usage for a live deployment with the default replica profile of two replicas that is active for 20 seconds on the “low-cpu-lowest-memory” profile:
```
resource_config:
    num_replicas: 2
    vcpu_per_replica: 0.5 vCPU
    GiB_RAM_per_replica: 1 GiB
    seconds_active: 20 seconds
    live_model_vcpu_usage_rate: 0.2
```

```
compute_seconds = max(vcpu_per_replica, GiB_RAM_per_replica / 7.5) * num_replicas * live_model_vcpu_usage_rate * time_active_in_seconds
                = max(0.5 vCPU, 1 GiB / 7.5) * 2 replicas * 0.2 * 20 sec
                = 0.5 * 2 * 0.2 * 20
                = 4 compute-seconds
```
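Plugging this example configuration into the illustrative vcpu_compute_seconds sketch from above reproduces the same result:

```python
usage = vcpu_compute_seconds(
    vcpus_per_replica=0.5,
    gib_ram_per_replica=1.0,
    num_replicas=2,
    time_active_in_seconds=20,
)
print(usage)  # 4.0: max(0.5, 1 / 7.5) * 2 * 0.2 * 20
```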
The following example shows the usage for a live deployment with the default replica profile of two replicas that is active for 20 seconds on a V100 GPU profile:
```
resource_config:
    num_replicas: 2
    gpu_per_replica: 1 V100 GPU
    seconds_active: 20 seconds
    live_model_gpu_usage_rate: 3
```

```
compute_seconds = gpu_per_replica * num_replicas * live_model_gpu_usage_rate * time_active_in_seconds
                = 1 GPU * 2 replicas * 3 * 20 sec
                = 1 * 2 * 3 * 20
                = 120 compute-seconds
```
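Likewise, the illustrative gpu_compute_seconds sketch from above reproduces this result:

```python
usage = gpu_compute_seconds(
    gpus_per_replica=1,
    num_replicas=2,
    time_active_in_seconds=20,
    gpu_type="V100",
)
print(usage)  # 120.0: 1 * 2 * 3.0 * 20
```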