Live deployment compute usage

A Foundry Machine Learning live deployment is a persistent, scalable deployment for model releases that can be interacted with through an API endpoint. Live deployments continuously reserve dedicated compute resources so that the deployment can quickly respond to incoming traffic. As a result, hosting a live deployment uses Foundry compute-seconds for as long as the deployment is active. Note that this documentation covers model-backed deployments only; JavaScript functions-backed deployments are not covered here.

While running, Foundry Machine Learning Live compute usage is attributed to the Modeling Objective itself and is aggregated at the level of the Project that contains the Modeling Objective. For a deep dive into the definition of a compute-second in Foundry and the origins of the formulas used to measure usage, review the Usage Types documentation.

Measuring compute seconds

Foundry Machine Learning Live hosts its infrastructure on dedicated “replicas” that run in Foundry’s pod-based computation cluster. Each replica is assigned a set of computational resources, measured in vCPUs and GiB of RAM. Each replica hosts the model locally and uses its computational resources to service incoming requests.

A Foundry Machine Learning Live deployment uses compute-seconds while it is active, regardless of the number of incoming requests it receives. A deployment is considered “active” from the moment it is started until it is shut down via the graphical interface or the API. If the Modeling Objective that the live deployment is associated with is sent to the Compass trash, the live deployment will also be shut down.

The number of compute-seconds that a live deployment uses depends on four main factors:

  • The number of vCPUs per replica
    • For live deployments, vCPUs are measured in millicores, each of which is 1/1000 of a vCPU
  • The GiB of RAM per replica
  • The number of GPUs per replica
  • The number of replicas
    • Each replica in the deployment has an identical allocation of vCPUs, GiB of RAM, and GPUs

When paying for Foundry usage, the default usage rates are the following:

vCPU / GPU    Usage Rate
vCPU          0.2
T4 GPU        1.2
A10G GPU      1.5
V100 GPU      3

These are the rates at which live deployments use compute based on their compute profile under Foundry's parallel compute framework. If you have an enterprise contract with Palantir, contact your Palantir representative before proceeding with compute usage calculations.

The following formula measures vCPU compute-seconds. GiB of RAM are converted into vCPU-equivalents at a ratio of 7.5 GiB of RAM per vCPU, and the deployment is billed on whichever of the two values is larger:

live_deployment_vcpu_compute_seconds = max(vCPUs_per_replica, GiB_RAM_per_replica / 7.5) * num_replicas * live_model_vcpu_usage_rate * time_active_in_seconds

The following formula measures GPU compute-seconds:

live_deployment_gpu_compute_seconds = GPUs_per_replica * num_replicas * live_model_gpu_usage_rate * time_active_in_seconds
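
Both formulas are straightforward to express in code. Below is a minimal Python sketch, not a Foundry API: the function names are hypothetical, and the default rates come from the table above.

# Minimal sketch of the two billing formulas above. Not a Foundry API;
# function names are hypothetical and rates come from the table above.

GIB_RAM_PER_VCPU = 7.5  # RAM-to-vCPU conversion ratio used in the vCPU formula

def vcpu_compute_seconds(vcpus_per_replica, gib_ram_per_replica, num_replicas,
                         seconds_active, vcpu_usage_rate=0.2):
    # Billed on the larger of the vCPU count and the RAM-derived vCPU-equivalent.
    vcpu_equivalent = max(vcpus_per_replica, gib_ram_per_replica / GIB_RAM_PER_VCPU)
    return vcpu_equivalent * num_replicas * vcpu_usage_rate * seconds_active

def gpu_compute_seconds(gpus_per_replica, num_replicas, seconds_active,
                        gpu_usage_rate):
    # GPUs are billed per device at the rate for the GPU type (see table above).
    return gpus_per_replica * num_replicas * gpu_usage_rate * seconds_active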

Investigating usage of a Modeling Objective

All compute-second usage in the platform is available in the Resource Management App.

Compute usage for a deployment is attributed to the Foundry Modeling Objective from which it is deployed. Note that multiple live deployments can be active for any given Objective. The live deployments of a Modeling Objective can be found under its Deployments section. See the screenshot below for an example.

[Screenshot: the Deployments section of a Modeling Objective]

Drivers of increased or decreased usage

Live deployments use compute-seconds while they are active. There are a few strategies for controlling the overall usage of a deployment:

  • Ensure that the deployment is tuned correctly for the request load that you expect. Deployments should be tuned for the peak expected number of simultaneous requests. If a deployment is under-resourced, it will start to return failed responses to requests; however, over-resourcing a deployment uses more compute-seconds than necessary.
    • Palantir recommends that live deployment administrators run stress tests against live deployment endpoints to determine the correct resource configuration before deploying the model into an operationally critical setting.
  • Live deployments run until they are explicitly stopped or canceled. It is important to monitor live deployment usage to ensure that deployments are not erroneously left running when they are not needed; this is especially common with staging deployments (see the sketch after this list).
  • Increasing or decreasing API load on a deployment without changing its profile does not affect its compute usage. A live deployment will service as many requests as its resources allow without changing the number of compute-seconds it uses.
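
To make the cost of a forgotten deployment concrete, here is a sketch of the vCPU formula applied to a hypothetical staging deployment (two replicas of 1 vCPU / 4 GiB RAM) left running for 30 days with no traffic:

# Hypothetical idle staging deployment: 2 replicas of 1 vCPU / 4 GiB RAM,
# left running for 30 days with zero traffic. Rate taken from the table above.
seconds_in_30_days = 30 * 24 * 60 * 60        # 2,592,000 seconds
vcpu_equivalent = max(1.0, 4.0 / 7.5)         # vCPU dominates here (1.0 > ~0.53)
idle_usage = vcpu_equivalent * 2 * 0.2 * seconds_in_30_days
print(idle_usage)                             # 1036800.0 compute-seconds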

Managing usage

A live deployment’s resource usage is defined by its profile. The profile is set when the live deployment is created and can be changed while the deployment is active; deployments automatically receive the updated profile with no downtime.
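
As a rough sketch of how much the profile matters, the comparison below contrasts two hypothetical profiles for the same two-replica deployment over one hour of activity (the profile sizes are assumptions, not Foundry defaults):

# Hypothetical comparison: same deployment, two profiles, one hour of activity.
HOUR = 3600
small = max(0.5, 1 / 7.5) * 2 * 0.2 * HOUR    # 2 replicas of 0.5 vCPU / 1 GiB -> 720.0
large = max(2.0, 8 / 7.5) * 2 * 0.2 * HOUR    # 2 replicas of 2 vCPU / 8 GiB -> 2880.0
print(small, large)                           # the larger profile uses 4x the compute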


Usage examples

Example 1: vCPU compute

Consider a live deployment with the default replica profile of two replicas that is active for 20 seconds on the “low-cpu-lowest-memory” profile:


resource_config:
    num_replicas: 2
    vcpu_per_replica: 0.5 vCPU
    GiB_RAM_per_replica: 1 GiB
seconds_active: 20 seconds
live_model_vcpu_usage_rate: 0.2

compute seconds = max(vcpu_per_replica, GiB_RAM_per_replica / 7.5) * num_replicas * live_model_vcpu_usage_rate * time_active_in_seconds
                = max(0.5 vCPU, 1 GiB / 7.5) * 2 replicas * 0.2 * 20 sec
                = 0.5 * 2 * 0.2 * 20
                = 4 compute-seconds

Example 2: GPU compute

The following example shows the compute usage for a live deployment with the default replica profile of two replicas that is active for 20 seconds with a V100 GPU profile:

resource_config:
    num_replicas: 2
    gpu_per_replica: 1 V100 GPU
seconds_active: 20 seconds
live_model_gpu_usage_rate: 3

compute seconds = gpu_per_replica * num_replicas * live_model_gpu_usage_rate * time_active_in_seconds
                = 1 * 2 replicas * 3 * 20 sec
                = 1 * 2 * 3 * 20
                = 120 compute-seconds
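
As a quick sanity check, both worked examples can be reproduced directly from the formulas above (a Python sketch; the numbers match the calculations shown):

# Reproduce both worked examples using the formulas directly.
example_1 = max(0.5, 1 / 7.5) * 2 * 0.2 * 20  # vCPU example -> 4.0 compute-seconds
example_2 = 1 * 2 * 3 * 20                    # V100 GPU example -> 120 compute-seconds
print(example_1, example_2)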