Scaling

Compute modules offer automatic horizontal scaling capabilities, allowing you to efficiently manage your deployment's resources. You can configure a range of replicas and set concurrency limits per replica, both of which influence scaling behavior.

Minimum replicas

Non-zero minimum: Set the minimum number of replicas to greater than zero to ensure that at least that many instances of your application will be running at all times, even during periods of inactivity.

Zero minimum: Set the minimum to zero to allow your application to scale down to zero replicas when there are no active requests. However, your application will immediately scale up from zero when a request is received, upon initial deployment, and whenever load is predicted.

Maximum replicas

The maximum sets the highest number of replicas that can be active during horizontal scaling.
Use it to keep resource allocation within desired boundaries, prevent excessive costs, and protect against uncontrolled scaling during traffic spikes.
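
As a rough illustration of how the two bounds interact, the sketch below clamps a desired replica count into the configured range. The function name clamp_replicas and its parameters are hypothetical and are not part of any compute module API; the platform applies these bounds for you.

```python
def clamp_replicas(desired: int, min_replicas: int, max_replicas: int) -> int:
    """Keep a desired replica count within the configured [min, max] range.

    Hypothetical helper for illustration only.
    """
    if min_replicas < 0 or max_replicas < min_replicas:
        raise ValueError("expected 0 <= min_replicas <= max_replicas")
    # Never drop below the minimum, never exceed the maximum.
    return max(min_replicas, min(desired, max_replicas))
```

With a range of 2 to 5 replicas, a desired count of 0 is raised to 2 and a desired count of 10 is capped at 5.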

Concurrency limit

The concurrency limit defines the maximum number of requests a single replica can process simultaneously. It represents the parallel processing capacity of each replica. For example, a concurrency limit of three means each replica can handle up to three requests at the same time. The default setting is one, meaning each replica processes requests sequentially.

If you are using one of the SDKs, concurrency is handled for you. If you are building a custom client, read the limit from the MAX_CONCURRENT_TASKS environment variable.
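
For a custom client, one possible way to honor the limit is to read MAX_CONCURRENT_TASKS at startup and size a worker pool from it. This is a minimal sketch, not the SDKs' implementation: the helper names read_concurrency_limit and make_worker_pool are hypothetical, and falling back to 1 when the variable is unset or invalid is an assumption based on the default concurrency limit of one.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_concurrency_limit(env=os.environ) -> int:
    """Read MAX_CONCURRENT_TASKS from the environment.

    Assumption: fall back to 1 if the variable is unset, not an
    integer, or less than 1.
    """
    raw = env.get("MAX_CONCURRENT_TASKS", "1")
    try:
        limit = int(raw)
    except ValueError:
        return 1
    return max(1, limit)


def make_worker_pool(env=os.environ) -> ThreadPoolExecutor:
    """Create a thread pool that runs at most `limit` tasks at once."""
    return ThreadPoolExecutor(max_workers=read_concurrency_limit(env))
```

Submitting each incoming request to the pool keeps the number of requests processed in parallel at or below the configured limit.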

Autoscaling

Autoscaling adjusts the number of active replicas of your deployment based on the current workload. In addition to the minimum and maximum replica limits, the key parameters are the scale-up load threshold and the scale-down load threshold.

Scale-up load threshold

Load is calculated as current running jobs / (current replicas * concurrency limit). If the load is greater than or equal to 0.75 for 1 minute, the deployment scales up by one replica.
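
The calculation can be sketched as follows. The function names and the idea of passing in a list of load samples covering the last minute are assumptions made for illustration; how the platform actually samples load is not described here.

```python
from typing import Sequence

SCALE_UP_THRESHOLD = 0.75  # load at or above this value counts toward scale-up


def compute_load(running_jobs: int, replicas: int, concurrency_limit: int) -> float:
    """Load = current running jobs / (current replicas * concurrency limit)."""
    capacity = replicas * concurrency_limit
    if capacity <= 0:
        raise ValueError("replicas and concurrency_limit must be positive")
    return running_jobs / capacity


def should_scale_up(load_samples_last_minute: Sequence[float]) -> bool:
    """True if every sample from the last minute is at or above the threshold.

    Assumption: the caller supplies samples spanning the full 1-minute window.
    """
    return len(load_samples_last_minute) > 0 and all(
        sample >= SCALE_UP_THRESHOLD for sample in load_samples_last_minute
    )
```

For example, with 2 replicas and a concurrency limit of 3, 5 running jobs give a load of 5 / 6, which is above 0.75, while 4 running jobs give 4 / 6, which is below it.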

Scale-down load threshold

Load is calculated the same way as for the scale-up load threshold (current running jobs / (current replicas * concurrency limit)). If the load stays below 0.75 for 30 minutes, the deployment scales down by one replica.
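
A scale-down decision might be sketched as below. The function name, the sample-based interface, and the explicit check against the minimum replica count are assumptions for illustration; the platform's internal logic may differ.

```python
from typing import Sequence

SCALE_DOWN_THRESHOLD = 0.75  # load below this value counts toward scale-down


def replicas_after_scale_down(
    load_samples_last_30_minutes: Sequence[float],
    replicas: int,
    min_replicas: int,
) -> int:
    """Return the replica count after applying the scale-down rule once.

    Assumption: the caller supplies samples spanning the full 30-minute window.
    """
    sustained_low = len(load_samples_last_30_minutes) > 0 and all(
        sample < SCALE_DOWN_THRESHOLD for sample in load_samples_last_30_minutes
    )
    # Remove one replica at a time, and never go below the configured minimum.
    if sustained_low and replicas > min_replicas:
        return replicas - 1
    return replicas
```

A single sample at or above 0.75 within the window is enough to keep the current replica count in this sketch.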

Predictive scaling

Compute modules feature predictive scaling, which tracks the historical query load for your deployment and attempts to preemptively scale up to meet anticipated demand. If the prediction is inaccurate, the system will adjust and scale down. Predictive scaling respects your configured maximum number of replicas, so be sure to monitor your deployment's scaling over time and adjust your settings accordingly.