Spark distributed model training

In distributed training, a model is trained across multiple computing resources working in parallel. Foundry supports distributed model training in Spark environments. Distributed training enables you to:

  • Train models faster and more efficiently: Distributed training spreads work across multiple nodes running concurrently, shortening training time.
  • Scale up training datasets: By splitting training tasks and data across nodes, you can train models on datasets far larger than a single node can hold in memory.

Currently, only XGBoost Spark is directly supported for distributed model training in Foundry. Additional distributed training libraries may be supported in the future.

Follow the steps below to learn how to perform distributed model training in Foundry.

1. Configure your Spark environment

Before using distributed training libraries, you must configure your Spark environment properly using a Spark profile.

Required Spark profile

To perform distributed model training, you must import and apply the KUBERNETES_OPEN_PORTS_ALL profile in your repository, as shown in the example code below:

```python
from transforms.api import transform_df, configure


@configure(profile=[
    "KUBERNETES_OPEN_PORTS_ALL",
])
@transform_df(...)
def compute(...):
    ...
```

Note that applying the KUBERNETES_OPEN_PORTS_ALL profile is mandatory for distributed training.

2. Set up distributed training libraries

Set up XGBoost Spark

XGBoost provides a seamless PySpark integration for distributed training via the SparkXGBClassifier ↗.

An example of a basic setup for XGBoost Spark follows:

```python
from transforms.api import transform, configure
from xgboost.spark import SparkXGBClassifier


@configure(profile=["KUBERNETES_OPEN_PORTS_ALL"])
@transform(...)
def compute(...):
    xgb = SparkXGBClassifier(
        features_col=<your_feature_col_name>,
        # other parameters as needed
    )
    model = xgb.fit(<your_training_dataframe>)
```
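
The model returned by fit() is a standard Spark ML model, so you can score data within the same transform using transform(). The following is a minimal, illustrative sketch; the column names, num_workers value, and dataframe names are assumptions rather than required settings:

```python
from xgboost.spark import SparkXGBClassifier

# Illustrative sketch; "features", "label", and the dataframe names are assumed placeholders.
xgb = SparkXGBClassifier(
    features_col="features",   # assumed vector column of input features
    label_col="label",         # assumed label column
    num_workers=4,             # number of parallel Spark tasks used for distributed training
)
model = xgb.fit(training_df)               # distributed fit across executors
predictions = model.transform(scoring_df)  # appends prediction and probability columns
```

num_workers controls how many Spark tasks participate in training; keep it within the number of task slots available to your Spark job.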

For additional configuration details, refer to the XGBoost Spark Documentation ↗.

Set up GPU support for distributed model training

To leverage GPUs for distributed model training, follow the steps below:

  1. Add your project to a resource queue with GPU support.
  2. Enable GPU executors by adding the EXECUTOR_GPU_ENABLED Spark profile to your transform.
  3. Configure the SparkXGBClassifier device parameter to 'gpu' or 'cuda', depending on your GPU setup.

Example:

```python
from transforms.api import transform_df, configure
from xgboost.spark import SparkXGBClassifier


@configure(profile=[
    "KUBERNETES_OPEN_PORTS_ALL",
    "EXECUTOR_GPU_ENABLED",
])
@transform_df(...)
def compute(...):
    xgb = SparkXGBClassifier(
        ...,
        device='gpu'  # options: 'cpu', 'gpu', 'cuda'
    )
    model = xgb.fit(...)
```

Refer to the SparkXGBClassifier documentation ↗ for more information on GPU configuration.

3. Distributed hyperparameter optimization

Optionally, you can perform hyperparameter optimization on top of distributed training. We recommend using Optuna for hyperparameter optimization in Transforms; a minimal sketch follows the notes below.

  • Optuna integrates well with Spark and distributed training workflows without additional setup.
  • For more details, refer to the Optuna documentation ↗.
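
As a starting point, the following is a minimal sketch of tuning SparkXGBClassifier with Optuna from within a transform. The search space, column names, trial count, and choice of BinaryClassificationEvaluator are illustrative assumptions; adapt them to your own model and data.

```python
import optuna
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from xgboost.spark import SparkXGBClassifier


def tune(train_df, validation_df):
    # Assumes a binary "label" column and a "features" vector column (illustrative placeholders).
    evaluator = BinaryClassificationEvaluator(labelCol="label")  # areaUnderROC by default

    def objective(trial):
        # Sample candidate hyperparameters for this trial.
        xgb = SparkXGBClassifier(
            features_col="features",
            label_col="label",
            max_depth=trial.suggest_int("max_depth", 3, 10),
            learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        )
        model = xgb.fit(train_df)
        # Score the candidate model on held-out data.
        return evaluator.evaluate(model.transform(validation_df))

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)
    return study.best_params
```

Each trial runs a full distributed fit, so the number of trials directly multiplies training cost; start with a small n_trials and widen the search space only as needed.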