Perform a GridSearch

Sunsetted functionality

The documentation below describes the foundry_ml library, which is no longer recommended for use in the platform. Instead, use the palantir_models library. You can also learn how to migrate a model from the foundry_ml to the palantir_models framework through an example.

The foundry_ml library will be removed on October 31, 2025, coinciding with the planned deprecation of Python 3.9.

When developing models, you may want to experiment with hyperparameters until you find an effective combination of values. Within Foundry, you can use common libraries or custom code to perform hyperparameter optimization as part of a training job, then save one or more "winning" models along with their metrics and metadata.

This example shows how to use scikit-learn's GridSearchCV via Python transforms within Code Repositories. GridSearchCV wraps a scikit-learn-compatible estimator in a "fittable" class that performs cross-validation across a parameter grid and exposes the best model.
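For orientation, here is a minimal standalone use of GridSearchCV outside Foundry. The data and grid values are illustrative, not taken from the tutorial dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative data standing in for the housing features
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)

param_grid = {'n_estimators': [3, 10, 30], 'max_features': [1, 2, 3]}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
)
grid_search.fit(X, y)

print(grid_search.best_params_)     # winning hyperparameter combination
print(grid_search.best_estimator_)  # model refit on all data with those parameters
```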

The implementation shown performs the grid search on a single, large driver. Refer to the Spark profiles documentation to see how to enable profiles for a repository. It is also possible to leverage Spark to implement a distributed grid search (for example, using a grouped pandas UDF and a training function); this is not part of the main example, but a sketch of the approach follows.
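The sketch below is not part of the original example; it distributes one cross-validated fit per hyperparameter combination using Spark's grouped-map pandas API (`applyInPandas`). The dataset, grid values, and helper names are illustrative assumptions:

```python
from itertools import product

import pandas as pd
from pyspark.sql import SparkSession
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

spark = SparkSession.builder.getOrCreate()  # inside a transform, use the transform context's Spark session

# Illustrative training data; in practice this would come from the input dataset
X, y = make_regression(n_samples=500, n_features=3, noise=0.1, random_state=0)

# One row per hyperparameter combination
grid = pd.DataFrame(
    [{'n_estimators': n, 'max_features': f}
     for n, f in product([3, 10, 30], [2, 3])]
)
grid_df = spark.createDataFrame(grid)

# Broadcast the (assumed small) training matrix to every executor
X_bc = spark.sparkContext.broadcast(X)
y_bc = spark.sparkContext.broadcast(y)


def fit_one_combination(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group contains exactly one parameter combination
    params = pdf.iloc[0]
    model = RandomForestRegressor(
        n_estimators=int(params['n_estimators']),
        max_features=int(params['max_features']),
    )
    scores = cross_val_score(model, X_bc.value, y_bc.value,
                             cv=5, scoring='neg_mean_squared_error')
    return pdf.assign(mean_test_score=scores.mean())


# Fit every combination in parallel, one Spark task per group
results = (
    grid_df
    .groupBy('n_estimators', 'max_features')
    .applyInPandas(
        fit_one_combination,
        schema='n_estimators long, max_features long, mean_test_score double',
    )
    .toPandas()
)

# neg_mean_squared_error is negative, so the maximum is the best score
best_params = results.loc[results['mean_test_score'].idxmax()].to_dict()
```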

The multi-output format can also be useful for training multiple models in one transformation, as sketched below.
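The following sketch (not part of the original example; the output paths and the two-estimator setup are hypothetical) reuses the vectorizer pattern from the main example to save two models from a single training job:

```python
from transforms.api import transform, Input, Output
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from foundry_ml import Model, Stage
from foundry_ml_sklearn.utils import extract_matrix


@transform(
    training_data=Input('/Public/.../training_data'),
    out_linear=Output('/Public/.../linear_model'),
    out_forest=Output('/Public/.../forest_model'),
)
def train_two_models(training_data, out_linear, out_forest):
    training = training_data.dataframe().toPandas()

    # Shared vectorizer stage, as in the main example
    column_transformer = make_column_transformer(
        (StandardScaler(), ['median_income', 'housing_median_age', 'total_rooms'])
    )
    column_transformer.fit(training)
    vectorizer = Stage(column_transformer)

    training_df = vectorizer.transform(training)
    X = extract_matrix(training_df, 'features')
    y = training_df['median_house_value']

    # Fit two candidate estimators on the same feature matrix
    linear = LinearRegression().fit(X, y)
    forest = RandomForestRegressor(n_estimators=10).fit(X, y)

    # Save each estimator as its own Model to its own output dataset
    Model(vectorizer, Stage(linear, input_column_name='features')).save(out_linear)
    Model(vectorizer, Stage(forest, input_column_name='features')).save(out_forest)
```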

The example below extracts the top model and saves the relevant evaluation metrics. It uses the housing data from the modeling objective tutorial.

```python
from transforms.api import transform, Input, Output, configure
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
import pandas as pd

from foundry_ml_sklearn.utils import extract_matrix
from foundry_ml import Model, Stage
from foundry_ml_metrics import MetricSet


@configure(profile=['DRIVER_CORES_EXTRA_LARGE', 'DRIVER_MEMORY_OVERHEAD_EXTRA_LARGE'])
@transform(
    training_data=Input('/Public/.../training_data'),
    out_model=Output('/Public/.../mo_model'),
    out_metrics=Output('/Public/.../mo_training_metrics'),
)
def train_model(training_data, out_model, out_metrics):
    training = training_data.dataframe().toPandas()

    # Scale the numeric columns and combine them into a feature vector
    column_transformer = make_column_transformer(
        (StandardScaler(), ['median_income', 'housing_median_age', 'total_rooms'])
    )

    # Fit the column transformer to act as a vectorizer
    column_transformer.fit(training)

    # Wrap the vectorizer as a Stage to indicate this is the transformation that will be applied in the Model
    vectorizer = Stage(column_transformer)
    training_df = vectorizer.transform(training)

    # Invoke a helper function to convert a column of vectors into a NumPy matrix and handle sparsity
    X = extract_matrix(training_df, 'features')
    y = training_df['median_house_value']

    # Create the parameter grid of options to search over
    param_grid = [
        {'n_estimators': [3, 10, 30], 'max_features': [2, 3]},
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3]},
    ]

    # Select the algorithm and the metric to cross-validate on
    forest_reg = RandomForestRegressor()
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error')
    grid_search.fit(X, y)

    # Collect the numeric results of the internal cross-validation
    cv_results = pd.DataFrame(grid_search.cv_results_).select_dtypes(include=['number'])

    # Extract the metrics of the top-ranked model as a dictionary
    cv_results = cv_results[cv_results['rank_test_score'] == 1].to_dict('records')[0]

    # Build a Model object that contains the pipeline of transformations
    model = Model(vectorizer, Stage(grid_search.best_estimator_, input_column_name='features'))

    # Save the best scores as metrics
    metric_set = MetricSet(
        model=model,
        input_data=training_data
    )

    # Add each metric to the MetricSet
    for key, value in cv_results.items():
        metric_set.add(name=key, value=value)

    model.save(out_model)
    metric_set.save(out_metrics)
```