Perform a GridSearch

Sunsetted functionality

The documentation below describes the foundry_ml library, which is no longer recommended for use in the platform. Instead, use the palantir_models library. You can also learn how to migrate a model from the foundry_ml to the palantir_models framework through an example.

The foundry_ml library will be removed on October 31, 2025, coinciding with the planned deprecation of Python 3.9.

When developing models, you may want to experiment with hyperparameters until you find an effective combination of values. Within Foundry, you can use common libraries or custom code to perform hyperparameter optimization as part of a training job, then save one or more "winning" models along with their metrics and metadata.

This example shows how to use scikit-learn's GridSearchCV via Python transforms within Code Repositories. GridSearchCV wraps a scikit-learn-compatible estimator in a "fittable" class that performs cross-validation across a parameter grid and exposes the best model.
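For orientation, here is a minimal standalone use of GridSearchCV outside Foundry. The data and grid values are illustrative, not taken from the tutorial dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative data standing in for the housing features
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)

param_grid = {'n_estimators': [3, 10, 30], 'max_features': [1, 2, 3]}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
)
grid_search.fit(X, y)

print(grid_search.best_params_)     # winning hyperparameter combination
print(grid_search.best_estimator_)  # model refit on all data with those parameters
```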

The implementation shown performs the grid search on a single, large driver. Refer to the Spark profiles documentation to see how to enable profiles for a repository. It is also possible to leverage Spark to implement a distributed grid search (for example, using a grouped pandas UDF and a training function); this is not part of the main example, but a sketch of the approach follows.
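The sketch below is not part of the original example; it distributes one cross-validated fit per hyperparameter combination using Spark's grouped-map pandas API (`applyInPandas`). The dataset, grid values, and helper names are illustrative assumptions:

```python
from itertools import product

import pandas as pd
from pyspark.sql import SparkSession
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

spark = SparkSession.builder.getOrCreate()  # inside a transform, use the transform context's Spark session

# Illustrative training data; in practice this would come from the input dataset
X, y = make_regression(n_samples=500, n_features=3, noise=0.1, random_state=0)

# One row per hyperparameter combination
grid = pd.DataFrame(
    [{'n_estimators': n, 'max_features': f}
     for n, f in product([3, 10, 30], [2, 3])]
)
grid_df = spark.createDataFrame(grid)

# Broadcast the (assumed small) training matrix to every executor
X_bc = spark.sparkContext.broadcast(X)
y_bc = spark.sparkContext.broadcast(y)


def fit_one_combination(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group contains exactly one parameter combination
    params = pdf.iloc[0]
    model = RandomForestRegressor(
        n_estimators=int(params['n_estimators']),
        max_features=int(params['max_features']),
    )
    scores = cross_val_score(model, X_bc.value, y_bc.value,
                             cv=5, scoring='neg_mean_squared_error')
    return pdf.assign(mean_test_score=scores.mean())


# Fit every combination in parallel, one Spark task per group
results = (
    grid_df
    .groupBy('n_estimators', 'max_features')
    .applyInPandas(
        fit_one_combination,
        schema='n_estimators long, max_features long, mean_test_score double',
    )
    .toPandas()
)

# neg_mean_squared_error is negative, so the maximum is the best score
best_params = results.loc[results['mean_test_score'].idxmax()].to_dict()
```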

The multi-output format can also be useful for training multiple models in one transformation, as sketched below.
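The following sketch (not part of the original example; the output paths and the two-estimator setup are hypothetical) reuses the vectorizer pattern from the main example to save two models from a single training job:

```python
from transforms.api import transform, Input, Output
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from foundry_ml import Model, Stage
from foundry_ml_sklearn.utils import extract_matrix


@transform(
    training_data=Input('/Public/.../training_data'),
    out_linear=Output('/Public/.../linear_model'),
    out_forest=Output('/Public/.../forest_model'),
)
def train_two_models(training_data, out_linear, out_forest):
    training = training_data.dataframe().toPandas()

    # Shared vectorizer stage, as in the main example
    column_transformer = make_column_transformer(
        (StandardScaler(), ['median_income', 'housing_median_age', 'total_rooms'])
    )
    column_transformer.fit(training)
    vectorizer = Stage(column_transformer)

    training_df = vectorizer.transform(training)
    X = extract_matrix(training_df, 'features')
    y = training_df['median_house_value']

    # Fit two candidate estimators on the same feature matrix
    linear = LinearRegression().fit(X, y)
    forest = RandomForestRegressor(n_estimators=10).fit(X, y)

    # Save each estimator as its own Model to its own output dataset
    Model(vectorizer, Stage(linear, input_column_name='features')).save(out_linear)
    Model(vectorizer, Stage(forest, input_column_name='features')).save(out_forest)
```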

The example below extracts the top model and saves the relevant evaluation metrics. It uses the housing data from the modeling objective tutorial.

```python
from transforms.api import transform, Input, Output, configure
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
import pandas as pd

from foundry_ml_sklearn.utils import extract_matrix
from foundry_ml import Model, Stage
from foundry_ml_metrics import MetricSet


@configure(profile=['DRIVER_CORES_EXTRA_LARGE', 'DRIVER_MEMORY_OVERHEAD_EXTRA_LARGE'])
@transform(
    training_data=Input('/Public/.../training_data'),
    out_model=Output('/Public/.../mo_model'),
    out_metrics=Output('/Public/.../mo_training_metrics'),
)
def train_model(training_data, out_model, out_metrics):
    training = training_data.dataframe().toPandas()

    # Scale the numeric columns and combine them into a feature vector
    column_transformer = make_column_transformer(
        (StandardScaler(), ['median_income', 'housing_median_age', 'total_rooms'])
    )

    # Fit the column transformer to act as a vectorizer
    column_transformer.fit(training)

    # Wrap the vectorizer as a Stage to indicate this is the transformation that will be applied in the Model
    vectorizer = Stage(column_transformer)
    training_df = vectorizer.transform(training)

    # Invoke a helper function to convert a column of vectors into a NumPy matrix and handle sparsity
    X = extract_matrix(training_df, 'features')
    y = training_df['median_house_value']

    # Create the parameter grid of options to search over
    param_grid = [
        {'n_estimators': [3, 10, 30], 'max_features': [2, 3]},
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3]},
    ]

    # Select the algorithm and the metric to cross-validate on
    forest_reg = RandomForestRegressor()
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error')
    grid_search.fit(X, y)

    # Collect the numeric results of the internal cross-validation
    cv_results = pd.DataFrame(grid_search.cv_results_).select_dtypes(include=['number'])

    # Extract the metrics of the top-ranked model as a dictionary
    cv_results = cv_results[cv_results['rank_test_score'] == 1].to_dict('records')[0]

    # Build a Model object that contains the pipeline of transformations
    model = Model(vectorizer, Stage(grid_search.best_estimator_, input_column_name='features'))

    # Save the best scores as metrics
    metric_set = MetricSet(
        model=model,
        input_data=training_data
    )

    # Add each metric to the MetricSet
    for key, value in cv_results.items():
        metric_set.add(name=key, value=value)

    model.save(out_model)
    metric_set.save(out_metrics)
```