Train a binary classification model with scikit-learn in Code Repositories

The following documentation provides an example on how to train a scikit-learn binary classification model using the open source UCI ML Breast Cancer Wisconsin (Diagnostic) ↗ dataset in the Code Repositories application using the Model Training Template.

For a detailed walkthrough of the following steps, including how to author a model adapter and write Python transforms for model training, refer to our documentation on how to train a model in Code Repositories.

1. Author a model adapter

First, author a model adapter using the Model Training Template in Code Repositories.

The example logic below assumes the following:

  • This model adapter is initialized with a scikit-learn model.
  • The data being provided to this model is tabular.
  • The output of this model is tabular with all columns from columns, prediction, probability_0, and probability_1, where,
    • prediction is 0 or 1, with 0 being no cancer detected, and 1 being cancer detected.
    • probability_0 is the probability that cancer was not detected.
    • probability_1 is the probability that cancer was detected.
  • The following dependencies have been added to the repository: python 3.8.18, pandas 1.5.3, scikit-learn 1.3.2, and dill 0.3.7
Copied!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 import palantir_models as pm from palantir_models_serializers import * class SklearnClassificationAdapter(pm.ModelAdapter): @pm.auto_serialize( model=DillSerializer() ) def __init__(self, model): self.model = model @classmethod def api(cls): columns = [ 'mean_radius', 'mean_texture', 'mean_perimeter', 'mean_area', 'mean_smoothness', 'mean_compactness', 'mean_concavity', 'mean_concave_points', 'mean_symmetry', 'mean_fractal_dimension', 'radius_error', 'texture_error', 'perimeter_error', 'area_error', 'smoothness_error', 'compactness_error', 'concavity_error', 'concave_points_error', 'symmetry_error', 'fractal_dimension_error', 'worst_radius', 'worst_texture', 'worst_perimeter', 'worst_area', 'worst_smoothness', 'worst_compactness', 'worst_concavity', 'worst_concave_points', 'worst_symmetry', 'worst_fractal_dimension' ] inputs = {"df_in": pm.Pandas(columns=columns)} outputs = {"df_out": pm.Pandas(columns= columns + [ ("prediction", int), ("probability_0", float), ("probability_1", float) ])} return inputs, outputs def predict(self, df_in): X = df_in.copy() predictions = self.model.predict(X) probabilities = self.model.predict_proba(X) df_in['prediction'] = predictions for idx, label in enumerate(self.model.classes_): df_in[f"probability_{label}"] = probabilities[:, idx] return df_in

2. Write a Python transform to train a model

In the same repository in model_training/model_training.py, author the model training logic.

This example uses the open source UCI ML Breast Cancer Wisconsin (Diagnostic) dataset ↗ provided in the scikit-learn library.

Copied!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 from transforms.api import transform from palantir_models.transforms import ModelOutput from main.model_adapters.adapter import SklearnClassificationAdapter from sklearn.datasets import load_breast_cancer from sklearn.compose import make_column_transformer from sklearn.impute import SimpleImputer from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler from sklearn.ensemble import RandomForestClassifier @transform( model_output=ModelOutput("/path/to/model_asset"), ) def compute(model_output): X_train, y_train = load_breast_cancer(as_frame=True, return_X_y=True) X_train.columns = X_train.columns.str.replace(' ', '_') columns = X_train.columns numeric_transformer = Pipeline( steps=[ ("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler()) ] ) preprocessor = make_column_transformer( (numeric_transformer, columns), remainder="passthrough" ) model = Pipeline( steps=[ ("preprocessor", preprocessor), ("classifier", RandomForestClassifier(n_estimators=50, max_depth=3)) ] ) model.fit(X_train, y_train) foundry_model = SklearnClassificationAdapter(model) model_output.publish(model_adapter=foundry_model)

3. Consume the Model

Run inference in a Python transform

You can run inference with your model in a Python transform. For example, once your model has been trained, copy the below inference logic into the model_training/run_inference.py file and select Build.

Copied!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 from transforms.api import transform, Output from palantir_models.transforms import ModelInput from sklearn.datasets import load_breast_cancer @transform( inference_output=Output("ri.foundry.main.dataset.5dd9907f-79bc-4ae9-a106-1fa87ff021c3"), model=ModelInput("ri.models.main.model.cfc11519-28be-4f3e-9176-9afe91ecf3e1"), ) def compute(inference_output, model): X, y = load_breast_cancer(as_frame=True, return_X_y=True) X.columns = X.columns.str.replace(' ', '_') inference_results = model.transform(X) inference_output.write_pandas(inference_results.df_out)

Perform live inference in a modeling objective

A Palantir model can be submitted to a modeling objective for the following:

After submitting this model to a modeling objective, you can launch a sandbox deployment to host this model for live inference. Once the sandbox is launched and ready, you can perform live inference and connect this model to an operational application.

The example below shows input for the binary classification model using the single I/O endpoint:

[
  {
    "mean_radius": 15.09,
    "mean_texture": 23.71,
    "mean_perimeter": 92.65,
    "mean_area": 944.07,
    "mean_smoothness": 0.53,
    "mean_compactness": 0.21,
    "mean_concavity": 0.76,
    "mean_concave_points": 0.39,
    "mean_symmetry": 0.08,
    "mean_fractal_dimension": 0.14,
    "radius_error": 0.49,
    "texture_error": 0.82,
    "perimeter_error": 2.51,
    "area_error": 17.22,
    "smoothness_error": 0.07,
    "compactness_error": 0.01,
    "concavity_error": 0.05,
    "concave_points_error": 0.05,
    "symmetry_error": 0.01,
    "fractal_dimension_error": 0.08,
    "worst_radius": 12.95,
    "worst_texture": 20.66,
    "worst_perimeter": 185.41,
    "worst_area": 624.87,
    "worst_smoothness": 0.18,
    "worst_compactness": 0.26,
    "worst_concavity": 0.01,
    "worst_concave_points": 0.05,
    "worst_symmetry": 0.29,
    "worst_fractal_dimension": 0.05
  }
]

A sandbox deployment for a scikit-learn binary classification model.