The documentation below describes the foundry_ml library, which is no longer recommended for use in the platform. Instead, use the palantir_models library. You can also learn how to migrate a model from the foundry_ml to the palantir_models framework through an example.

The foundry_ml library will be removed on October 31, 2025, corresponding with the planned deprecation of Python 3.9.
spaCy ↗ is a popular open-source software library for advanced Natural Language Processing (NLP). This example walks you through registering a spaCy model with the foundry_ml Stage registry. The code follows the example on spaCy's website for Customizing spaCy's Tokenizer class ↗ and leverages the Named Entity Recognition (NER) model.
This tutorial builds upon the fundamental building blocks of model integration in Foundry and assumes you are familiar with the foundry_ml Stage registry. See the adding additional library support documentation for more information; note that this tutorial implements most of the steps detailed in that documentation.

This tutorial also assumes the pre-trained spaCy language model is available in your authoring environment. If you do not see it available, you may need to either enable access through an open-source conda channel or set up your own private conda channel to make it available. Contact your Palantir representative for guidance.
We want to create a model that takes in a dataframe with a "text" column and applies the model to each row. To do this, we need to define a wrapper class, SpacyNERModel, and register its transform function, serializer, and deserializer. All of these decorated functions are necessary for Foundry to properly serialize and execute the model.
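The row-wise pattern the wrapper uses can be sketched with plain pandas. Here `extract_entities` is a hypothetical stand-in for the spaCy pipeline (it simply tags capitalized tokens); the real `predict` method defined below returns richer `(text, start_char, end_char, label)` tuples.

```python
import pandas as pd

# Hypothetical stand-in for the spaCy pipeline: tags capitalized tokens
# and records where they start in the string.
def extract_entities(text):
    return [(tok, text.index(tok)) for tok in text.split() if tok[:1].isupper()]

# The same row-wise pattern predict_df uses: apply over the "text" column
# and store the per-row results in a new "entities" column.
df = pd.DataFrame({"text": ["Alice met Bob", "hello world"]})
df["entities"] = df["text"].apply(extract_entities)
print(df["entities"].tolist())  # [[('Alice', 0), ('Bob', 10)], []]
```

This is only the dataframe plumbing; the actual entity extraction is delegated to the spaCy model in the class below.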
The first step is to create a new Code Repository of type "Python Library" and set up the project structure. In this example, we have created a spacy-custom-ner-stage repository with a spacy_custom_stage folder containing two Python files, model.py and model_class_registry.py. Note that you do not have to separate the code defined below into two files; we have done so here to provide a cleaner conceptual split.
├── conda_recipe
│   └── meta.yaml
├── settings.gradle
├── src
│   ├── spacy_custom_stage
│   │   ├── __init__.py
│   │   ├── model.py
│   │   └── model_class_registry.py
│   ├── setup.cfg
│   └── setup.py
└── README.md
In the model.py file, we will follow the spaCy tutorial to write a custom tokenizer and define our new model class:
```python
import re

import spacy
from spacy.tokenizer import Tokenizer


# create custom tokenizer
def custom_tokenizer(nlp):
    special_cases = {":)": [{"ORTH": ":)"}]}
    prefix_re = re.compile(r'''^[[("']''')
    suffix_re = re.compile(r'''[])"']$''')
    infix_re = re.compile(r'''[-~]''')
    simple_url_re = re.compile(r'''^https?://''')
    return Tokenizer(nlp.vocab, rules=special_cases,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     url_match=simple_url_re.match)


# define wrapper class for foundry_ml registry
class SpacyNERModel:
    # Takes in the name of a language model to load and attaches the custom tokenizer
    def __init__(self, model_name):
        self.spacy = spacy.load(model_name)
        self.spacy.tokenizer = custom_tokenizer(self.spacy)

    # Returns extracted entities using the loaded spaCy model
    # This function can be adapted to perform different NLP tasks
    def predict(self, text):
        doc = self.spacy(text)
        results = [(ent.text, ent.start_char, ent.end_char, ent.label_)
                   for ent in doc.ents]
        return results

    # Predict function that operates on a dataframe; assumes a "text" column exists
    def predict_df(self, df):
        df["entities"] = df["text"].apply(self.predict)
        return df
```
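To see what each tokenizer rule does, the same regular expressions can be exercised directly with Python's `re` module. This is a quick sketch outside of spaCy; during tokenization, spaCy applies these searches iteratively rather than once per string.

```python
import re

# Identical patterns to custom_tokenizer above
prefix_re = re.compile(r'''^[[("']''')
suffix_re = re.compile(r'''[])"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

print(bool(prefix_re.search('("hello')))                       # True: leading ( is split off
print(bool(suffix_re.search('world")')))                       # True: trailing ) is split off
print([m.group() for m in infix_re.finditer("well-known~x")])  # ['-', '~']
print(bool(simple_url_re.match("https://example.com")))        # True: URLs stay one token
```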
In the model_class_registry.py file, we will implement the serialize, deserialize, and transform functions.
```python
import os
import tempfile

import dill

# import newly created model class
from spacy_custom_stage.model import SpacyNERModel

# import Foundry python functions
from foundry_ml.stage.flexible import stage_transform, register_stage_transform_for_class
from foundry_ml.stage.serialization import deserializer, serializer, register_serializer_for_class
from foundry_object.utils import safe_upload_file, download_file


# Annotate a function that will wrap the model and data passed between stages.
@stage_transform()
def _transform(model, df):
    return model.predict_df(df)


# Call this to send to the python Stage Registry; force=True overrides any existing registered transform
register_stage_transform_for_class(SpacyNERModel, _transform, force=True)


# Deserializer decorator
@deserializer("spacy_ner_model.dill", force=True)
def _deserializer(filesystem, path):
    with tempfile.TemporaryDirectory() as tmpdir:
        local_path = os.path.join(tmpdir, "file.dill")
        download_file(filesystem, path, local_path)
        # dill requires binary mode for reading
        with open(local_path, "rb") as f:
            model = dill.load(f)
        return model


# Serializer decorator
@serializer(_deserializer)
def _serializer(filesystem, value):
    path = 'spacy_ner_model.dill'
    with tempfile.NamedTemporaryFile() as tmp:
        dill.dump(value, tmp)
        tmp.flush()  # ensure all bytes are on disk before uploading
        safe_upload_file(filesystem, tmp.name, path, base64_encode=True)
    return path


register_serializer_for_class(SpacyNERModel, _serializer, force=True)
```
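The serializer/deserializer pair above is essentially a binary round trip through dill. The sketch below shows the same round trip using the standard library's pickle as a stand-in (dill exposes the same `dump`/`load` API); the dict contents are illustrative only. Note that the file must be opened in binary mode (`"wb"`/`"rb"`), which is the bug this pattern is easiest to hit.

```python
import os
import pickle  # stand-in for dill here; dill exposes the same dump/load API
import tempfile

obj = {"model_name": "en_core_web_sm", "labels": ["ORG", "GPE"]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.dill")
    # serialize: binary write mode is required
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    # deserialize: "rb", not the default text mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == obj)  # True
```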
Next, we need to make sure that all the required packages are added to the meta.yaml file. In this example, the run section of your file should look similar to this:
```yaml
# Any packages required to run your package
run:
  - python 3.8.*
  - spacy
  - foundry_ml
  - dill
```
Now, import the contents of the two files we just created into the __init__.py. Note that you do not have to follow the exact convention below.
```python
from spacy_custom_stage.model import SpacyNERModel
from .model_class_registry import *

__all__ = ["SpacyNERModel"]
```
Make sure to add the following code to the setup.py file, inside the setup() call.
```python
entry_points={'foundry_ml.plugins': ['plugin=spacy_custom_stage']}
```
Once you commit, build, and tag a release, your new model class should be available to leverage in Code Workbook or Code Repositories.
With your custom library saved and published, you can now create, save, and use a spaCy model. This step can be done using Code Repositories or Code Workbook. The screenshots and code snippets below are from Code Workbook.
Again, note that en_core_web_sm is a pre-trained spaCy language model and can be imported into your Python environment as a conda package.
Ensure that you add versions 2.3.x of spacy-model-en_core_web_md and spaCy as dependencies to your coding environment before running your code.
The Python model class we just created should now be available to import into your Python environment as spacy-custom-ner-stage. Note that if you are developing in a Code Repository, you will need to add spacy-custom-ner-stage as a backing repository as well.
To create a model:
```python
import spacy
from foundry_ml import Model, Stage

# import our new model class
from spacy_custom_stage import SpacyNERModel


def spacy_model():
    # pass in a spaCy model with vectors
    model = SpacyNERModel('en_core_web_sm')
    return Model(Stage(model))
```
Now to apply the model:
```python
def model_inference(spacy_model):
    import pandas as pd

    # example dataset
    df = pd.DataFrame({"text": ["The White House is a white building in Washington D.C.",
                                "Cats is a Broadway musical in New York"]})
    output = spacy_model.transform(df)
    return output
```