The documentation below describes the foundry_ml library, which is no longer recommended for use in the platform. Instead, use the palantir_models library. You can also learn how to migrate a model from the foundry_ml to the palantir_models framework through an example.

The foundry_ml library will be removed on October 31, 2025, corresponding with the planned deprecation of Python 3.9.
spaCy ↗ is a popular open-source software library for advanced Natural Language Processing (NLP). This example walks you through registering a spaCy model with the foundry_ml Stage registry. The code follows the example on spaCy's website for Customizing spaCy's Tokenizer class ↗ and leverages the Named Entity Recognition (NER) model.
This tutorial builds upon the fundamental building blocks of model integration in Foundry and assumes you are familiar with the foundry_ml Stage registry. See the adding additional library support documentation for more information; note that this tutorial implements most of the steps detailed in that documentation.

This tutorial also assumes the pre-trained spaCy language model is available in your authoring environment. If you do not see it available, you may need to either enable access through an open-source conda channel or set up your own private conda channel to make it available. Contact your Palantir representative for guidance.
We want to create a model that takes in a dataframe with a "text" column and applies the model to each row. To do this, we need to define a wrapper class, SpacyNERModel, and register its transform function, serializer, and deserializer. All of these decorated functions are necessary for Foundry to properly serialize and execute the model.
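The row-wise pattern the wrapper uses can be sketched with plain pandas. Here `extract_entities` is a hypothetical stand-in for the spaCy pipeline (it simply tags capitalized tokens); the real `predict` method defined below returns richer `(text, start_char, end_char, label)` tuples.

```python
import pandas as pd

# Hypothetical stand-in for the spaCy pipeline: tags capitalized tokens
# and records where they start in the string.
def extract_entities(text):
    return [(tok, text.index(tok)) for tok in text.split() if tok[:1].isupper()]

# The same row-wise pattern predict_df uses: apply over the "text" column
# and store the per-row results in a new "entities" column.
df = pd.DataFrame({"text": ["Alice met Bob", "hello world"]})
df["entities"] = df["text"].apply(extract_entities)
print(df["entities"].tolist())  # [[('Alice', 0), ('Bob', 10)], []]
```

This is only the dataframe plumbing; the actual entity extraction is delegated to the spaCy model in the class below.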
The first step is to create a new Code Repository of type "Python Library" and set up the project structure. In this example, we have created a spacy-custom-ner-stage repository with a spacy_custom_stage folder containing two Python files, model.py and model_class_registry.py. Note that you do not have to separate the code defined below into two files; we have done so here to provide a cleaner conceptual split.
├── conda_recipe
│   └── meta.yaml
├── settings.gradle
├── src
│   ├── spacy_custom_stage
│   │   ├── __init__.py
│   │   ├── model.py
│   │   └── model_class_registry.py
│   ├── setup.cfg
│   └── setup.py
└── README.md
In the model.py file, we will follow the spaCy tutorial to write a custom tokenizer and define our new model class:
```python
import re

import spacy
from spacy.tokenizer import Tokenizer


# create custom tokenizer
def custom_tokenizer(nlp):
    special_cases = {":)": [{"ORTH": ":)"}]}
    prefix_re = re.compile(r'''^[[("']''')
    suffix_re = re.compile(r'''[])"']$''')
    infix_re = re.compile(r'''[-~]''')
    simple_url_re = re.compile(r'''^https?://''')
    return Tokenizer(nlp.vocab, rules=special_cases,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     url_match=simple_url_re.match)


# define wrapper class for foundry_ml registry
class SpacyNERModel:
    # Takes in the name of a language model to load and attaches the custom tokenizer
    def __init__(self, model_name):
        self.spacy = spacy.load(model_name)
        self.spacy.tokenizer = custom_tokenizer(self.spacy)

    # Returns extracted entities using the loaded spaCy model
    # This function can be adapted to perform different NLP tasks
    def predict(self, text):
        doc = self.spacy(text)
        results = [(ent.text, ent.start_char, ent.end_char, ent.label_)
                   for ent in doc.ents]
        return results

    # Predict function that operates on a dataframe; assumes a "text" column exists
    def predict_df(self, df):
        df["entities"] = df["text"].apply(self.predict)
        return df
```
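To see what each tokenizer rule does, the same regular expressions can be exercised directly with Python's `re` module. This is a quick sketch outside of spaCy; during tokenization, spaCy applies these searches iteratively rather than once per string.

```python
import re

# Identical patterns to custom_tokenizer above
prefix_re = re.compile(r'''^[[("']''')
suffix_re = re.compile(r'''[])"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

print(bool(prefix_re.search('("hello')))                       # True: leading ( is split off
print(bool(suffix_re.search('world")')))                       # True: trailing ) is split off
print([m.group() for m in infix_re.finditer("well-known~x")])  # ['-', '~']
print(bool(simple_url_re.match("https://example.com")))        # True: URLs stay one token
```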
In the model_class_registry.py file, we will implement the serialize, deserialize, and transform functions.
```python
import os
import tempfile

import dill

# import newly created model class
from spacy_custom_stage.model import SpacyNERModel

# import Foundry python functions
from foundry_ml.stage.flexible import stage_transform, register_stage_transform_for_class
from foundry_ml.stage.serialization import deserializer, serializer, register_serializer_for_class
from foundry_object.utils import safe_upload_file, download_file


# Annotate a function that will wrap the model and data passed between stages.
@stage_transform()
def _transform(model, df):
    return model.predict_df(df)


# Call this to send to the python Stage Registry; force=True overrides any existing registered transform
register_stage_transform_for_class(SpacyNERModel, _transform, force=True)


# Deserializer decorator
@deserializer("spacy_ner_model.dill", force=True)
def _deserializer(filesystem, path):
    with tempfile.TemporaryDirectory() as tmpdir:
        local_path = os.path.join(tmpdir, "file.dill")
        download_file(filesystem, path, local_path)
        # dill requires binary mode for reading
        with open(local_path, "rb") as f:
            model = dill.load(f)
        return model


# Serializer decorator
@serializer(_deserializer)
def _serializer(filesystem, value):
    path = 'spacy_ner_model.dill'
    with tempfile.NamedTemporaryFile() as tmp:
        dill.dump(value, tmp)
        tmp.flush()  # ensure all bytes are on disk before uploading
        safe_upload_file(filesystem, tmp.name, path, base64_encode=True)
    return path


register_serializer_for_class(SpacyNERModel, _serializer, force=True)
```
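The serializer/deserializer pair above is essentially a binary round trip through dill. The sketch below shows the same round trip using the standard library's pickle as a stand-in (dill exposes the same `dump`/`load` API); the dict contents are illustrative only. Note that the file must be opened in binary mode (`"wb"`/`"rb"`), which is the bug this pattern is easiest to hit.

```python
import os
import pickle  # stand-in for dill here; dill exposes the same dump/load API
import tempfile

obj = {"model_name": "en_core_web_sm", "labels": ["ORG", "GPE"]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.dill")
    # serialize: binary write mode is required
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    # deserialize: "rb", not the default text mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == obj)  # True
```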
Next, we need to make sure that all the required packages are added to the meta.yaml file. In this example, the run section of your file should look similar to this:
```yaml
# Any packages required to run your package
run:
  - python 3.8.*
  - spacy
  - foundry_ml
  - dill
```
Now, import the contents of the two files we just created into the __init__.py. Note that you do not have to follow the exact convention below.
```python
from spacy_custom_stage.model import SpacyNERModel
from .model_class_registry import *

__all__ = ["SpacyNERModel"]
```
Make sure to add the following code to the setup.py file, inside the setup() call.
```python
entry_points={'foundry_ml.plugins': ['plugin=spacy_custom_stage']}
```
Once you commit, build, and tag a release, your new model class should be available to leverage in Code Workbook or Code Repositories.
With your custom library saved and published, you can now create, save, and use a spaCy model. This step can be done using Code Repositories or Code Workbook. The screenshots and code snippets below are from Code Workbook.
Again, note that en_core_web_sm is a pre-trained spaCy language model and can be imported into your Python environment as a conda package.
Ensure that you add versions 2.3.x of spacy-model-en_core_web_md and spaCy as dependencies to your coding environment before running your code.
The Python model class we just created should now be available to import into your Python environment as spacy-custom-ner-stage. Note that if you are developing in a Code Repository, you will need to add spacy-custom-ner-stage as a backing repository as well.
To create a model:
```python
import spacy
from foundry_ml import Model, Stage

# import our new model class
from spacy_custom_stage import SpacyNERModel


def spacy_model():
    # pass in a spaCy model with vectors
    model = SpacyNERModel('en_core_web_sm')
    return Model(Stage(model))
```
Now to apply the model:
```python
def model_inference(spacy_model):
    import pandas as pd

    # example dataset
    df = pd.DataFrame({"text": ["The White House is a white building in Washington D.C.",
                                "Cats is a Broadway musical in New York"]})
    output = spacy_model.transform(df)
    return output
```