Create a media set batch pipeline with Pipeline Builder

In this tutorial, we will use Pipeline Builder to create a simple batch pipeline that extracts text from PDFs stored in a media set.

For this example, we use PDFs of publicly-available documents published by Palantir.

At the end of this tutorial, you will have a pipeline that looks like the following:

Screenshot of the complete pipeline in Pipeline Builder

The pipeline will produce a new object output containing the extracted PDF text, which can be used for further exploration.

Part 1: Initial setup

First, we need to create a new pipeline.

  1. After logging in to Foundry, open Pipeline Builder from the left navigation bar. If Pipeline Builder is not in the list of applications, select View all and find Pipeline Builder under the Build & Monitor Pipelines section.

    Screenshot of Pipeline builder link on navigation bar

  2. In the top right of the Pipeline Builder landing page, select New pipeline. Then, choose Batch pipeline.

    Screenshot of Pipeline selection

    The ability to create a streaming pipeline is not available on all Foundry environments. Contact your Palantir representative for more information if your use case requires it.

  3. Select a location to save your pipeline. Note that pipelines cannot be saved in personal folders.

    Screenshot of Choose pipeline location popover

  4. Choose Create pipeline.

Part 2: Add media sets

Now we can add media sets to our pipeline workflow. For this tutorial, we will use PDFs of publicly available documents from Palantir.

  1. From the Pipeline Builder page, select Add Foundry data on the home page.

    Screenshot of Choose pipeline location popover

    You can also select the Add data action on the top panel.

    Screenshot of Choose pipeline location popover

    Alternatively, you can drag and drop a file from your computer to use as your media set.

  2. If you selected Add data or Add Foundry data, you will be given the option to select your desired media sets.

    Screenshot of Add media set from location popover

  3. After selecting all the media sets you need, choose Add data.

  4. After you import your media set, it appears with a thumbnail preview.

    Screenshot of imported media set

Part 3: Media set transformations

After adding raw media sets, we can perform some basic transformations. For this workflow, we will extract the text from these PDF files.

Extract text from PDF

First, we will transform the media set named Media set of Annual Letters. Start by getting the media references of the media items in the media set.

Get media references

  1. Choose the Media set of Annual Letters node in your graph.

  2. Select Transform.

    Screenshot of Media set of Annual Letters node

  3. Search for and select the Convert media set to table rows transform from the dropdown to open the board.

    Screenshot of media set to table board

  4. Select whether to include the timestamp and whether to deduplicate by path.

    Screenshot of media reference board

  5. Choose Apply to add the transform to your pipeline.

  6. Your output should look like this:

    Screenshot of Cast board

    Example media reference:

    {"mimeType":"application/pdf","reference":{"type":"mediaSetItem","mediaSetItem":{"mediaSetRid":"ri.mio.main.media-set.xxx","mediaItemRid":"ri.mio.main.media-item.xxx"}}}
    

    Example media item RID:

    ri.mio.main.media-item.xxx-xxx-xxx-xxx-xxxx
    

    Learn more about media references.
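
The media reference shown in step 6 is a JSON string, so its fields can be read with any JSON parser. The following is a minimal sketch in plain Python (outside Pipeline Builder) that pulls out the MIME type and the two RIDs; the function name `parse_media_reference` and the output key names are illustrative assumptions, and the `xxx` placeholders are kept exactly as they appear in the example above.

```python
import json

# Example media reference string copied from the tutorial output;
# the "xxx" placeholders stand in for real identifiers.
raw = (
    '{"mimeType":"application/pdf","reference":{"type":"mediaSetItem",'
    '"mediaSetItem":{"mediaSetRid":"ri.mio.main.media-set.xxx",'
    '"mediaItemRid":"ri.mio.main.media-item.xxx"}}}'
)


def parse_media_reference(value: str) -> dict:
    """Return the MIME type and RIDs from a media reference JSON string.

    Hypothetical helper for illustration; not part of Pipeline Builder.
    """
    ref = json.loads(value)
    item = ref["reference"]["mediaSetItem"]
    return {
        "mime_type": ref["mimeType"],
        "media_set_rid": item["mediaSetRid"],
        "media_item_rid": item["mediaItemRid"],
    }


parsed = parse_media_reference(raw)
print(parsed["mime_type"])
print(parsed["media_item_rid"])
```

This only illustrates the shape of the reference. Inside Pipeline Builder, the Text Extraction board consumes the media reference column directly, so no manual parsing is needed.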

Extract text

  1. Now that you have media references, you can add a board that uses them. Search for and select the Text Extraction transform from the dropdown.

    Text extraction board

  2. Select the extraction method (Raw text (PDF parsing) or OCR), the Media Reference column, the OCR output format (if you chose OCR), and the Languages/Scripts.

    Text Extraction options

  3. Choose Apply to add the transform to your pipeline.

  4. Your output should look like this when you hover over the extracted text:

    Text extraction output with Hover

    You can now run available string transformations on the extracted text column.

  5. Select Back to graph at the top right to return to your pipeline graph.

    Screenshot of the transform
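
To give a sense of the kind of string cleanup that step 4 refers to, here is a minimal sketch in plain Python rather than Pipeline Builder boards. The function name `clean_extracted_text` and the sample text are made up for illustration; text extracted from PDFs often contains stray line breaks and hyphenated line endings, which is what this sketch assumes.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Normalize whitespace in text pulled from a PDF (illustrative helper)."""
    # Rejoin words hyphenated across a line break, e.g. "share-\nholders".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    # Drop leading and trailing whitespace.
    return text.strip()


# Hypothetical sample, not taken from a real document.
sample = "  Annual  Letter\n\nto share-\nholders \t2023 "
cleaned = clean_extracted_text(sample)
print(cleaned)  # Annual Letter to shareholders 2023
```

In Pipeline Builder itself, equivalent cleanup would be done by chaining string transform boards on the extracted text column.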

(Optional) Semantic search workflow

If desired, you can continue with a semantic search workflow with your extracted text.

Part 4: Add an output

Now that we have finished extracting text from our PDFs and, optionally, running additional string transformations, we can add an output. For this tutorial, we will add an object output.

  1. In the Transforms node where you have completed your transformations, select Add output.

    Add output from media set transformation

  2. Select New object type.

    Add new object type

  3. Name your object type and set the Ontology by choosing Please select an ontology.

    Rename and set ontology output

  4. Select Edit to adjust any column mappings. Ensure that you choose a valid column for the primary key.

    Edit column mapping
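
Step 4 asks for a valid primary key column. Assuming "valid" means that every row has a value and no value repeats (the usual requirement for a primary key), the check can be sketched in plain Python as follows. The helper name `primary_key_problems`, the column names `media_item_rid` and `path`, and the sample rows are all hypothetical.

```python
from collections import Counter


def primary_key_problems(rows: list[dict], column: str) -> list[str]:
    """List reasons why `column` would be a poor primary key (illustrative)."""
    problems = []
    values = [row.get(column) for row in rows]

    # A primary key needs a value in every row.
    missing = sum(1 for v in values if v is None or v == "")
    if missing:
        problems.append(f"{missing} row(s) have no value in '{column}'")

    # A primary key must not repeat across rows.
    counts = Counter(v for v in values if v not in (None, ""))
    duplicated = [v for v, n in counts.items() if n > 1]
    if duplicated:
        problems.append(f"{len(duplicated)} value(s) in '{column}' are duplicated")

    return problems


# Hypothetical rows: two media items that happen to share a file path.
rows = [
    {"media_item_rid": "ri.mio.main.media-item.aaa", "path": "letter.pdf"},
    {"media_item_rid": "ri.mio.main.media-item.bbb", "path": "letter.pdf"},
]
print(primary_key_problems(rows, "media_item_rid"))
print(primary_key_problems(rows, "path"))
```

Under these assumptions, a per-item identifier such as the media item RID is a safer primary key candidate than a path, which can repeat unless you chose to deduplicate by path earlier.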

Part 5: Build the pipeline

  1. To build your pipeline, select Save, then Deploy > Deploy pipeline.

    Screenshot of scheme-filled dataset output pane

  2. You should see Initializing deployment under the Deploy Pipeline sidebar option.

    Initializing deployment

  3. Select View deployment history to track the progress of your deployment. This opens the History tab of your pipeline, where you can view the status and history of your deployments:

    Deployment in progress

    Deployment complete

(Optional) Part 6: North of the Ontology

Once deployment has completed and your object is initialized, you can act directly on your object output. Select Create Workshop module to generate a Workshop module with your pipeline output.

Create Workshop module

With this last step, we have generated our pipeline output and created a Workshop module from it.