In this tutorial, we will use Pipeline Builder to create a simple pipeline with media sets to extract text from PDF.
For this example, we use PDFs of publicly-available documents published by Palantir.
At the end of this tutorial, you will have a pipeline that looks like the following:
The pipeline will produce a new Object output of the extracted PDF text, which can be used for further exploration.
First, we need to create a new pipeline.
When logged into Foundry, access Pipeline Builder from the left navigation bar. If Pipeline Builder is not in the list of applications, select View all and find Pipeline Builder under the Build & Monitor Pipelines section.
Next, on the top right of the Pipeline Builder landing page, create a new pipeline by selecting New pipeline. Then, choose Batch pipeline.
The ability to create a streaming pipeline is not available on all Foundry environments. Contact your Palantir representative for more information if your use case requires it.
Select a location to save your pipeline. Note that pipelines cannot be saved in personal folders.
Choose Create pipeline.
Now we can add datasets to our pipeline workflow. For this tutorial, we will use PDFs of publicly-available documents from Palantir.
From the Pipeline Builder page, select Add Foundry data on the home page.
You can also select the Add data action on the top panel.
Alternatively, you can drag and drop a file from your computer to use as your media set.
If you selected Add data or Add Foundry data, you will be given the option to select your desired media sets.
When all media set(s) are selected, choose Add data.
When you have imported your media set you will be able to see the media set with thumbnail preview.
After adding raw media sets, we can perform some basic transformations. For this workflow we will be extracting the text from these PDF files.
First, we will transform the Media set of Annual Letters
media set. Choose the media references of the media items in the media set.
Choose the Media set of Annual Letters
node in your graph.
Select Transform.
Search for and select the Convert media set to table rows transform from the dropdown to open the board.
Select whether or not to include timestamp
and deduplicate by path
.
Choose Apply to add the transform to your pipeline.
Your output should look like this:
Example media reference:
{"mimeType":"application/pdf","reference":{"type":"mediaSetItem","mediaSetItem":{"mediaSetRid":"ri.mio.main.media-set.xxx","mediaItemRid":"ri.mio.main.media-item.xxx"}}}
Example media item RID:
ri.mio.main.media-item.xxx-xxx-xxx-xxx-xxxx
With media references, you can now select a new board that leverages media references. Search for and select the Text Extraction transform from the dropdown.
Select the extract method (Raw text
(PDF parsing) or OCR
), the Media Reference
column, OCR output format
(if you chose OCR), and Languages/Scripts
.
Choose Apply to add the transform to your pipeline.
Your output should look like this when you hover over the extracted text:
You can now run available string transformations on the extracted text column.
Select Back to graph at the top right to return to your pipeline graph.
If desired, you can continue with a semantic search workflow with your extracted text.
Now that we have finished extracting text from our PDFs and potentially running extra string transformations, we can add an output. For this tutorial, we will add an object output.
In the Transforms
node where you have completed your transformations, select Add output.
Select New object type.
Name your object type and set the Ontology by choosing Please select an ontology.
Select Edit and edit any column mapping. Ensure that you choose a valid column for the primary key.
To build your pipeline, make sure to select Save, then Deploy > Deploy pipeline.
You should see Intializing deployment
under the Deploy Pipeline
sidebar option.
Select View deployment history to track the progress of your deployment. You should be led to the History
tab in your pipeline where you can view the statuses and history of your deployments:
Once deployment has completed and your object is initialized, you should be able to directly action on your object output. Select Create Workshop module to generate a Workshop module with your pipeline output.
With this last step, we have generated our pipeline output and generated a Workshop module.