PDF handling

This page offers a basic guide for using Pipeline Builder to parse PDFs for semantic search and includes a recommendation for presenting the information in a Workshop app.

Semantic search works well with PDFs, particularly when the content is split into smaller "chunks" that are embedded separately. Searching at the chunk level helps users and workflows find important information that would otherwise be hard to locate, which matters given how much unstructured knowledge sits in PDFs and goes unused.

The workflow is: upload your PDFs to Foundry, extract the text, split the extracted text into chunks, embed the chunks, run semantic search over them, and surface the results next to the rendered source PDF so that users can validate each result against the original document.
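To make the search step concrete, the following is a minimal Python sketch of ranking chunks by cosine similarity to a query. It is an illustration only and not how Foundry implements semantic search: the `rank_chunks` function, the `embedding` field name, and the three-dimensional vectors are all hypothetical stand-ins for the output of a real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, chunks, top_k=3):
    """Return the top_k chunks whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query_vector, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


# Hypothetical chunks; real embeddings would come from an embedding model.
example_chunks = [
    {"id": "a", "text": "Warranty terms", "embedding": [1.0, 0.0, 0.0]},
    {"id": "b", "text": "Shipping policy", "embedding": [0.0, 1.0, 0.0]},
    {"id": "c", "text": "Warranty exclusions", "embedding": [0.9, 0.1, 0.0]},
]

top = rank_chunks([1.0, 0.0, 0.0], example_chunks, top_k=2)
print([chunk["id"] for chunk in top])
```

Because each chunk is scored independently, a single relevant paragraph can rank highly even when the rest of its PDF is unrelated to the query.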

Set up semantic search to search within PDFs

Follow the steps below to import PDFs and set up semantic search over their content:

  1. Import the PDFs as a media set.
  2. Add the media set to Pipeline Builder.
  3. Use the Get Media References board.
  4. Use the Text Extraction board.
  5. Follow a chunking strategy.
  6. Create the chunk objects with a media reference property.
  7. Search for the chunks as part of a semantic search workflow.
  8. Use the PDF Viewer widget in Workshop, noting the configuration options.
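As an illustration of the chunking and chunk-object steps, the sketch below splits extracted text into fixed-size, overlapping character windows and attaches the source PDF's media reference to every chunk. It is one possible strategy written in plain Python, not Pipeline Builder's implementation: the `Chunk` class, the `chunk_text` function, the default sizes, and the placeholder media reference string are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    media_reference: str  # hypothetical placeholder for the source PDF's media reference
    text: str


def chunk_text(text, media_reference, chunk_size=500, overlap=50):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + chunk_size]
        chunks.append(Chunk(f"{media_reference}-{index}", media_reference, piece))
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Tiny example so the windows are easy to inspect.
demo = chunk_text("abcdefghij", "example-media-ref", chunk_size=4, overlap=1)
print([chunk.text for chunk in demo])
```

The overlap keeps sentences that straddle a boundary present in both neighboring chunks. Carrying the media reference on each chunk is what allows a search result to be opened next to its source PDF.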