Extract text from PDF (using OCR)

Supported in: Batch

Run OCR on PDF files in a media set to extract text.

Expression categories: Media

Declared arguments

  • Languages to detect - Languages to detect in the input files.
    Set<Enum<Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Azerbaijani - Cyrillic, Basque, Belarusian, and more ...>>
  • Media reference - The column containing media references to PDF files in a media set.
    Expression<Media reference>
  • OCR output format - Format of each output string (plain text or hOCR). The output is an array of strings, with one entry per page of the PDF.
    Enum<Text, hOCR>
  • Scripts to detect - Scripts to detect in the input files.
    Set<Enum<Arabic, Armenian, Bengali, Canadian Aboriginal, Cherokee, Cyrillic, Devanagari, Ethiopic, Fraktur, Georgian, and more ...>>
  • optional End page - Page range end, inclusive. Defaults to the last page in the document. Supports negative indexing.
    Expression<Integer>
  • optional Error handling - Determines how the pipeline behaves for inputs that fail to process. Fails fast by default.
    Enum<Fail fast, NULL on error>
  • optional Start page - Page range start, inclusive. Defaults to the first page (1) in the document.
    Expression<Integer>

Output type: Array<String>
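
The page range arguments above can be illustrated with a short Python sketch. This is not the product's implementation: the function name resolve_page_range is hypothetical, and the reading of "negative indexing" (an End page of -1 means the last page, -2 the page before it) is an assumption, since the description does not spell it out.

```python
def resolve_page_range(page_count, start_page=None, end_page=None):
    """Return the 1-based, inclusive (first, last) pages to process.

    Hypothetical helper for illustration only.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")

    # Start page defaults to the first page (1).
    first = 1 if start_page is None else start_page

    # End page defaults to the last page in the document.
    last = page_count if end_page is None else end_page

    # Assumption: a negative End page counts back from the end,
    # so -1 is the last page and -2 is the second-to-last page.
    if last < 0:
        last = page_count + 1 + last

    if not (1 <= first <= last <= page_count):
        raise ValueError(
            f"invalid page range {first}..{last} for {page_count} pages"
        )
    return first, last
```

Under these assumptions, a 10-page document with Start page 3 and End page -2 would be processed from page 3 through page 9, producing seven entries in the output array (one per page).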

Examples

Example 1: Base case

Argument values:

  • Languages to detect: {ENG}
  • Media reference: mediaReference
  • OCR output format: {TEXT}
  • Scripts to detect: {ARABIC}
  • End page: null
  • Error handling: FAIL_FAST
  • Start page: null

mediaReference | Output
{"mimeType":"application/pdf","reference":{"type":"mediaSetItem","mediaSetItem":{"mediaSetRid":"ri.mio.main.media-set.a", "mediaItemRid":"ri.mio.main.media-item.a"}}} | [ This text came from the PDF document in the media set., So did this text. ]
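
The media reference in this example is a JSON object, so its fields can be inspected with any JSON parser. The Python sketch below reads the example value with the standard library; the function name describe_media_reference and the returned key names are hypothetical, and only the field names visible in the example are assumed.

```python
import json

# The media reference value from Example 1, copied verbatim.
MEDIA_REFERENCE = (
    '{"mimeType":"application/pdf","reference":{"type":"mediaSetItem",'
    '"mediaSetItem":{"mediaSetRid":"ri.mio.main.media-set.a", '
    '"mediaItemRid":"ri.mio.main.media-item.a"}}}'
)


def describe_media_reference(raw):
    """Pull the MIME type and resource identifiers out of a media reference."""
    ref = json.loads(raw)
    item = ref["reference"]["mediaSetItem"]
    return {
        "mime_type": ref["mimeType"],
        "media_set_rid": item["mediaSetRid"],
        "media_item_rid": item["mediaItemRid"],
    }


info = describe_media_reference(MEDIA_REFERENCE)
# A quick sanity check before running OCR: the item should be a PDF.
is_pdf = info["mime_type"] == "application/pdf"
```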