Use media sets with Python transforms

You can use media sets in Python transforms for PDF text extraction, optical character recognition (OCR), image tiling, metadata parsing, and more. The following sections explain how to set up media sets in your Python repository and how to read from and write to media sets with Python transforms.

Media transformations are currently not supported by the Preview functionality in Code Repositories. Transforms that use media sets can be built but not previewed.

Import the transforms-media library into your repository

To use decorators specific to media sets, you first need to import the transforms-media library into your repository. You can do this by navigating to the Libraries file drawer on the left side of the Code Repositories interface. Search for transforms-media, then install the library.

[Screenshot: Add a dependency on transforms-media in your code repository.]

You must use the @transform decorator when working with media sets. Media set inputs and outputs can be passed in using transforms.mediasets.MediaSetInput and transforms.mediasets.MediaSetOutput specifications. During a build, these specifications are resolved into transforms.mediasets.MediaSetInputParam and transforms.mediasets.MediaSetOutputParam objects, respectively. These MediaSetInputParam and MediaSetOutputParam objects provide access to the media set within the compute function. Any number of media set inputs or outputs can be used in combination with any other valid transform inputs and outputs (such as tabular datasets). For example:

    from transforms.api import transform
    from transforms.mediasets import MediaSetInput, MediaSetOutput


    @transform(
        images=MediaSetInput('/examples/images'),
        output_images=MediaSetOutput('/examples/output_images')
    )
    def translate_images(images, output_images):
        ...

Read from media sets

You can access individual media items either by the file path or RID:

    from transforms.api import transform
    from transforms.mediasets import MediaSetInput, MediaSetOutput


    @transform(
        images=MediaSetInput('/examples/images'),
        output_images=MediaSetOutput('/examples/output_images')
    )
    def translate_images(images, output_images):
        # Look up one item by its path and another by its RID
        image1 = images.get_media_item_by_path("image1")
        image2 = images.get_media_item("ri.mio.main.media-item.123")
        ...

However, you will likely want to transform all the items in your media set. To do this, you must first pull the items into a dataframe using a listing method. In the example below, we list all items in the input media set and write the resulting dataframe to a tabular output:

    from transforms.api import transform, Output
    from transforms.mediasets import MediaSetInput


    @transform(
        images=MediaSetInput('/examples/images'),
        listing_output=Output('/examples/listed_images')
    )
    def translate_images(ctx, images, listing_output):
        media_items_listing = images.list_media_items_by_path_with_media_reference(ctx)

        # You can perform regular PySpark transformations on media_items_listing

        column_typeclasses = {'mediaReference': [{'kind': 'reference', 'name': 'media_reference'}]}
        listing_output.write_dataframe(media_items_listing, column_typeclasses=column_typeclasses)

If multiple items in the media set are at a particular path, only the most recent will be included in the listing. The listing will have the following schema:

+--------------------------+-----------+-------------------+
|        mediaItemRid      |    path   |  mediaReference   |
+--------------------------+-----------+-------------------+
| ri.mio.main.media-item.1 | item1.jpg |  {{reference1}}   |
| ri.mio.main.media-item.2 | item2.jpg |  {{reference2}}   |
| ri.mio.main.media-item.3 | item3.jpg |  {{reference3}}   |
+--------------------------+-----------+-------------------+

Note that the above example only shows the top three rows of the listing.

By setting the typeclass of the mediaReference column, we allow the column to be read as a media reference.

Calls to get_media_item(), get_media_item_by_path(), and so on return a Python file-like stream object. All options accepted by io.open() are also supported. Note that items are read as streams, meaning that random access is not supported.
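Because the streams do not support random access, a library that needs to seek within a file has to work on a buffered copy. The following is a minimal sketch of that idea using only the standard library. `ForwardOnlyStream` is a hypothetical stand-in for a media item stream so that the snippet runs outside a transform, and `buffer_stream` is an illustrative helper, not part of the media set API.

```python
import io


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a media item stream: readable, not seekable."""

    def __init__(self, payload: bytes):
        self._inner = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        data = self._inner.read(len(buffer))
        buffer[: len(data)] = data
        return len(data)


def buffer_stream(stream, chunk_size: int = 64 * 1024) -> io.BytesIO:
    """Copy a forward-only stream into memory so it can be seeked and re-read."""
    buffered = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffered.write(chunk)
    buffered.seek(0)
    return buffered


stream = ForwardOnlyStream(b"header|body|footer")
in_memory = buffer_stream(stream)

# Random access works on the buffered copy, not on the original stream
in_memory.seek(7)
body = in_memory.read(4)
```

Buffering holds the whole item in memory, so this approach is best suited to items that comfortably fit in the memory available to the transform.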

You can also return metadata about individual media items without downloading the full item. The metadata will include information such as the dimensions for images, length for audio, and more. For a full reference of available metadata, see the appendix below. The example below adds a column to the media item listing with the metadata for each image.

    from transforms.api import transform, Output
    from transforms.mediasets import MediaSetInput
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType
    from conjure_python_client import ConjureEncoder


    @transform(
        images=MediaSetInput('/examples/images'),
        listing_output_with_metadata=Output('/examples/listed_images_with_metadata')
    )
    def translate_images(ctx, images, listing_output_with_metadata):

        def get_metadata(media_item_rid):
            # Fetch the metadata only; the item itself is not downloaded
            metadata = images.get_media_item_metadata(media_item_rid)
            return ConjureEncoder().default(metadata)

        metadata_udf = F.udf(get_metadata, StringType())

        media_items_listing = images.list_media_items_by_path_with_media_reference(ctx)

        listing_with_metadata = media_items_listing.withColumn('metadata', metadata_udf(F.col('mediaItemRid')))

        column_typeclasses = {'mediaReference': [{'kind': 'reference', 'name': 'media_reference'}]}
        listing_output_with_metadata.write_dataframe(listing_with_metadata, column_typeclasses=column_typeclasses)

Media sets support a number of built-in transformations. See the appendix below for the API and the list of supported transformations. Calls to these transformations also return a Python file-like stream object. To use a built-in transformation, call the appropriate method on the media set input. For example:

    from transforms.api import transform, Output
    from transforms.mediasets import MediaSetInput
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType


    @transform(
        images=MediaSetInput('/examples/images'),
        image_text_output=Output('/examples/listed_images_with_text')
    )
    def translate_images(ctx, images, image_text_output):

        def get_ocr_on_image(media_item_rid):
            # The transformation returns a stream, so read and decode it to get a string
            return images.transform_image_to_text_ocr_output_text(media_item_rid).read().decode('utf-8')

        ocr_on_image_udf = F.udf(get_ocr_on_image, StringType())

        media_items_listing = images.list_media_items_by_path_with_media_reference(ctx)

        listing_with_ocr = media_items_listing.withColumn('text', ocr_on_image_udf(F.col('mediaItemRid')))

        column_typeclasses = {'mediaReference': [{'kind': 'reference', 'name': 'media_reference'}]}
        image_text_output.write_dataframe(listing_with_ocr, column_typeclasses=column_typeclasses)

Create a media set

You can create a media set to be used as an output of your Python transform directly within Code Repositories.

First, choose an output location and a name for the new media set in the MediaSetOutput specification.

Select the path you have just defined, which at this point should be underlined in red; on hover, you should see an error message indicating the media set does not exist.

From the lightbulb icon on the left side of the line, select Create media set.

[Screenshot: Create media set output prompt]

Go through the dialog steps to choose the desired media set schema and complete any other required configuration on your new media set.

[Screenshot: Create media set dialog]

After selecting Create, the MediaSetOutput specification will be populated with the details you've provided. These annotation fields define how the media set will be created.

[Screenshot: Create media set annotations]

The new media set will be created after the Python transform has been built for the first time, after which the annotation fields should not be edited.

Write to media sets

Media sets can be used as outputs to transformations by using the MediaSetOutput specification.

To upload an item, call the put_media_item() method on the output media set. This method accepts any file-like object and a path that identifies the item in the output media set. The following is a basic example:

    from transforms.api import transform
    from transforms.mediasets import MediaSetInput, MediaSetOutput


    @transform(
        images=MediaSetInput('/examples/images'),
        output_images=MediaSetOutput('/examples/output_images')
    )
    def upload_images(images, output_images):
        # Stream the input item directly into the output media set under a new path
        with images.get_media_item_by_path("image1.jpg") as input_image:
            output_images.put_media_item(input_image, "copied_image1.jpg")
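Because put_media_item() accepts any file-like object, content produced in memory can be wrapped in `io.BytesIO` before it is uploaded. The sketch below shows only that wrapping step. `RecordingOutput` is a hypothetical stand-in that stores uploads in a dictionary so the snippet runs outside a transform, and `upload_bytes` is an illustrative helper rather than part of the media set API.

```python
import io


class RecordingOutput:
    """Hypothetical stand-in for a media set output; keeps uploads in a dict."""

    def __init__(self):
        self.items = {}

    def put_media_item(self, file_like, path):
        self.items[path] = file_like.read()


def upload_bytes(output, payload: bytes, path: str) -> None:
    """Wrap raw bytes in a file-like object before handing them to the output."""
    with io.BytesIO(payload) as buffer:
        output.put_media_item(buffer, path)


output = RecordingOutput()
upload_bytes(output, b"%PDF-1.7 example bytes", "generated/report.pdf")
```

In a transform, the same helper could be called with the media set output parameter in place of `RecordingOutput`.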

When copying items from one media set to another, you can use the fast_copy_media_item() method on the output. This is a faster and more efficient option than downloading and re-uploading the media item:

    from transforms.api import transform
    from transforms.mediasets import MediaSetInput, MediaSetOutput


    @transform(
        images=MediaSetInput('/examples/images'),
        output_images=MediaSetOutput('/examples/output_images')
    )
    def upload_images(images, output_images):
        # Resolve the RID of the source item, then copy it without downloading it
        origin_media_item_rid = images.get_media_item_rid_by_path("image1.jpg").item
        output_images.fast_copy_media_item(images, origin_media_item_rid, "fast_copied_image1.jpg")

Items can be uploaded to media sets in user-defined functions (UDFs) for higher parallelism. In the example below, we transform the PDFs in the input media set into JPEGs using the built-in PDF to JPEG transformation and upload those JPEGs to a new output media set. We then write out a tabular dataset containing the media references of those uploaded JPEGs:

    from transforms.api import transform, Output
    from transforms.mediasets import MediaSetInput, MediaSetOutput
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType


    @transform(
        pdfs=MediaSetInput('/examples/PDFs'),
        output_images=MediaSetOutput('/examples/JPEGs'),
        output_references=Output('/examples/JPEG listing')
    )
    def upload_images(ctx, pdfs, output_images, output_references):

        def upload_jpg(media_item_rid, path):
            # Convert the first page (page 0) to a JPEG and upload it
            with pdfs.transform_document_to_jpg(media_item_rid, 0) as jpeg:
                response = output_images.put_media_item(jpeg, path)
            return response.media_item_rid

        upload_udf = F.udf(upload_jpg, StringType())

        listed_pdfs = pdfs.list_media_items_by_path(ctx)

        media_reference_template = output_images.media_reference_template()

        # Upload in the UDF, then build a media reference for each uploaded item
        uploaded_jpegs = (
            listed_pdfs
            .withColumn('uploaded_media_item_rid', upload_udf(F.col('mediaItemRid'), F.col('path')))
            .select('path', 'uploaded_media_item_rid')
            .withColumn("mediaReference", F.format_string(media_reference_template, 'uploaded_media_item_rid'))
        )

        column_typeclasses = {'mediaReference': [{'kind': 'reference', 'name': 'media_reference'}]}
        output_references.write_dataframe(uploaded_jpegs, column_typeclasses=column_typeclasses)

Media set write modes

Media sets can be written to using one of two write modes:

  • modify: Uploaded items will be added in addition to the existing items in the media set branch.
  • replace: Uploaded items will fully replace the media set branch.

The default write mode depends on the transaction policy of the media set. Transactional media sets default to replace. Transactionless media sets always use the modify write mode; this cannot be changed, because branches in transactionless media sets cannot be reset to empty.
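The rules above can be summarized in a few lines of plain Python. This is only an illustrative sketch of the decision logic as described here; `resolve_write_mode` and its `transactional` flag are hypothetical names, and the actual SDK may report an unsupported mode differently.

```python
VALID_WRITE_MODES = {"modify", "replace"}


def default_write_mode(transactional: bool) -> str:
    """Transactional media sets default to replace; transactionless ones use modify."""
    return "replace" if transactional else "modify"


def resolve_write_mode(transactional: bool, requested=None) -> str:
    """Return the effective write mode, rejecting combinations the text rules out."""
    if requested is None:
        return default_write_mode(transactional)
    if requested not in VALID_WRITE_MODES:
        raise ValueError(f"unknown write mode: {requested!r}")
    if not transactional and requested != "modify":
        # Branches in transactionless media sets cannot be reset to empty
        raise ValueError("transactionless media sets only support 'modify'")
    return requested


transactional_default = resolve_write_mode(transactional=True)
transactionless_default = resolve_write_mode(transactional=False)
overridden = resolve_write_mode(transactional=True, requested="modify")
```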

The write mode of a media set output can be changed dynamically at runtime. This can be helpful in scenarios where the decision to fully replace an output is based on custom criteria in your pipeline.

To change the write mode of a media set, you can use the .set_write_mode() method on the media set output. The write mode can be changed at any point up until an item is uploaded to the output. For example:

    from transforms.api import transform, Input
    from transforms.mediasets import MediaSetOutput


    @transform(
        input_PNGs=Input('/examples/input_PNGs'),
        output_PNGs=MediaSetOutput('/examples/output_PNGs'),
    )
    def upload_pngs(input_PNGs, output_PNGs):
        # should_replace() is a user-defined function (not shown) that holds your custom criteria
        if should_replace(input_PNGs):
            output_PNGs.set_write_mode("replace")
        else:
            output_PNGs.set_write_mode("modify")

        # The write mode must be set before the first item is uploaded
        output_PNGs.put_dataset_files(input_PNGs)

Upload from a filesystem (Catalog) dataset

The Python media set SDK has built-in tooling to upload the files from a conventional dataset in the Palantir filesystem (known as the Catalog) into a media set. For example:

    from transforms.api import transform, Input
    from transforms.mediasets import MediaSetOutput


    @transform(
        pdfs_dataset=Input('/examples/PDFs'),
        pdfs_media_set=MediaSetOutput('/examples/PDFs mediaset')
    )
    def upload_to_media_set(pdfs_dataset, pdfs_media_set):
        pdfs_media_set.put_dataset_files(pdfs_dataset, ignore_items_not_matching_schema=False)

This transform uploads all items from the dataset into the media set. If any item does not match the schema of the media set (for example, a JPEG in a dataset that is uploaded to a PDF media set), the build will fail. If you set ignore_items_not_matching_schema=True, such items are skipped instead.
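If you would rather know in advance which files are likely to be rejected, you can inspect the file paths before uploading. The sketch below partitions paths by file extension using only the standard library. `split_by_extension` is a hypothetical helper, and a file extension is only a rough proxy for whether an item matches the media set schema.

```python
import os


def split_by_extension(paths, allowed_extensions=(".pdf",)):
    """Partition paths into those with an allowed extension and the rest."""
    allowed = {extension.lower() for extension in allowed_extensions}
    matching, skipped = [], []
    for path in paths:
        extension = os.path.splitext(path)[1].lower()
        if extension in allowed:
            matching.append(path)
        else:
            skipped.append(path)
    return matching, skipped


matching, skipped = split_by_extension(["a/report.PDF", "a/photo.jpeg", "notes.pdf"])
```

The `skipped` list could be logged or written to a separate output so that mismatched files remain visible rather than being ignored silently.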

Files can alternatively be uploaded one by one. For example:

    import os

    from transforms.api import transform, Input
    from transforms.mediasets import MediaSetOutput


    @transform(
        output_mediaset=MediaSetOutput("/path/mediaset_output", should_snapshot=False),
        input_dataset=Input("/path/dataset_of_raw_files"),
    )
    def compute(input_dataset, output_mediaset):
        all_files = list(input_dataset.filesystem().ls())
        for current_file in all_files:
            # Open each raw file in binary mode and upload it under its base name
            with input_dataset.filesystem().open(current_file.path, 'rb') as f:
                filename = os.path.basename(current_file.path)
                output_mediaset.put_media_item(f, filename)

Extract layout-aware content from a document

When working with media sets, you can use a transform to extract content from a document, such as paragraphs, headers, and tables, along with additional metadata about the layout of this content. This extraction transform can be run on both PDF and image media sets.

For particularly complex or obscure documents, using the model to extract bounding boxes and then passing them to a vision model may yield better results.

To run this extraction in your transform, the Document Information Extraction model must be available on your enrollment. You can check whether the Document Information Extraction model is available by searching for it in the Model Catalog. Contact a Palantir representative if the Document Information Extraction model is unavailable and you would like to use it.

The output will be an array of "block" structs, which correspond to areas of the document. Each "block" will have a type, confidence, ID, bounding box, extracted text, extracted table in HTML (if applicable), the page number, and language information.
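Once the blocks are available as Python dictionaries, they can be filtered with ordinary list operations. The following is a minimal sketch that keeps only confident text blocks and joins their text in page order. The key names (`type`, `confidence`, `page`, `text`), the block type values, and the sample data are all assumptions made for illustration; check the actual response for the real field names.

```python
def select_text(blocks, wanted_types=("paragraph", "header"), min_confidence=0.8):
    """Join the text of confident blocks of the wanted types, ordered by page."""
    selected = [
        block for block in blocks
        if block.get("type") in wanted_types
        and block.get("confidence", 0.0) >= min_confidence
    ]
    # sorted() is stable, so blocks on the same page keep their original order
    selected = sorted(selected, key=lambda block: block.get("page", 0))
    return "\n".join(block.get("text", "") for block in selected)


# Hand-written sample with hypothetical field names, not real model output
sample_blocks = [
    {"type": "header", "confidence": 0.97, "page": 0, "text": "Quarterly report"},
    {"type": "table", "confidence": 0.97, "page": 0, "text": ""},
    {"type": "paragraph", "confidence": 0.42, "page": 1, "text": "smudged"},
    {"type": "paragraph", "confidence": 0.91, "page": 1, "text": "Revenue grew."},
]

text = select_text(sample_blocks)
```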

The following is an example Python transform that extracts layout-aware content from a PDF media set:

    from transforms.api import transform, Output
    from transforms.mediasets import MediaSetInput, MediaSetInputParam
    from pyspark.sql.functions import udf


    @transform(
        output=Output("ri.foundry.main.dataset.0-1-2-3-4"),
        media_input=MediaSetInput("/Foundry/My Media/PDF media set"),
    )
    def compute(media_input: MediaSetInputParam, output):

        def extract_all_text(media_item_rid):
            metadata = media_input.get_media_item_metadata(media_item_rid)
            pages = metadata.document.pages
            if pages is None:
                return ""

            text = ""
            # Run the extraction once per page and concatenate the responses
            for page in range(pages):
                response = media_input.transform_media_item(
                    media_item_rid,
                    str(page),
                    {
                        "type": "documentToText",
                        "documentToText": {
                            "operation": {
                                "type": "extractLayoutAwareContent",
                                "extractLayoutAwareContent": {
                                    "parameters": {
                                        "languages": ["ENG"]
                                    }
                                }
                            }
                        }
                    }
                )
                text += str(response.json())
            return text

        extract_text_udf = udf(extract_all_text)
        result = media_input.dataframe().withColumn("text", extract_text_udf("mediaItemRid"))
        column_typeclasses = {
            "mediaReference": [{"kind": "reference", "name": "media_reference"}]
        }
        output.write_dataframe(result, column_typeclasses=column_typeclasses)

We recommend using parallelism for optimal performance if you are running this extraction transform on many documents or on documents with many pages.
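One way to expose more parallelism is to treat each page, rather than each document, as a unit of work. The sketch below expands page counts into per-page tasks and runs them concurrently. The thread pool is only a local stand-in so the snippet runs anywhere, and `extract_page` is a hypothetical placeholder for a per-page extraction call. In a Spark transform, the same per-page tasks could instead become rows of a dataframe that a UDF processes.

```python
from concurrent.futures import ThreadPoolExecutor


def page_tasks(page_counts):
    """Expand {media_item_rid: page_count} into one (rid, page) task per page."""
    return [
        (rid, page)
        for rid, pages in page_counts.items()
        for page in range(pages)
    ]


def extract_page(task):
    """Hypothetical placeholder for a per-page extraction call."""
    rid, page = task
    return f"{rid}#page={page}"


def run_in_parallel(tasks, worker, max_workers=4):
    """Run the worker over all tasks; Executor.map returns results in task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, tasks))


tasks = page_tasks({"ri.mio.main.media-item.1": 2, "ri.mio.main.media-item.2": 1})
results = run_in_parallel(tasks, extract_page)
```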

More information about rate limits, regional availability, and usage rates can be found in the documentation.

Lightweight support

Media sets can also be used in lightweight transforms. The API is the same as for standard Python transforms, except that the dataframes returned when listing items come from common single-node computation libraries such as pandas. The following example shows a lightweight transform that lists all items in the input media set and writes the resulting dataframe to a tabular output:

    from transforms.api import transform, Output, lightweight, LightweightOutput
    from transforms.mediasets import MediaSetInput, LightweightMediaSetInputParam


    @lightweight
    @transform(
        images=MediaSetInput('/examples/images'),
        listing_output=Output('/examples/listed_images')
    )
    def translate_images(images: LightweightMediaSetInputParam, listing_output: LightweightOutput):
        media_items_listing = images.list_media_items_by_path_with_media_reference().pandas()

        # You can perform regular pandas transformations on media_items_listing

        listing_output.write_table(media_items_listing)

Reference: Built-in transformations

Transform a PDF document to JPEG

Transform an individual page of a PDF document into a JPEG and return it.

  • Operates on: PDF documents

  • Returns: JPEG image

  • Parameters:

    • media_item_rid: The RID of the media item to be transformed.
    • page_number: The zero-indexed page number.
    • Height (optional): The desired height of the output image, in pixels.
    • Width (optional): The desired width of the output image, in pixels.
  • Example:

    input_pdfs.transform_document_to_jpg("ri.mio.main.media-item.1", 0)

Transform a PDF document to PNG

Transform an individual page of a PDF document into a PNG and return it.

  • Operates on: PDF documents

  • Returns: PNG image

  • Parameters:

    • media_item_rid: The RID of the media item to be transformed.
    • page_number: The zero-indexed page number.
    • Height (optional): The desired height of the output image, in pixels.
    • Width (optional): The desired width of the output image, in pixels.
  • Example:

    input_pdfs.transform_document_to_png("ri.mio.main.media-item.1", 0)

Transform a PDF document to text by page with Optical Character Recognition (OCR)

Run OCR on an individual page of a PDF document and return the text. This transform uses traditional OCR, as opposed to AI-powered OCR, which uses a vision language model to perform the extraction. Learn more about using a vision language model to extract PDF document content.

  • Operates on: PDF documents

  • Returns: Unstructured text in UTF-8 encoding

  • Parameters:

    • media_item_rid: The RID of the media item to be transformed.
    • page_number: The zero-indexed page number.
    • Language (optional): Language code, defaulted to ENG.
  • Example:

    raw_output = input_pdfs.transform_document_to_text_ocr_output_text("ri.mio.main.media-item.1", 0)
    doc_text_ocr = raw_output.read().decode("utf-8")
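Since the transformation works one page at a time, extracting the text of a whole document means looping over its pages and decoding each stream. The following is a minimal sketch of that loop. `read_all_pages` is an illustrative helper, and `fake_transform_page` is a hypothetical stand-in that returns an in-memory stream so the snippet runs outside a transform.

```python
import io


def read_all_pages(transform_page, media_item_rid, page_count, separator="\n\n"):
    """Call a per-page transformation for every page and join the decoded text."""
    texts = []
    for page_number in range(page_count):
        # Each call returns a file-like stream of UTF-8 encoded text
        with transform_page(media_item_rid, page_number) as stream:
            texts.append(stream.read().decode("utf-8"))
    return separator.join(texts)


def fake_transform_page(media_item_rid, page_number):
    """Hypothetical stand-in for a per-page transformation method."""
    return io.BytesIO(f"text of page {page_number}".encode("utf-8"))


document_text = read_all_pages(fake_transform_page, "ri.mio.main.media-item.1", 3)
```

In a transform, `transform_page` could be a bound method such as `input_pdfs.transform_document_to_text_ocr_output_text`, with the page count taken from the media item metadata.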

Transform a PDF document to hOCR XML by page with OCR

Run OCR on an individual page of a PDF document and return the result as hOCR XML. Learn more about hOCR.

  • Operates on: PDF documents

  • Returns: hOCR XML in UTF-8 encoding

  • Parameters:

    • media_item_rid: The RID of the media item to be transformed.
    • page_number: The zero-indexed page number.
    • Language (optional): Language code, defaulted to ENG.
  • Example:

    input_pdfs.transform_document_to_text_ocr_output_hocr("ri.mio.main.media-item.1", 0)
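hOCR stores each recognized word in an element with the class `ocrx_word` and records its bounding box in the `title` attribute. The sketch below pulls words and bounding boxes out of an hOCR string using only the standard library. `HocrWordParser` is an illustrative helper, the sample markup is hand-written rather than real output from the transformation, and the parser assumes that word elements are `span` tags that contain no nested `span` tags.

```python
import re
from html.parser import HTMLParser

BBOX_PATTERN = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from span elements with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_PATTERN.search(attributes.get("title") or "")
            self._bbox = tuple(int(value) for value in match.groups()) if match else None
            self._text = []
            self._in_word = True

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


# Hand-written sample in hOCR style
sample_hocr = (
    "<div class='ocr_page' title='bbox 0 0 200 100'>"
    "<span class='ocr_line' title='bbox 10 10 190 30'>"
    "<span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 96'>Hello</span> "
    "<span class='ocrx_word' title='bbox 70 10 130 30; x_wconf 91'>world</span>"
    "</span></div>"
)

parser = HocrWordParser()
parser.feed(sample_hocr)
words = parser.words
```

In a transform, the string passed to `feed()` would come from reading and decoding the stream returned by the hOCR transformation.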

Transform a PDF document to extract raw text

Extract the raw text from an individual page of a PDF document and return it. Unlike OCR, this is a parsing method that does not require image processing.

  • Operates on: PDF documents

  • Returns: Unstructured text in UTF-8 encoding

  • Parameters:

    • media_item_rid: The RID of the media item to be transformed.
    • page_number: The zero-indexed page number.
  • Example:

    raw_output = input_pdfs.transform_document_to_text_raw("ri.mio.main.media-item.1", 0)
    doc_text_extraction = raw_output.read().decode("utf-8")

Transform a PDF document to extract form fields

Extract all form fields from the whole PDF and return them.

  • Operates on: PDF documents

  • Returns: JSON

  • Parameters:

    • media_item_rid: The RID of the media item to be transformed.
  • Example:

    input_pdfs.transform_document_to_text_extract_field("ri.mio.main.media-item.1")

Transform a PDF document to extract table of contents

Extract the table of contents from the PDF and return it.

  • Operates on: PDF documents

  • Returns: JSON

  • Parameters:

    • media_item_rid: The RID of the media item to be transformed.
  • Example:

    input_pdfs.transform_document_to_text_extract_table_of_contents("ri.mio.main.media-item.1")

Transform image into hOCR XML

Run OCR on an image and return the result as hOCR XML. Learn more about hOCR.

  • Operates on: Image

  • Returns: hOCR XML in UTF-8 encoding

  • Parameters:

    • media_item_rid: The RID of the media item to be transformed.
  • Example:

    input_images.transform_image_to_text_ocr_output_hocr("ri.mio.main.media-item.1")

Transform image into text

Run OCR on an image and return the text.

  • Operates on: Image

  • Returns: Unstructured text in UTF-8 encoding

  • Parameters:

    • media_item_rid: The RID of the media item to be transformed.
  • Example:

Copied!
1 input_images.transform_image_to_text_ocr_output_text("ri.mio.main.media-item.1")

Transcribe audio to text

Transcribe an audio file that contains speech into text.

  • Operates on: Audio

  • Returns: Unstructured plain text file with transcription

  • Parameters:

    • media_item_rid: The RID of the media item to be transformed.
    • Language (optional): Language code to transcribe audio to. If left empty, the language will be inferred from the beginning of the audio file. Use either the ISO 639 Set 1 code or the ISO language name; see https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes
  • Examples:

    input_audio_files.transcribe("ri.mio.main.media-item.1")
    input_audio_files.transcribe("ri.mio.main.media-item.1", TranscriptionLanguage.ARABIC)