Media sets (unstructured data)

A media set is a collection of media files with a common schema, for example, files of the same format. Media sets are designed to work with high-scale, unstructured data and enable the processing of media items such as audio, imagery, video, and documents. Media sets enable access to flexible storage, compute optimizations, and schema-specific transformations to enhance media workflows and pipelines.

Media sets support the import of audio, imagery, video, and documents.

Example media set workflows include:

  • Enabling content analysis by extracting text from PDFs with Pipeline Builder
  • Performing geospatial analysis with raster tiling (TIFF, NITF) in the Map application
  • Processing medical imaging files (DICOM format) with Pipeline Builder

To begin building a pipeline, follow these steps:

Supported media set file types

The following file types are supported as media sets:

  • Audio
    • WAV (.wav)
    • MP3 (.mp3)
    • NIST SPHERE (.sph)
    • FLAC (.flac)
    • OGG (.opus, .ogg)
    • WEBM (.webm)
  • Document
    • PDF (.pdf)*
    • DOCX (.docx)
  • Image
    • PNG (.png)
    • JPEG (.jpg, .jpeg)
    • JP2K (.jp2)
    • BMP (.bmp)
    • TIFF (.tiff, .tif)
    • NITF (.nitf)
    • DICOM (.dcm)
  • Video
    • MKV (.mkv)
    • MP4 (.mp4)
    • MOV (.mov)
    • TS (.ts)
PDF support

PDF files that require proprietary features to view or are protected by passwords, digital signatures, or encryption are not supported.
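
As a minimal sketch, the list above can be turned into a lookup that checks a file name before upload. The names `SUPPORTED_EXTENSIONS` and `media_type_for` are illustrative and not part of any Foundry API. The check looks only at the file extension, so it does not detect password-protected or encrypted PDFs.

```python
from pathlib import PurePath
from typing import Optional

# Extensions copied from the list above; the grouping follows the documentation.
SUPPORTED_EXTENSIONS = {
    "Audio": {".wav", ".mp3", ".sph", ".flac", ".opus", ".ogg", ".webm"},
    "Document": {".pdf", ".docx"},
    "Image": {".png", ".jpg", ".jpeg", ".jp2", ".bmp", ".tiff", ".tif", ".nitf", ".dcm"},
    "Video": {".mkv", ".mp4", ".mov", ".ts"},
}


def media_type_for(filename: str) -> Optional[str]:
    """Return the media type whose extension list contains this file, or None."""
    suffix = PurePath(filename).suffix.lower()  # case-insensitive match
    for media_type, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return media_type
    return None
```

A helper like this could be used to sort a local folder into one media set per type before uploading, since each media set accepts only the file type chosen at creation.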

Import media

Media sets can be configured for import through a direct upload, a connection to an external source system in Data Connection, API posts, and transforms (including external transforms).

Direct upload

To import media files to a media set through a direct upload, drag and drop the files into your new media set. Files can only be uploaded if they match the file type that was specified when the media set was created.

  1. First, create a new media set by selecting New within a Project and selecting media set from the search bar as shown below.

Create a media set from a Project

  2. Next, choose the desired media file type for the new media set and select Create media set.

Choose a media file type

  3. Once you have created a media set, you can upload media via drag-and-drop onto the empty media set or by selecting the choose from your computer prompt.

Upload from an empty media set

Data Connection

Media sets can be imported using a sync to an external source through Data Connection. A detailed walk-through can be found in the media set sync documentation.

To create a new media set sync, navigate to the Overview tab of the desired source.

After you create the sync, trigger a build in the media set view for the media to appear in your media set.

You can also connect an existing source to a new media set via the Select a source option.

Add existing sources into a media set

Virtual storage

For supported source types, media sets can optionally be configured to read directly from the external source system so no data is copied into Foundry’s backing store ("virtual media sets").

Currently, virtual media sets are only supported for certain source types. If you are interested in virtual storage for other source types for your use case, reach out to Palantir support.

storage policy

External transforms

For sources with REST APIs, you can import media to a media set through external transforms.

Pipeline Builder

Media sets can also be directly imported into Pipeline Builder. Learn more about available upload methods in Pipeline Builder.

Retention policies

You can configure a time-based retention policy for a media set, for example 14 days, for data that does not need to exist forever. Media items will only be accessible for the retention window, after which they will be permanently deleted. This is a helpful option to minimize storage costs.

Once a media item's retention window expires, it will never become accessible again, and will be deleted. For example:

  • When a retention window is reduced, such as from 30 days to 7 days, all media items that are older than the new window (7 days) will immediately become inaccessible.
  • When a retention window is expanded, such as from 7 days to 30 days, media items that had already expired under the previous window (for example, an item that was 7 days and 1 second old) will not become accessible again. The same is true if retention is changed to "forever".
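
The two cases above can be sketched as a small toy model. This is an illustration of the described semantics only, not how Foundry implements retention; the class `ToyRetention` and its methods are hypothetical names, and a window of `None` stands in for "forever".

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Set


class ToyRetention:
    """Toy model: once an item outlives the active window, it stays expired."""

    def __init__(self, window: Optional[timedelta]) -> None:
        self.window = window  # None models a "forever" retention policy
        self._uploaded: Dict[str, datetime] = {}
        self._expired: Set[str] = set()

    def add(self, path: str, uploaded_at: datetime) -> None:
        self._uploaded[path] = uploaded_at

    def _sweep(self, now: datetime) -> None:
        # Mark every item older than the active window as permanently expired.
        if self.window is None:
            return
        for path, uploaded_at in self._uploaded.items():
            if now - uploaded_at > self.window:
                self._expired.add(path)

    def set_window(self, window: Optional[timedelta], now: datetime) -> None:
        self._sweep(now)      # expire under the old window first
        self.window = window
        self._sweep(now)      # a shorter window expires more items immediately

    def is_accessible(self, path: str, now: datetime) -> bool:
        self._sweep(now)
        return path in self._uploaded and path not in self._expired


# Expanding the window does not revive an item that already expired.
t0 = datetime(2024, 1, 1)
model = ToyRetention(timedelta(days=7))
model.add("a.pdf", t0)
now = t0 + timedelta(days=7, seconds=1)
model.set_window(timedelta(days=30), now)
revived = model.is_accessible("a.pdf", now)  # False
```

The expired set is never cleared, which is the property the bullets describe: changing the policy can only remove access, never restore it.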

Transform media in Foundry

Pipeline Builder

Common media set transformations are available in Pipeline Builder. Learn how to build a batch pipeline with media sets in Pipeline Builder.

Here is an example of the Text Extraction (OCR option) board used on a PDF:

Text extraction in Pipeline Builder

If you are interested in a transformation that is not currently available, contact your Palantir representative.

Code Repositories

Media sets also support specialized transformations like PDF text extraction, optical character recognition (OCR), image tiling, and metadata parsing that can be leveraged in Python transforms by importing the transforms-media library.

Common transformations can be found in the documentation on using media sets with Python transforms.

Here is an example of how you can get started with media sets in Code Repositories:

```python
from transforms.api import transform
from transforms.mediasets import MediaSetInput, MediaSetOutput


@transform(
    images=MediaSetInput('/examples/images'),
    output_images=MediaSetOutput('/examples/output_images')
)
def translate_images(images, output_images):
    ...
```

Access patterns

Advanced users and developers can take advantage of media set access patterns, which are pre-configured transformations that can be performed on demand on the media items in a media set. Each access pattern has a persistence policy for tuning storage and performance: outputs can be recomputed at each request, persisted indefinitely after the first request, or cached for a limited time.

Access patterns are leveraged by the Foundry platform to optimally process or render media set items. For example:

  • Thumbnails and previews for PDFs in Workshop
  • Buffered audio waveforms in the Preview application
  • Tiled satellite imagery in Map

The default set of available access patterns is determined by the configured media set schema. Additional transformations can only be registered as access patterns on a media set through an API call.
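
To make the three persistence policies concrete, here is a toy sketch of how recompute, persist, and cache behave for repeated requests. It is not Foundry's implementation or API; `Persistence`, `ToyAccessPattern`, `ttl_seconds`, and the injectable `clock` are all hypothetical names chosen for illustration.

```python
import time
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class Persistence(Enum):
    RECOMPUTE = "recompute"  # run the transformation on every request
    PERSIST = "persist"      # keep the first output indefinitely
    CACHE = "cache"          # keep the output for a limited time


class ToyAccessPattern:
    def __init__(
        self,
        transform: Callable[[bytes], bytes],
        policy: Persistence,
        ttl_seconds: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if policy is Persistence.CACHE and ttl_seconds is None:
            raise ValueError("CACHE policy needs ttl_seconds")
        self.transform = transform
        self.policy = policy
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.compute_count = 0  # how many times the transformation actually ran
        self._store: Dict[str, Tuple[float, bytes]] = {}

    def get(self, item_id: str, data: bytes) -> bytes:
        if self.policy is not Persistence.RECOMPUTE and item_id in self._store:
            stored_at, result = self._store[item_id]
            fresh = (
                self.policy is Persistence.PERSIST
                or self.clock() - stored_at < self.ttl_seconds
            )
            if fresh:
                return result
        result = self.transform(data)
        self.compute_count += 1
        if self.policy is not Persistence.RECOMPUTE:
            self._store[item_id] = (self.clock(), result)
        return result
```

The trade-off is the usual one between storage and compute: persisting avoids repeated work at the cost of stored outputs, while recomputing stores nothing but pays for the transformation on every request.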

Media references

Items in media sets can be referenced using media references. Media references enable you to use a media item in Foundry without having to make copies of the media item itself.

Use media references to reference media set items in datasets. This is useful for associating media items with metadata or other information in a tabular format. For example, you may associate the original PDF with its file name, page count, and extracted text as additional columns.

You can also use media references as inputs to model adapters for batch inference pipelines.

To produce a list of media references for your media set, use the Get media references function in Pipeline Builder. You can also produce media references in Python transforms by importing the transforms-media library and calling the list_media_items_by_path_with_media_reference method:

```python
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from transforms.mediasets import MediaSetInput


@transform(
    metadata_out=Output("{YOUR_OUTPUT_METADATA_DATASET}"),
    mediaset_in=MediaSetInput("{YOUR_MEDIA_SET_RID}")
)
def compute(ctx, mediaset_in, metadata_out):
    media_references = mediaset_in.list_media_items_by_path_with_media_reference(ctx)
    # Enables in-line thumbnails in dataset
    column_typeclasses = {'mediaReference': [{'kind': 'reference', 'name': 'media_reference'}]}
    metadata_out.write_dataframe(media_references, column_typeclasses=column_typeclasses)
```

Ontologize media using media references

Use media reference object properties to efficiently display your media in applications that build on the ontology. Optimizations include faster and interactive previews in Workshop or Object Explorer, as well as tiling for geospatial imagery in Map.

Custom logic using media reference properties

Use objects with media reference object properties in functions on objects.

You can read the raw media item directly. Additionally, you can perform common type-specific operations on the media item, such as:

  • Running OCR on documents
  • Extracting text from documents
  • Transcribing audio
  • Reading media item metadata

Delete media items from media sets

To delete a media item from a media set, select the item and then select the Delete action. To prevent accidental deletion, you must confirm by selecting Delete again in the pop-up.

Delete media item

Once you have successfully deleted the item, the media set will refresh with a success message. You can now view the media set without the deleted media item.

Successful deletion

Media set compute usage

Media sets bring a number of advanced, out-of-the-box transformations to the platform. In addition to being triggered via transforms and pipelines, media transformations are also triggered by interacting with media items via the frontend (for instance, by previewing a media item). Additionally, there is a cost to download or stream the full contents of a media item.

Usage is tracked in units of Foundry compute-seconds. The tables below list each available transformation with its usage rate, expressed in compute-seconds per gigabyte processed.

If you have an enterprise contract with Palantir, contact your Palantir representative before proceeding with usage calculations.

Transformations

Usage rate is measured in compute-seconds per GB

All

| Transformation | Usage Rate |
| --- | --- |
| Download / stream | 2 |

Images

| Transformation | Usage Rate |
| --- | --- |
| Rotate | 40 |
| Resize | 40 |
| Generate PDF | 40 |
| Adjust Contrast | 75 |
| Crop / chip | 75 |
| Grayscale | 75 |
| Geo tile | 75 |
| Render DICOM image layer | 75 |
| Extract text (OCR) | 275 |
| Encryption / decryption | 75 |

Audio

| Transformation | Usage Rate |
| --- | --- |
| Transcode | 75 |
| Waveform generation | 75 |
| Transcription | 275 |

Video

| Transformation | Usage Rate |
| --- | --- |
| Get timestamps for scene frames | 40 |
| Extract audio | 75 |
| Extract frames at timestamp | 75 |
| Extract all scene frames | 275 |
| Stream with HLS | 275 |
| Transcode | 275 |

Documents

| Transformation | Usage Rate |
| --- | --- |
| Render page as image | 40 |
| Render page as image within bounding box | 40 |
| Get PDF page dimensions | 40 |
| Slice PDF range | 75 |
| Extract form fields | 75 |
| Extract table of contents | 75 |
| Extract text on page (raw) | 75 |
| Extract all text (raw) | 75 |
| Extract text (OCR) | 275 |
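
As a rough worked example, usage for a transformation can be estimated by multiplying its rate by the gigabytes processed. The sketch below copies a handful of rates from the tables above; `USAGE_RATES` and `estimate_compute_seconds` are illustrative names, and the result is only an estimate under the assumption that the listed rate applies linearly to the data volume.

```python
from typing import Dict, Tuple

# (media type, transformation) -> compute-seconds per GB, copied from the tables above.
# The media type is part of the key because Transcode has different rates
# for audio (75) and video (275).
USAGE_RATES: Dict[Tuple[str, str], int] = {
    ("All", "Download / stream"): 2,
    ("Images", "Resize"): 40,
    ("Images", "Geo tile"): 75,
    ("Audio", "Transcode"): 75,
    ("Audio", "Transcription"): 275,
    ("Video", "Transcode"): 275,
    ("Documents", "Render page as image"): 40,
    ("Documents", "Extract text (OCR)"): 275,
}


def estimate_compute_seconds(media_type: str, transformation: str, gigabytes: float) -> float:
    """Estimate usage as rate (compute-seconds per GB) times gigabytes processed."""
    if gigabytes < 0:
        raise ValueError("gigabytes must be non-negative")
    return USAGE_RATES[(media_type, transformation)] * gigabytes


# OCR over 2 GB of PDFs at 275 compute-seconds per GB.
ocr_estimate = estimate_compute_seconds("Documents", "Extract text (OCR)", 2.0)  # 550.0
```

Because previews in the frontend also trigger transformations, an estimate based only on pipeline volume is likely to be a lower bound on actual usage.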

Media set limits

  • Transactional media sets have a limit of 10,000 items per transaction.
  • Transactionless media sets do not have an item limit.
  • Paths of items in media sets cannot exceed 256 characters. Attempting to add an item with a path longer than 256 characters to a media set will result in a MediaSet:MediaItemPathInvalid error.
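
A client-side pre-check can catch these limits before an upload is attempted. The following is a minimal sketch; `check_path` and `transaction_batches` are hypothetical helpers rather than Foundry functions, and the sketch assumes that the 256-character limit counts characters as Python's `len` does.

```python
from typing import Iterator, List, Sequence

MAX_PATH_LENGTH = 256               # from the limits above
MAX_ITEMS_PER_TRANSACTION = 10_000  # applies to transactional media sets


def check_path(path: str) -> None:
    """Raise locally instead of waiting for MediaSet:MediaItemPathInvalid."""
    if len(path) > MAX_PATH_LENGTH:
        raise ValueError(
            f"path has {len(path)} characters; the limit is {MAX_PATH_LENGTH}"
        )


def transaction_batches(
    paths: Sequence[str], size: int = MAX_ITEMS_PER_TRANSACTION
) -> Iterator[List[str]]:
    """Validate every path, then yield groups that fit in one transaction."""
    for path in paths:
        check_path(path)
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])
```

For example, 25,000 valid paths would be split into three batches of 10,000, 10,000, and 5,000 items. Batching is unnecessary for transactionless media sets, which have no item limit.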