A media set is a collection of media files with a common schema, for example, files of the same format. Media sets are designed to work with high-scale, unstructured data and enable the processing of media items such as audio, imagery, video, and documents. Media sets enable access to flexible storage, compute optimizations, and schema-specific transformations to enhance media workflows and pipelines.
Example media set workflows include:
To begin building a pipeline, follow these steps:
The following file types are supported as media sets:
- .wav
- .mp3
- .sph
- .flac
- .opus, .ogg
- .webm
- .pdf*
- .docx
- .png
- .jpg, .jpeg
- .jp2
- .bmp
- .tiff, .tif
- .nitf
- .dcm
- .mkv
- .mp4
- .mov
- .ts

*PDF files that require proprietary features to view or are protected by passwords, digital signatures, or encryption are not supported.
Media sets can be configured for import through a direct upload, a connection to an external source system in Data Connection, API posts, and transforms (including external transforms).
To import media files to a media set through a direct upload, drag and drop the files into your new media set. A file can be uploaded only if it matches the file type that was specified when the media set was created.
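The file-type requirement for direct uploads can be sketched as a small pre-upload check. This is a minimal illustration rather than part of any Foundry API: the names `SUPPORTED_EXTENSIONS` and `matches_media_set_type` are hypothetical, the extension list is copied from the supported file types in this document, and the idea that a media set accepts a set of extensions is an assumption made for the example.

```python
from pathlib import Path

# Extensions copied from the supported file types listed in this document.
SUPPORTED_EXTENSIONS = {
    ".wav", ".mp3", ".sph", ".flac", ".opus", ".ogg", ".webm",
    ".pdf", ".docx",
    ".png", ".jpg", ".jpeg", ".jp2", ".bmp", ".tiff", ".tif", ".nitf",
    ".dcm",
    ".mkv", ".mp4", ".mov", ".ts",
}


def matches_media_set_type(filename: str, expected_extensions: set[str]) -> bool:
    """Return True if the file's extension is supported and is one of the
    extensions expected by the (hypothetical) target media set."""
    suffix = Path(filename).suffix.lower()
    return suffix in SUPPORTED_EXTENSIONS and suffix in expected_extensions


# Assumption for illustration: a media set created for JPEG imagery.
jpeg_media_set = {".jpg", ".jpeg"}
print(matches_media_set_type("scan_001.JPG", jpeg_media_set))  # True
print(matches_media_set_type("report.pdf", jpeg_media_set))    # False
```

Running a check like this before a bulk upload can surface mismatched files early instead of at upload time.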
media set from the search bar as shown below.

Media sets can be imported using a sync to an external source through Data Connection. A detailed walk-through can be found in the media set sync documentation.
To create a new media set sync, navigate to the Overview tab of the desired source.
After you create the sync, trigger a build in the media set view for the media to appear in your media set.
You can also connect an existing source to a new media set via the Select a source option.
For supported source types, media sets can optionally be configured to read directly from the external source system so no data is copied into Foundry’s backing store ("virtual media sets").
Currently, virtual media sets are supported only for certain source types. If your use case needs virtual storage for another source type, contact Palantir support.
For sources with REST APIs, you can import media to a media set through external transforms.
Media sets can also be directly imported into Pipeline Builder. Learn more about available upload methods in Pipeline Builder.
You can configure a time-based retention policy for a media set, for example 14 days, for data that does not need to exist forever. Media items will only be accessible for the retention window, after which they will be permanently deleted. This is a helpful option to minimize storage costs.
Once a media item's retention window expires, it will never become accessible again, and will be deleted. For example:
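As a rough sketch of the retention rule, the snippet below treats an item as accessible only while the current time falls inside its retention window. It assumes the window is measured from the time the item was added to the media set, which this document does not specify, and the function name `is_accessible` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_accessible(uploaded_at: datetime, retention: timedelta, now: datetime) -> bool:
    """An item is accessible only while it is inside its retention window.

    Assumption: the window starts when the item was added to the media set.
    """
    return now < uploaded_at + retention


uploaded = datetime(2024, 1, 1, tzinfo=timezone.utc)
window = timedelta(days=14)  # the 14-day example from the text

print(is_accessible(uploaded, window, datetime(2024, 1, 10, tzinfo=timezone.utc)))  # True
print(is_accessible(uploaded, window, datetime(2024, 1, 20, tzinfo=timezone.utc)))  # False
```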
Common media set transformations are available in Pipeline Builder. Learn how to build a batch pipeline with media sets on Pipeline Builder.
Here is an example of the Text Extraction (OCR option) board used on a PDF:
If you are interested in a transformation that is not currently available, contact your Palantir representative.
Media sets also support specialized transformations like PDF text extraction, optical character recognition (OCR), image tiling, and metadata parsing that can be leveraged in Python transforms by importing the `transforms-media` library.
Common transformations can be found in the documentation on using media sets with Python transforms.
Here is an example of how you can get started with media sets in Code Repositories:
```python
from transforms.api import transform
from transforms.mediasets import MediaSetInput, MediaSetOutput


@transform(
    images=MediaSetInput('/examples/images'),
    output_images=MediaSetOutput('/examples/output_images')
)
def translate_images(images, output_images):
    ...
```
Advanced users and developers can take advantage of media set access patterns, which are pre-configured transformations that can be performed on demand on the media items in a media set. Each access pattern has a persistence policy for tuning storage and performance: outputs can be recomputed on every request, persisted indefinitely after the first request, or cached for a limited time.
Access patterns are leveraged by the Foundry platform to optimally process or render media set items. For example:
The default available set of access patterns is determined based on the configured media set schema. Additional transformations are registered as access patterns to a media set via API call only.
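The three persistence options described for access patterns (recompute on every request, persist after the first request, or cache for a time) can be modeled conceptually as follows. This is an illustrative sketch only: the `Persistence` enum and the `AccessPattern` class are hypothetical names invented for this example and do not reflect how Foundry implements or exposes access patterns.

```python
import time
from enum import Enum
from typing import Callable, Dict, Tuple


class Persistence(Enum):
    RECOMPUTE = "recompute"   # run the transformation on every request
    PERSIST = "persist"       # keep the first result indefinitely
    CACHE_TTL = "cache_ttl"   # keep the result for a limited time


class AccessPattern:
    """Hypothetical model of an on-demand transformation with a persistence policy."""

    def __init__(self, transform: Callable[[str], str], policy: Persistence,
                 ttl_seconds: float = 0.0,
                 clock: Callable[[], float] = time.monotonic):
        self.transform = transform
        self.policy = policy
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[str, float]] = {}  # item -> (result, stored_at)
        self.compute_count = 0  # how many times the transformation actually ran

    def get(self, item: str) -> str:
        cached = self._store.get(item)
        if cached is not None and self.policy is not Persistence.RECOMPUTE:
            result, stored_at = cached
            fresh = (self.policy is Persistence.PERSIST
                     or self.clock() - stored_at < self.ttl_seconds)
            if fresh:
                return result
        # No usable stored output: run the transformation now.
        result = self.transform(item)
        self.compute_count += 1
        if self.policy is not Persistence.RECOMPUTE:
            self._store[item] = (result, self.clock())
        return result


thumbnail = AccessPattern(lambda item: f"thumbnail({item})", Persistence.PERSIST)
thumbnail.get("photo.png")
thumbnail.get("photo.png")
print(thumbnail.compute_count)  # 1
```

The trade-off in this model is the usual one: recomputing saves storage at the cost of repeated compute, while persisting or caching does the opposite.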
Items in media sets can be referenced using media references. Media references enable you to use a media item in Foundry without having to make copies of the media item itself.
Use media references to reference media set items in datasets. This is useful for associating media items with metadata or other information in a tabular format. For example, you may associate the original PDF with its file name, page count, and extracted text as additional columns.
You can also use media references as inputs to model adapters for batch inference pipelines.
To produce a list of media references for your media set, use the Get media references function in Pipeline Builder. You can also produce media references in Python transforms by importing the `transforms-media` library and calling the `list_media_items_by_path_with_media_reference` method:
```python
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from transforms.mediasets import MediaSetInput


@transform(
    metadata_out=Output("{YOUR_OUTPUT_METADATA_DATASET}"),
    mediaset_in=MediaSetInput("{YOUR_MEDIA_SET_RID}")
)
def compute(ctx, mediaset_in, metadata_out):
    media_references = mediaset_in.list_media_items_by_path_with_media_reference(ctx)
    column_typeclasses = {'mediaReference': [{'kind': 'reference', 'name': 'media_reference'}]}  # Enables in-line thumbnails in dataset
    metadata_out.write_dataframe(media_references, column_typeclasses=column_typeclasses)
```
Use media reference object properties to efficiently display your media in applications that build on the ontology. Optimizations include faster and interactive previews in Workshop or Object Explorer, as well as tiling for geospatial imagery in Map.
Use objects with media reference object properties in functions on objects.
You can read the raw media item directly. Additionally, you can perform common type-specific operations on the media item, such as:
To delete a media item from a media set, select the item and then select the Delete action. To prevent accidental deletion, you must confirm by selecting Delete again in the pop-up.
Once you have successfully deleted the item, the media set will refresh with a success message. You can now view the media set without the deleted media item.
Media sets bring a number of advanced, out-of-the-box transformations to the platform. In addition to being triggered via transforms and pipelines, media transformations are also triggered by interacting with media items via the frontend (for instance, by previewing a media item). Additionally, there is a cost to download or stream the full contents of a media item.
Usage is tracked in units of Foundry compute-seconds. The table below describes each transformation available, with usage rate in terms of compute-seconds per gigabyte processed.
If you have an enterprise contract with Palantir, contact your Palantir representative before proceeding with usage calculations.
Usage rate is measured in compute-seconds per GB.

Transformation | Usage Rate (compute-seconds per GB) |
---|---|
Download / stream | 2 |

Transformation | Usage Rate (compute-seconds per GB) |
---|---|
Rotate | 40 |
Resize | 40 |
Generate PDF | 40 |
Adjust Contrast | 75 |
Crop / chip | 75 |
Grayscale | 75 |
Geo tile | 75 |
Render DICOM image layer | 75 |
Extract text (OCR) | 275 |
Encryption / decryption | 75 |

Transformation | Usage Rate (compute-seconds per GB) |
---|---|
Transcode | 75 |
Waveform generation | 75 |
Transcription | 275 |

Transformation | Usage Rate (compute-seconds per GB) |
---|---|
Get timestamps for scene frames | 40 |
Extract audio | 75 |
Extract frames at timestamp | 75 |
Extract all scene frames | 275 |
Stream with HLS | 275 |
Transcode | 275 |

Transformation | Usage Rate (compute-seconds per GB) |
---|---|
Render page as image | 40 |
Render page as image within bounding box | 40 |
Get PDF page dimensions | 40 |
Slice PDF range | 75 |
Extract form fields | 75 |
Extract table of contents | 75 |
Extract text on page (raw) | 75 |
Extract all text (raw) | 75 |
Extract text (OCR) | 275 |
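Because usage rates are expressed in compute-seconds per gigabyte processed, a rough estimate is the rate multiplied by the data volume. The sketch below uses a few rates copied from the tables above; the dictionary keys and the function name `estimate_compute_seconds` are hypothetical labels chosen for this example, not Foundry identifiers, and actual billing may differ under an enterprise contract.

```python
# Rates in compute-seconds per GB, copied from the usage tables.
# The keys are hypothetical labels, not Foundry identifiers.
USAGE_RATES = {
    "download_stream": 2,
    "resize": 40,
    "transcription": 275,
    "extract_text_ocr": 275,
}


def estimate_compute_seconds(transformation: str, gigabytes: float) -> float:
    """Estimate usage as rate (compute-seconds per GB) x data processed (GB)."""
    if gigabytes < 0:
        raise ValueError("gigabytes must be non-negative")
    return USAGE_RATES[transformation] * gigabytes


print(estimate_compute_seconds("extract_text_ocr", 2.5))  # 687.5
print(estimate_compute_seconds("download_stream", 10))    # 20
```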
`MediaSet:MediaItemPathInvalid` error.