External transforms allow direct references to egress policies, credential values, and export control markings. The newer source-based external transforms instead import a Data Connection source that contains these settings as a single resource, which offers several advantages over configuring each setting directly in the repository. Review our documentation for more information on using source-based external transforms.
You can use Python transforms in Code Repositories to connect to an external system directly in code. This could be useful if, for example, you require complex logic for interacting with an external system or you want to use a software development kit (SDK) provided by a third party. We recommend using external transforms to perform scheduled syncs and exports using REST APIs.
Before continuing, be sure to create a Python transforms code repository and configure it following the instructions below.
Transforms that connect to an external system can only be previewed with the Preview helper in Code Repositories if they do not use Foundry inputs.
This section covers the configurations required to use external transforms.
External transforms may only be used with repositories running in SECURE mode. Alternative security modes are available only by arrangement with your Palantir representative.
Within the code repository you just created, a Foundry user with the Information Security Officer role should navigate to the repository Settings tab, then to Repository > External systems, and toggle on the option to Allow access to external systems from this repository.

The Information Security Officer is a default role in Foundry; users can be granted the Information Security Officer role in Control Panel under Enrollment permissions.
A Foundry user with the Information Security Officer role may optionally allow the use of inputs in external transforms. When a dataset is used as an input to a transform that can communicate with an external system, any data in that dataset can potentially leave Foundry. The Information Security Officer should therefore select the set of security Markings and Organization Markings for any data that should be allowed as an input to external transforms in this repository.
As an example, you may have a dataset with a Sensitive Marking in the Palantir Organization. To use this dataset in an external transform, the Information Security Officer must add both the Sensitive Marking and the Palantir Organization under step three, Configure use of Foundry inputs with external systems, in the code repository settings.
Once external system interactions are enabled for the code repository, you must add the transforms-external-systems library from the Libraries sidebar tab to the left of the code interface.
After adding the library, you can access a new Egress and Credentials tab to the left of the code interface.
In this tab, you can set up the network egress policies required for any endpoints called from your code, and add credentials (if necessary) for the repository to use. Note that not all endpoints require credentials.
Foundry requires all connections initiated from user-authored code to be bound to a network egress policy when reaching out to external destinations. A network egress policy allows permitted connections that are initiated from user-authored code to reach destinations outside Foundry.
If self-service egress is enabled for your enrollment, you can create and administer egress policies in Control Panel. If your connection requires you to add a new network route, contact your Information Security Officer or Palantir representative with the details of your egress policy.
If your Foundry instance does not have self-service egress enabled, contact your Palantir representative to internally enable the endpoint and configure the policy.
Before using egress policies for external transforms (whether global or non-global), you must import them from the Egress tab in the code repository, found in the Egress and Credentials panel. The Egress tab exposes all policies to which you have permissions and classifies them into three categories: policies already imported into the Project, policies marked as Importable, and policies marked as Access request required, for which you hold Viewer but not Importer permissions on the policy.

Within this tab, search for the desired network egress policy. If the policy is marked as Importable, select it and choose Import egress policy to project.
If the desired network egress policy is marked as Access request required, a Foundry user with the Information Security Officer role should grant you the Importer role on the Network egress page in Control Panel. To grant the Importer role, the Information Security Officer should first select the desired network policy, then select Actions > Manage Sharing to add you as an Importer of the policy.
If your external system requires authorized credentials (username and password) to gain access, you can add a set of credentials using the Credentials tab in the Egress and Credentials side panel and use them in your code.
You must allowlist Foundry IP addresses if the source endpoint has an allowlisting process in place. Contact your Palantir representative for the required IP addresses.
If your external system requires an SSL/TLS certificate to be passed, you can store the certificate in a credential, then use the credential in the external transforms logic.
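The exact layout of the stored secrets depends on how the credential was created; the following sketch assumes the credential holds a PEM-encoded certificate under a hypothetical secret name, ca-cert. The certificate is written to a temporary file and handed to the HTTP client for server verification:

```python
from transforms.api import transform, Output
from transforms.external.systems import use_external_systems, EgressPolicy, Credential
import requests
import tempfile


@use_external_systems(
    egress=EgressPolicy('<policy RID>'),
    creds=Credential('<credential RID>')
)
@transform(
    output=Output('/path/to/output/dataset')
)
def compute(egress, output, creds):
    # 'ca-cert' is a hypothetical secret name holding a PEM-encoded certificate.
    pem_contents = creds.get('ca-cert')

    # Write the certificate to a temporary file so it can be passed to the HTTP client.
    with tempfile.NamedTemporaryFile(mode='w', suffix='.pem', delete=False) as pem_file:
        pem_file.write(pem_contents)
        pem_path = pem_file.name

    # Verify the server against the stored certificate.
    response = requests.get('https://<API URL>', verify=pem_path, timeout=10)

    with output.filesystem().open('response.json', 'w') as f:
        f.write(response.text)
```

Keeping the certificate in a credential avoids committing it to the repository itself.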
You are now ready to write a Python data transform to an external system. We recommend reviewing the basic instructions on how to write a Python data transformation before writing an external transform.
From the transforms.external.systems package, import the required use_external_systems decorator along with the EgressPolicy and Credential (if needed) inputs. From transforms.api, import the transform decorator and Output. If your external source does not require credentials, you do not need to add Credential inputs to your logic.

You must use the use_external_systems decorator to successfully write external transforms logic.
The following example shows how to use the two decorators:
```python
from transforms.api import transform, Output
from transforms.external.systems import use_external_systems, EgressPolicy, Credential


@use_external_systems(
    egress=EgressPolicy('<policy RID>'),
    creds=Credential('<credential RID>')
)
@transform(
    output=Output('/path/to/output/dataset')
)
def compute(egress, creds, ...):
    # ...
```
You can then set up a simple transform to reach out to an API:
```python
from transforms.api import transform, Output
from transforms.external.systems import EgressPolicy, use_external_systems, Credential
import requests


@use_external_systems(
    egress=EgressPolicy('<policy RID>'),
    creds=Credential('<credential RID>')
)
@transform(
    output=Output('/path/to/output/dataset')
)
def compute(egress, output, creds):
    username = creds.get('username')
    password = creds.get('password')
    response = requests.get('https://<API URL>', auth=(username, password), timeout=10).text

    with output.filesystem().open('response.json', 'w') as f:
        f.write(response)
```
You can also specify multiple egress policies or credentials in a single transform, as shown in the example below.
```python
from transforms.api import transform, Output
from transforms.external.systems import use_external_systems, EgressPolicy, Credential


@use_external_systems(
    egress1=EgressPolicy('<first policy RID>'),
    egress2=EgressPolicy('<second policy RID>'),
    creds1=Credential('<first credential RID>'),
    creds2=Credential('<second credential RID>')
)
@transform(
    output=Output('/path/to/output/dataset')
)
def compute(egress1, egress2, creds1, creds2, ...):
    # ...
```
In some cases, it may be useful to write external transforms that process Foundry input data. For example, you might want to query an API to gather additional metadata for each row in a tabular dataset. Alternatively, you might have a workflow where you need to mirror Foundry data into an external software system.
Such cases are considered export-controlled workflows, as they open the possibility of exporting secure Foundry data into another system with unknown security guarantees and severed data provenance. It is the transform developer's responsibility to assess the security of an export workflow and ensure that the data leaving Foundry is correctly secured in external systems. Foundry provides governance controls to ensure developers can clearly encode security intent, and Information Security Officers can audit the scope and intent of workflows interacting with external systems.
To opt into export controls, apply an ExportControl to your external transform. The ExportControl accepts a list of security Markings and Organization IDs that are intended for export; the transform job is then guaranteed to fail if upstream data is marked with additional security Markings. To expand the set of exportable Markings supported in a code repository, an Information Security Officer must adjust the repository settings: navigate to the Settings > Repository tab, then, in the External Systems section, choose Allow use of Foundry inputs with external systems in this repository under Step 3.
Security Markings and Organizations are both implemented as Markings and should be listed together in the export control configuration. An error message about missing Markings may be referring to security Markings or Organizations.
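For instance, to export data that carries both a security Marking and an Organization, list the two Marking IDs together in the same markings list. The IDs and dataset paths below are placeholders:

```python
from transforms.api import transform, Input, Output
from transforms.external.systems import use_external_systems, ExportControl, EgressPolicy


@use_external_systems(
    # Security Markings and Organizations are both Markings, so both IDs go in the same list.
    export_control=ExportControl(markings=['<security marking ID>', '<organization marking ID>']),
    egress=EgressPolicy('<policy RID>'),
)
@transform(
    output=Output('/path/to/output/dataset'),
    source=Input('/path/to/input/dataset'),
)
def compute(export_control, egress, output, source):
    # ... export logic for `source` goes here
```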
As an example, you could use an export external transform to download an image URL as specified in a Foundry dataset:
```python
from transforms.api import transform, Input, Output
from transforms.external.systems import use_external_systems, ExportControl, EgressPolicy
from pyspark.sql.functions import udf, col
import requests
import shutil


@use_external_systems(
    export_control=ExportControl(markings=['<marking ID>']),
    egress=EgressPolicy('<policy RID>'),
)
@transform(
    images_output=Output('/path/to/output/dataset'),
    image_urls=Input('/path/to/input/dataset'),
)
def compute(export_control, egress, images_output, image_urls):
    @udf
    def download(name, url):
        # Stream each image from the external URL and write it into the output dataset.
        response = requests.get(url, stream=True)
        with images_output.filesystem().open(f'{name}.jpg', mode='wb') as out_file:
            shutil.copyfileobj(response.raw, out_file)
        return True

    image_urls.dataframe().withColumn('downloaded', download(col('Name'), col('Url'))).collect()
```
Media set outputs are in a beta stage of development and may not be available on your enrollment.
When using external transforms to ingest media files like images, audio, or PDFs, you may want to write the files out to a media set. To write a media item to an existing media set, programmatically put the item in the specified media set output:
```python
from transforms.api import transform
from transforms.mediasets import MediaSetOutput
from transforms.external.systems import use_external_systems, EgressPolicy
import requests
import tempfile


@use_external_systems(
    egress=EgressPolicy('<policy RID>')
)
@transform(
    media_set_output=MediaSetOutput('/path/to/output/media/set')
)
def compute(egress, media_set_output):
    response = requests.get('https://<API URL>')
    fname = 'my_image.png'

    # Buffer the downloaded bytes in a temporary file, then put the file into the media set.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(response.content)
        tmp.flush()

        with open(tmp.name, 'rb') as tmp_read:
            media_set_output.put_media_item(tmp_read, path=fname)
```
You can programmatically access and use API services offered by the external system. As an example, you can use the Mapbox Static Images API ↗ from an external transform to output a map image to a Foundry dataset:
```python
from transforms.external.systems import EgressPolicy, use_external_systems, Credential
from transforms.api import transform, Output
from mapbox import Static


@use_external_systems(
    egress=EgressPolicy('<policy RID>'),
    mapbox_creds=Credential('<credential RID>')
)
@transform(
    output=Output('/Users/username/datasets/example_mapbox'),
)
def compute(output, egress, mapbox_creds):
    mapbox_access_token = mapbox_creds.get('mapbox-token')
    client = Static(access_token=mapbox_access_token)
    london = client.image('mapbox.satellite', lat=51.5072, lon=0.0, z=12)

    # Log the response code alongside the image for easier debugging.
    with output.filesystem().open('london.log', 'w') as f:
        f.write(f'response code: {london.status_code}')

    with output.filesystem().open('london.png', 'wb') as f:
        f.write(london.content)
```
You can programmatically open, stream, and interact with files as you would with a regular transform. Learn more about reading and writing unstructured files.
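As a minimal sketch (the URL, file name, and chunk size are placeholder choices), a response can be streamed in chunks directly into a file on the output dataset instead of being held in memory:

```python
from transforms.api import transform, Output
from transforms.external.systems import use_external_systems, EgressPolicy
import requests


@use_external_systems(
    egress=EgressPolicy('<policy RID>')
)
@transform(
    output=Output('/path/to/output/dataset')
)
def compute(egress, output):
    # Stream the response body in chunks rather than loading it fully into memory.
    with requests.get('https://<API URL>/large_file.csv', stream=True, timeout=30) as response:
        with output.filesystem().open('large_file.csv', 'wb') as out_file:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                out_file.write(chunk)
```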
You can run external transforms incrementally, just as you would a regular transform.

Maintaining state between executions is often necessary in external incremental transforms. With regular incremental transforms, the input files are tracked and compared to the latest processed transaction. For external incremental transforms, the incremental state must be stored explicitly in another way. One option is to save a state file that holds the latest state of the incremental transform, as shown in the example below:
```python
from transforms.api import transform, Output, configure, incremental
from transforms.external.systems import use_external_systems, EgressPolicy, Credential
from pyspark.sql import types as T
import json
import logging

log = logging.getLogger(__name__)


@incremental()
@configure(profile=["KUBERNETES_NO_EXECUTORS"])
@use_external_systems(
    egress=EgressPolicy('<policy RID>'),
    creds=Credential('<credential RID>')
)
@transform(
    out=Output("<output_dataset>"),
)
def compute(ctx, out, egress, creds):
    # Get the filesystem of the output dataset to read and write files onto it:
    out_fs = out.filesystem()

    # Prepend with "_" to treat the file as hidden so it does not show in the dataset preview.
    state_filename = "_state.json"
    state = {"last_seen": 0}  # Some arbitrary starting state

    # Try to fetch the state from the output dataset:
    try:
        with out_fs.open(state_filename, mode='r') as state_file:
            data = json.load(state_file)
            # Validate the fetched state, then adopt it:
            state = data
            log.info(f"State file found, continuing from: {data}")
    except Exception:
        log.warning("State file not found, starting over from the default state")

    # Here, write logic to make API calls, run custom processing, generate a dataframe and save it,
    # or save files directly to the output dataset.
    # For example purposes, the following generates a dataframe and saves it to the output dataset
    # (as if the data had been fetched through API calls):
    out.write_dataframe(get_dataframe(ctx))

    # Update the state for the next iteration:
    state["last_seen"] = 1 + state["last_seen"]

    # Save the new state on the output dataset:
    with out_fs.open(state_filename, "w") as state_file:
        json.dump(state, state_file)


# This function generates a dataframe for example purposes. This step would not be relevant
# in a production implementation:
def get_dataframe(ctx):
    # Define the schema of the dataframe:
    schema = T.StructType([
        T.StructField("name", T.StringType(), True),
        T.StructField("age", T.IntegerType(), True),
        T.StructField("city", T.StringType(), True)
    ])
    # Create a list of rows:
    data = [("Alice", 25, "New York"), ("Bob", 30, "San Francisco"), ("Charlie", 35, "London")]
    # Create a PySpark dataframe and return it:
    return ctx.spark_session.createDataFrame(data, schema=schema)
```