Data Connection sources can now be directly imported into code repositories. Source-based external transforms are the preferred method to connect to external systems, superseding legacy external transforms.
External transforms allow connections to external systems from Python transforms repositories.
External transforms are primarily used to perform batch sync, export, and media sync workflows when one of the following is true:
Solutions to these situations may include the following:
Any transforms that use virtual tables are also considered to be external transforms, since the transforms job must be able to reach out to the external system that contains the virtualized data. To use virtual tables in Python transforms, follow the instructions below to set up the source.
In this setup guide, we will walk through creating a Python transforms repository that connects to a free public API of Pokemon data ↗. The examples then use this API to explain various features of external transforms and how they can be used with the API.
The Pokemon API used in this setup guide is unaffiliated with Palantir and may change at any time. This tutorial is not an endorsement, recommendation, or suggestion to use this API for production use cases.
Before following this guide, be sure to first create a Python transforms repository and review how to author Python transforms as described in our tutorial. All features of Python transforms are compatible with external transforms.
Before you can connect to an external system from your Python repository, you must create a Data Connection source that you can import into code. For this tutorial, we will create a REST API source that connects to the PokeAPI mentioned above.
The quickest way to create a source for use in external transforms is from a Python transforms code repository. Once you have initialized a repository, complete the following steps to set up a generic source:
For this tutorial, you should add an egress policy for the PokeAPI: pokeapi.co. You will not need any secrets since this API does not require authentication, and export controls may be skipped for now. However, they will be required to use Foundry data inputs with this source.
Since this connection is to a REST API, you will be automatically prompted to convert your generic connector to a REST API source so that you can use the built-in Python requests client.
You may also create a source from the Data Connection application or use an existing source you have already configured. To use this option, follow the steps below:
Review the Overview page, then select Continue in the bottom right. You will be prompted to choose the connection runtime: a direct connection, through an agent worker, or through an agent proxy. Since agent worker connections are not supported for external transforms, choose to use a direct connection to the Internet or an agent proxy to connect to the PokeAPI.
Choose a name for your source, and select a Project to which it should be saved.
Fill out the Domains section with the connection information of the API source. For the PokeAPI example, the domain is pokeapi.co and no authentication is required.
REST API sources with multiple domains may not be imported. Instead, you should create a separate REST API source per domain if multiple domains are required in the same external transform.
First, you must allow the REST API source to import into code. To configure this setting, navigate to the source in Data Connection, then to the Connection settings > Code import configuration tab.
Toggle on the option to Allow this source to be imported into code repositories. Any code repositories that import this source will be displayed on this page.
Your repository must also have the transforms-external-systems library installed. Libraries are installed using the Libraries tab in the left side panel: search for the desired library, then select Install. Learn more about installing and managing libraries.
You must have at least Editor access to the source to be able to import it into the repository. Read more about permissions.
Once you set up a Python transforms repository that imports your PokeAPI source, you are ready to start writing Python transforms code that uses the source to connect externally.
The @external_systems decorator
To use external transforms, you must import the external_systems decorator and Source object from the transforms.external.systems library:

```python
from transforms.external.systems import external_systems, Source
```
You should then specify the sources that should be included in a transform by using the external_systems decorator:

```python
@external_systems(
    poke_source=Source("ri.magritte..source.e301d738-b532-431a-8bda-fa211228bba6")
)
```
Sources will automatically be rendered as links to open in Data Connection and will display the source name instead of the resource identifier.
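If a workflow spans more than one domain, each single-domain REST API source can be imported into the same transform by passing additional keyword arguments to the decorator. The following is a minimal sketch; both source identifiers are hypothetical placeholders:

```python
@external_systems(
    # hypothetical sources, one per domain, each created as a separate REST API source
    first_api=Source("ri.magritte..source.00000000-0000-0000-0000-000000000001"),
    second_api=Source("ri.magritte..source.00000000-0000-0000-0000-000000000002")
)
```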
Once a source is imported into your transform, you can access attributes of the source through the built-in connection object returned by the get_https_connection() method. The example below shows how to retrieve the base URL of the PokeAPI source configured in the previous step.

```python
poke_url = poke_source.get_https_connection().url
```
Additional secrets or credentials stored on the source can also be accessed in code. To identify the secret names that are available, navigate to the left panel in your transform. Use the following syntax to access secrets in code:

```python
poke_source.get_secret("additionalSecretFoo")
```
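The PokeAPI itself requires no authentication, but for an API that did, a secret stored on the source could be combined with outbound requests. The snippet below is a hypothetical sketch reusing the additionalSecretFoo name from above:

```python
# hypothetical: read a secret from the source and build an authorization header with it
token = poke_source.get_secret("additionalSecretFoo")
headers = {"Authorization": f"Bearer {token}"}
```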
Currently, it is not possible to access source attributes that are not credentials unless the source provides an HTTPS client. For example, on a PostgreSQL source you will not be able to access the hostname or other non-secret attributes.
For sources that provide a RESTful API, the source object allows you to interact with a built-in HTTPS client. This client will be pre-configured with all of the details specified on the source, including any server or client certificates, and you can simply start making requests to the external system.
```python
poke_url = poke_source.get_https_connection().url
poke_client = poke_source.get_https_connection().get_client()

# poke_client is a pre-configured Session object from the Python `requests` library.
# Example of a GET request:
response = poke_client.get(poke_url + "/api/v2/pokemon/" + name, timeout=10)
```
When connecting to an on-premise system using an agent proxy, you must use the built-in client, since that will be automatically configured with the necessary agent proxy configuration.
The example below illustrates a complete transform that pages through all pokemon returned by the API in batches of 100 and outputs all pokemon names to a dataset.
```python
from transforms.api import transform_df, Output
from transforms.external.systems import external_systems, Source
from pyspark.sql import Row
import json
import logging

logger = logging.getLogger(__name__)


@external_systems(
    # specify the source that was imported to the repository
    poke_source=Source("ri.magritte..source.e301d738-b532-431a-8bda-fa211228bba6")
)
@transform_df(
    # this transform doesn't use any inputs, and only specifies an output dataset
    Output("/path/to/output/dataset")
)
def compute(poke_source, ctx):
    poke = poke_source.get_https_connection().get_client()
    poke_url = poke_source.get_https_connection().url

    data = []

    start_url = poke_url + "/api/v2/pokemon?limit=100&offset=0"
    while start_url is not None:
        # loop until no more pages are available
        logger.info("Fetching data from PokeAPI: " + start_url)

        # fetches up to 100 pokemon per page using the built-in HTTPS client for the PokeAPI source
        response = poke.get(start_url)

        response_json = json.loads(response.text)
        for pokemon in response_json["results"]:
            data.append(Row(name=pokemon["name"]))
        start_url = response_json["next"]

    # the data fetched and parsed from the external system are written to the output dataset
    return ctx.spark_session.createDataFrame(data)
```
External transforms often need to use Foundry input data. For example, you might want to query an API to gather additional metadata for each row in a tabular dataset. Alternatively, you might have a workflow where you need to export Foundry data into an external software system.
Such cases are considered export-controlled workflows, as they open the possibility of exporting secure Foundry data into another system with unknown security guarantees and severed data provenance. When configuring a source connection, the source owner must specify whether or not data from Foundry may be exported, and provide the set of security markings and organizations that may be exported. Foundry provides governance controls to ensure developers can clearly encode security intent, and Information Security Officers can audit the scope and intent of workflows interacting with external systems.
Exports are controlled using security markings. When configuring a source, the export configuration specifies which security markings and organizations are safe to export to the external system. To set this up, navigate to the source in the Data Connection application, then to the Connection settings > Export configuration tab. You should then toggle on the option to Enable exports to this source and select the set of markings and organizations that may potentially be exported.
Doing this requires permission to remove Markings on the relevant data and Organizations, since exporting is considered equivalent to removing Markings on data within Foundry.
The setting to Enable exports to this source must be toggled on to allow the following:
For example, the export configuration for the PokeAPI source could allow data from the Palantir organization with no additional security markings to be exported to the PokeAPI.
Note that Enable exports to this source must be toggled on even if you are not actually exporting data to this system, since allowing Foundry data inputs into the same compute job with an open connection to this system means that data could be exported.
In this example, we start with an input dataset of pokemon names and use the PokeAPI to output an enriched dataset that includes the height and weight of each pokemon. The example also illustrates basic error handling based on the status code of the response.
```python
from transforms.api import transform_df, Output, Input
from transforms.external.systems import external_systems, Source
from pyspark.sql import Row
import json
import logging

logger = logging.getLogger(__name__)


@external_systems(
    poke_source=Source("ri.magritte..source.e301d738-b532-431a-8bda-fa211228bba6")
)
@transform_df(
    # output dataset of enriched pokemon data retrieved from PokeAPI
    Output("/path/to/output/dataset"),

    # input dataset of pokemon names
    pokemon_list=Input("/path/to/input/dataset")
)
def compute(poke_source, pokemon_list, ctx):
    poke = poke_source.get_https_connection().get_client()
    poke_url = poke_source.get_https_connection().url

    def make_request_and_enrich(row: Row):
        name = row["name"]
        response = poke.get(poke_url + "/api/v2/pokemon/" + name, timeout=10)

        if response.status_code == 200:
            data = json.loads(response.text)
            height = data["height"]
            weight = data["weight"]
        else:
            logger.warning(f"Request for {name} failed with status code {response.status_code}.")
            height = None
            weight = None

        return Row(name=name, height=height, weight=weight)

    return pokemon_list.rdd.map(make_request_and_enrich).toDF()
```
Before using external transforms, make sure to familiarize yourself with the Data Connection - Permissions reference page.
The following are some key workflow differences between external transforms and legacy external transforms:
- The decorator has changed from @use_external_systems() to @external_systems().
Key advantages of external transforms include the following:
There is currently no automatic migration path to update legacy external transforms to source-based external transforms. However, the manual action required is expected to be minimal for most workflows.
The following are the main steps to manually migrate to external transforms:
- Add the @external_systems() decorator with source references, then remove any instances of the @use_external_systems() decorator. This will likely involve updating any references to credentials in your transforms logic to instead reference credentials retrieved from the sources you are now importing.
Transforms cannot contain both source-based external transforms and their legacy version. To remedy this, you can migrate all legacy external transforms to use source-based external transforms instead (preferred), or split your transform into multiple transforms: one that uses the use_external_systems decorator and another that uses the external_systems decorator.
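For reference, a migrated transform on a REST API source might look like the following sketch. The source identifier, secret name, endpoint, and output path are hypothetical; the key change is that credentials are read from the imported source rather than passed in through the legacy decorator:

```python
from transforms.api import transform_df, Output
from transforms.external.systems import external_systems, Source


@external_systems(
    # hypothetical source imported into the repository
    my_source=Source("ri.magritte..source.00000000-0000-0000-0000-000000000000")
)
@transform_df(
    Output("/path/to/output/dataset")
)
def compute(my_source, ctx):
    # credentials previously referenced through @use_external_systems are now read from the source
    token = my_source.get_secret("additionalSecretToken")  # hypothetical secret name

    client = my_source.get_https_connection().get_client()
    base_url = my_source.get_https_connection().url

    # hypothetical endpoint; the response status is written to the output dataset
    response = client.get(base_url + "/status", headers={"Authorization": f"Bearer {token}"}, timeout=10)
    return ctx.spark_session.createDataFrame([(response.status_code,)], ["status_code"])
```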
Lightweight external transforms
External transforms are compatible with the @lightweight decorator. Using this decorator can dramatically increase the execution speed for transforms operating on small and medium-sized data.
The example below shows how the @lightweight decorator can be added to a transform along with the @external_systems decorator. For more information on the options for configuring lightweight transforms, see the lightweight transforms documentation.
```python
@lightweight
@external_systems(
    poke_source=Source("ri.magritte..source.e301d738-b532-431a-8bda-fa211228bba6")
)
```
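Putting this together, a complete lightweight external transform might look like the sketch below. This assumes the pandas-based output API (write_pandas) described in the lightweight transforms documentation; the output path is hypothetical:

```python
import json

import pandas as pd
from transforms.api import lightweight, transform, Output
from transforms.external.systems import external_systems, Source


@lightweight
@external_systems(
    poke_source=Source("ri.magritte..source.e301d738-b532-431a-8bda-fa211228bba6")
)
@transform(
    out=Output("/path/to/output/dataset")  # hypothetical output path
)
def compute(poke_source, out):
    client = poke_source.get_https_connection().get_client()
    base_url = poke_source.get_https_connection().url

    # fetch a single page of pokemon and keep only the names
    response = client.get(base_url + "/api/v2/pokemon?limit=100&offset=0", timeout=10)
    names = [result["name"] for result in json.loads(response.text)["results"]]

    # lightweight transforms run without Spark; write the result as a pandas DataFrame
    out.write_pandas(pd.DataFrame({"name": names}))
```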
External transforms support the generation and renewal of session credentials for the S3 source. This includes S3 sources configured with OIDC and Cloud Identity authentication. Below is an example of how to set up and begin using an S3 boto3 client.
```python
import boto3
from transforms.api import transform_df, Output
from transforms.external.systems import external_systems, Source


@external_systems(
    s3_source=Source("ri.magritte..source.9a99cc8e-e76d-4490-847b-48b975b3d80b")
)
@transform_df(
    Output("ri.foundry.main.dataset.88359185-6ecd-4138-b4fd-aa8ae6d0df0c")
)
def compute(s3_source, ctx):
    # refreshable credentials object backed by the source's session credentials
    refreshable_credentials = s3_source.get_aws_credentials()

    # read the current credential values
    credentials = refreshable_credentials.get()

    # construct a boto3 S3 client from the retrieved session credentials
    s3_client = boto3.client(
        "s3",
        aws_access_key_id=credentials.access_key_id,
        aws_secret_access_key=credentials.secret_access_key,
        aws_session_token=credentials.session_token,
    )
```
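As a follow-on sketch, the client created above can then be used like any other boto3 S3 client; the bucket name here is purely illustrative:

```python
# hypothetical usage of the client created above; the bucket name is illustrative
listing = s3_client.list_objects_v2(Bucket="example-bucket", MaxKeys=10)
for entry in listing.get("Contents", []):
    print(entry["Key"])

# for long-running jobs, fresh session credentials can be re-read from the
# refreshable object before constructing a new client
credentials = refreshable_credentials.get()
```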