The S3-compatible API for Foundry datasets allows you to interact with Foundry datasets as though they are S3 buckets. Learn how datasets behave when accessed through the API, and view the setup guide and examples.
Foundry exposes a subset of the Amazon Simple Storage Service (S3) API ↗, allowing you to interact with Foundry datasets using clients that know how to speak to S3 storage services. Examples include the AWS CLI, AWS SDKs, Hadoop S3 filesystem, and Cyberduck.
The S3 API is not fully implemented, as not all S3 concepts map naturally to concepts in Foundry. For example, creation and deletion of buckets (which represent Foundry datasets) is not currently supported; datasets should be created in Foundry before using the API. However, the majority of file read/write/delete workflows are supported, including multipart uploads. See Supported actions for a list of which S3 actions are supported.
This section outlines how S3 concepts map to Foundry dataset concepts.
S3 buckets correspond to Foundry datasets, with the bucket name being the Foundry dataset's unique identifier (for example, `ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7`).

S3 object keys correspond to the logical paths of files within a Foundry dataset (for example, `top-level-file.csv` or `subfolder/nested-file.csv`).
The API supports accessing dataset branches with alphanumeric names (containing only `a-z`, `A-Z`, or `0-9`) or the special characters `-` and `_`. To specify a branch, modify the bucket name by appending the branch name, separated by a period: `<dataset-rid>.<branch-name>`. If no branch is specified, the default branch is used.

For example, to access the `mybranch` branch of the dataset with RID `ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7`, use the bucket name `ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7.mybranch`.
S3 bucket name validation imposes the character restrictions described above. If your branch name contains characters that are not allowed, such as `/`, you can Base64-encode the branch name, omitting any trailing `=` padding characters. For example, `feature/my-branch` is Base64-encoded as `ZmVhdHVyZS9teS1icmFuY2g=`; `ZmVhdHVyZS9teS1icmFuY2g` (without the trailing `=`) can be used to reference this branch in the bucket name.
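The encoding can be produced with a few lines of Python; a minimal sketch using only the standard library (the branch name is an example):

```python
import base64

branch = "feature/my-branch"
# Base64-encode the branch name and strip trailing "=" padding,
# as required for use in a bucket name.
encoded = base64.b64encode(branch.encode("utf-8")).decode("ascii").rstrip("=")
print(encoded)  # ZmVhdHVyZS9teS1icmFuY2g
```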
The API does not support branch creation; specified branches must already exist on the dataset.
The API supports accessing dataset transactions in a similar way to branches, allowing users to access historical versions of a dataset. For example, to access the `ri.foundry.main.transaction.0cdfe8c9-f595-4859-a194-7daecff9d6fe` transaction of the dataset with RID `ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7`, use the bucket name `ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7.ri.foundry.main.transaction.0cdfe8c9-f595-4859-a194-7daecff9d6fe`.
Only committed transactions may be accessed in this way. As a result, the bucket will be read-only; it is not possible to put or delete objects when using a bucket name that includes a transaction identifier.
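To make the addressing concrete, here is a minimal boto3 sketch that lists files at the default branch, at a named branch, and at a committed transaction. The RIDs, branch name, and credentials are placeholders, and the client configuration follows the setup guide later on this page:

```python
import boto3

# Client configuration as described in the setup guide below.
s3 = boto3.client(
    's3',
    aws_access_key_id="<ACCESS_KEY_ID>",
    aws_secret_access_key="<SECRET_ACCESS_KEY>",
    endpoint_url="https://<FOUNDRY_URL>/io/s3",
    region_name="foundry",
)

dataset_rid = "ri.foundry.main.dataset.<uuid>"
transaction_rid = "ri.foundry.main.transaction.<uuid>"

# Default branch: the bucket name is the dataset RID itself.
default = s3.list_objects_v2(Bucket=dataset_rid)

# Named branch: append the branch name after a period.
branch = s3.list_objects_v2(Bucket=f"{dataset_rid}.mybranch")

# Committed transaction: append the transaction RID (this view is read-only).
historical = s3.list_objects_v2(Bucket=f"{dataset_rid}.{transaction_rid}")

for response in (default, branch, historical):
    print([obj["Key"] for obj in response.get("Contents", [])])
```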
S3 does not have the notion of transactions, so Foundry dataset transactions are handled automatically with the following behavior:

- Writes are made using `UPDATE` transactions and deletes using `DELETE` transactions.
- It is not possible for an `UPDATE` and a `DELETE` transaction to be active at the same time.
- Open `UPDATE` and `DELETE` transactions will be immediately committed where possible. This aims to preserve the read-after-write semantics that are generally expected with S3. However, if reads are attempted while there are still active write requests to the open transaction, the read will happen from the latest committed view. To guarantee a transaction commits after all writes or deletes are complete, you can issue a subsequent read request prior to any new write or delete requests.

Given the above behaviors, read-after-write semantics are not guaranteed. However, every effort is made to provide them where possible.
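To illustrate, the following hedged sketch writes two files and then issues a read so that the open `UPDATE` transaction can commit before any later write or delete requests; the bucket and keys are placeholders, and `s3` is assumed to be a client configured as in the sketch above:

```python
bucket = "ri.foundry.main.dataset.<uuid>"

# Both uploads are written into the same open UPDATE transaction.
s3.put_object(Bucket=bucket, Key="data/part-0.csv", Body=b"id,value\n1,a\n")
s3.put_object(Bucket=bucket, Key="data/part-1.csv", Body=b"id,value\n2,b\n")

# A read issued after the writes encourages the open transaction to commit,
# so that subsequent writes start a fresh transaction.
response = s3.list_objects_v2(Bucket=bucket, Prefix="data/")
print([obj["Key"] for obj in response.get("Contents", [])])
```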
Connections through the API are authenticated using access key ID and secret access key credentials.
Static credentials are similar to standard AWS access key ID and secret access key credentials. Static credentials are long-lived and, in the Foundry case, are associated with the service user of a third-party application registered in Foundry's Control Panel.
When using a static access key ID and secret access key to connect to datasets through the S3-compatible API, the access level is determined by the access granted to the third-party application's service user. Static credentials must also be restricted to at least one Project. The Project restrictions are specified when generating a new set of credentials.
See the setup guide below for guidance on using the `/io/s3/api/v2/credentials` API endpoint to generate these credentials.
We recommend using static credentials in any workflow that requires long-lived credentials or where it is beneficial to tie access to a service user. If you prefer authenticating to the API as your regular Foundry user, we support exchanging a user authentication token for temporary S3 credentials. This token could also be obtained through one of the OAuth2 grants for a third-party application.
Temporary credentials are obtained using the standard AssumeRoleWithWebIdentity ↗ Security Token Service (STS) API. We only require the `WebIdentityToken` request parameter to be present and configured with a regular Foundry token as described above. The temporary credentials returned will have an identical scope to that of the provided token. The `DurationSeconds` parameter can be provided to specify the lifetime of the credentials. The credentials have a maximum lifetime of one hour and will never exceed the lifetime of the Foundry token used to obtain them.
The STS API can be accessed at `https://<FOUNDRY_URL>/io/s3`. If you wish to obtain STS credentials programmatically, configure this URL as the endpoint in standard STS clients or credential providers. For example:
```python
import boto3

endpoint = "https://<FOUNDRY_URL>/io/s3"

session = boto3.session.Session()
client = session.client(service_name='sts', endpoint_url=endpoint)

# RoleArn and RoleSessionName are required parameters in boto3 despite being unused.
# `token` is a valid Foundry token, obtained as described above.
credentials = client.assume_role_with_web_identity(
    RoleArn="xxxxxxxxxxxxxxxxxxxx",
    RoleSessionName="xxxxx",
    WebIdentityToken=token
)["Credentials"]
```
Alternatively, you can access the API directly using cURL or an equivalent tool, as in the example below. `<TOKEN>` should be replaced with a valid Foundry token.
```bash
curl -X POST \
  "https://<FOUNDRY_URL>/io/s3?Action=AssumeRoleWithWebIdentity&WebIdentityToken=<TOKEN>"
```
You will receive session credentials in the XML response, as shown below. These credentials should be securely stored.
```xml
<?xml version='1.0' encoding='UTF-8'?>
<AssumeRoleWithWebIdentityResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
  <AssumeRoleWithWebIdentityResult>
    <Credentials>
      <AccessKeyId>PLTRLZZJE0...</AccessKeyId>
      <SecretAccessKey>2j3hKX4EDP...</SecretAccessKey>
      <SessionToken>eyJwbG50ci...</SessionToken>
      <Expiration>2023-08-30T10:55:08.841403951Z</Expiration>
    </Credentials>
  </AssumeRoleWithWebIdentityResult>
</AssumeRoleWithWebIdentityResponse>
```
Once you have temporary credentials, navigate to step four in the setup guide below for guidance on configuring S3 clients. You do not need to follow any steps regarding third-party applications. When configuring S3 clients, you must provide the session token, the access key ID, and secret access key.
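For example, the returned values can be passed directly into an S3 client; a minimal boto3 sketch, reusing the `credentials` dictionary from the STS example above:

```python
import boto3

# `credentials` is the dictionary returned by assume_role_with_web_identity above.
s3 = boto3.client(
    's3',
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
    endpoint_url="https://<FOUNDRY_URL>/io/s3",
    region_name="foundry",
)
print(s3.list_objects_v2(Bucket="ri.foundry.main.dataset.<uuid>"))
```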
To read or write data via the S3-compatible API, users need the `s3-proxy:datasets-read` and `s3-proxy:datasets-write` permissions, which are granted by default to the `Viewer` and `Editor` roles respectively. When using static credentials, the service user corresponding to the third-party application will need to be granted the relevant role. When using temporary credentials, the user obtaining credentials will need the relevant role.
The API supports only path-style ↗ bucket access. Your bucket URLs will take the format `https://<FOUNDRY_URL>/io/s3/<bucket-name>/<key-name>`. In Foundry terms, this means `https://<FOUNDRY_URL>/io/s3/<dataset-rid>/<logical-filepath>`.
Virtual-hosted-style bucket access (where the bucket name is included in the subdomain of the URL) is currently not supported.
Presigned URLs are supported for the `putObject` operation.
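For instance, a presigned upload URL can be generated with boto3 and used by any HTTP client; a minimal sketch with placeholder names, using `requests` for the upload and an example one-hour expiry:

```python
import boto3
import requests

s3 = boto3.client(
    's3',
    aws_access_key_id="<ACCESS_KEY_ID>",
    aws_secret_access_key="<SECRET_ACCESS_KEY>",
    endpoint_url="https://<FOUNDRY_URL>/io/s3",
    region_name="foundry",
)

# Presign a putObject request for this bucket and key, valid for one hour.
url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "ri.foundry.main.dataset.<uuid>", "Key": "upload/example.csv"},
    ExpiresIn=3600,
)

# The holder of the URL can upload the object without further credentials.
requests.put(url, data=b"id,value\n1,a\n").raise_for_status()
```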
Follow these steps to set up a connection to the API from your S3 client.
View example configuration settings for specific S3 clients.
To obtain credentials for the S3-compatible API you will first need to obtain the client ID and secret of a third-party application that has been created in Control Panel in Foundry:
Review Concepts: Authentication to understand the requirements and behavior for project restrictions, and the scope of generated credentials.
When you created the third-party application in the previous step, Foundry created a service user automatically. To access datasets via the S3 API, this service user must have sufficient permissions on the relevant Projects and Markings.
To set up permissions for the service user:
To generate credentials, you will need the `User experience administrator` role on the Organization in Control Panel.
Run the terminal command below (using either cURL or PowerShell) to receive an access key ID and secret access key. Replace `<TOKEN>` with an active token for your user account, and `<CLIENT_ID>` with the client ID of the third-party application generated in the previous step. Additionally, you must replace `<PROJECT_RID>` with the RID of a Project to which the credentials will have access. The `projectRestrictions` value can take multiple Projects, allowing you to list more Project RIDs if necessary.
```bash
curl -X POST \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-type: application/json" \
  --data '{"clientId":"<CLIENT_ID>","projectRestrictions":["<PROJECT_RID>"]}' \
  https://<FOUNDRY_URL>/io/s3/api/v2/credentials
```
```powershell
$headers = @{
    "Authorization" = "Bearer <TOKEN>"
    "Content-type" = "application/json"
}

$body = @{
    "clientId" = "<CLIENT_ID>"
    "projectRestrictions" = @("<PROJECT_RID>")
} | ConvertTo-Json

Invoke-WebRequest -Uri "https://<FOUNDRY_URL>/io/s3/api/v2/credentials" -Method POST -Headers $headers -Body $body
```
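If you are scripting this step, the same request can also be issued from Python; a minimal sketch using the `requests` library, equivalent to the cURL and PowerShell commands above:

```python
import requests

response = requests.post(
    "https://<FOUNDRY_URL>/io/s3/api/v2/credentials",
    headers={"Authorization": "Bearer <TOKEN>"},
    json={"clientId": "<CLIENT_ID>", "projectRestrictions": ["<PROJECT_RID>"]},
)
response.raise_for_status()
# The JSON response contains the access key ID and secret access key.
print(response.json())
```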
Securely store the access key and secret key you receive in the response of this request. You must configure clients with these credentials, not the third-party application's client ID and secret.
By default, the `/v2/credentials` endpoint assumes the authenticating user is generating credentials for a third-party application in their own Organization. If the third-party application exists in a different Organization, specify the Organization ID as a query parameter in the URL: `https://<FOUNDRY_URL>/io/s3/api/v2/credentials?organizationRid=<ORGANIZATION_ID>`.
If you need to revoke an access key and secret access key, call the following endpoint, replacing `<ACCESS_KEY_ID>` with the access key ID that you wish to revoke:
```bash
curl -X DELETE \
  -H "Authorization: Bearer <TOKEN>" \
  https://<FOUNDRY_URL>/io/s3/api/v2/credentials/<ACCESS_KEY_ID>
```
To configure an S3 client you must set the following configuration parameters. See the examples below for details on how these should be configured in common S3 clients.
| Name | Value | Description |
|---|---|---|
| Hostname / Endpoint | `https://<FOUNDRY_URL>/io/s3` | The hostname to which clients should connect (rather than `s3.amazonaws.com` for native S3 buckets hosted in AWS). |
| Region | `foundry` | The region must be set to `foundry` as it is used as part of the V4 signature verification ↗ process. If clients can only use standard AWS regions, use `us-east-1`. |
| Credentials | Access key ID and secret access key, and optionally, session token | Static or temporary credentials generated as described above. |
| Path-style access | `true` | The API only supports path-style ↗ bucket access rather than virtual-hosted-style bucket access. |
| Bucket Name (Optional) | `ri.foundry.main.dataset.<uuid>` | Each Foundry dataset is accessible as a separate S3 bucket, with the bucket name being the dataset's RID. |
The following S3 actions ↗ are supported:
As discussed above, you must ensure the client, SDK, or connector is configured to use path-style bucket access. If the client does not support path-style bucket access, it is currently not compatible with this API. For example, with the S3A Hadoop client ↗, this can be configured using the `fs.s3a.path.style.access` flag.
If you are using temporary credentials, be sure to also configure the AWS session token. Consult the relevant AWS client documentation for details. For example, for the AWS CLI ↗ you should set the `AWS_SESSION_TOKEN` environment variable.
Once you have an access key ID and secret access key, you are ready to configure the AWS CLI ↗. Run the following command, entering the access key ID, secret access key, and region.
```
$ aws configure --profile foundry
AWS Access Key ID [None]: <ACCESS_KEY_ID>
AWS Secret Access Key [None]: <SECRET_ACCESS_KEY>
Default region name [None]: foundry
Default output format [None]:
```
You should now be able to run commands for the `foundry` profile. For example:

```bash
aws --profile foundry --endpoint-url https://<FOUNDRY_URL>/io/s3 s3 ls s3://<DATASET_RID>
```
As of a recent release ↗ of the AWS CLI, it is possible to configure the `endpoint_url` as part of the profile configuration. A sample `foundry` profile as it would be configured in `~/.aws/config` is shown below. When configuring a profile with the `endpoint_url` property, you no longer need to include the `--endpoint-url` argument when using the `aws` command; `--profile` is sufficient.

```ini
[profile foundry]
region = foundry
endpoint_url = https://<FOUNDRY_URL>/io/s3
```
To use temporary credentials with the AWS CLI, follow the AWS documentation ↗ that explains how to configure the CLI to make the AWS STS `AssumeRoleWithWebIdentity` call for you. A sample `foundry` profile as it would be configured in `~/.aws/config` is shown below. When using this configuration, you do not need to configure credentials in environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`) or in `~/.aws/credentials`.

```ini
[profile foundry]
region = foundry
endpoint_url = https://<FOUNDRY_URL>/io/s3
web_identity_token_file = ~/.foundry/web-identity.token
role_arn = xxxxxxxxxxxxxxxxxxxx
```
The example above assumes a valid Foundry token is stored in a file at `~/.foundry/web-identity.token`. We only recommend this approach if this file is properly secured and not at risk of being leaked. The `role_arn` property is not used but must still be provided and be at least 20 characters long due to AWS CLI validations; we use `xxxxxxxxxxxxxxxxxxxx` as a placeholder in the example. To use this configuration, you must configure the `endpoint_url` in `~/.aws/config` rather than using `--endpoint-url`, as discussed above.
```python
import boto3
import pandas as pd

s3 = boto3.client(
    's3',
    aws_access_key_id="<ACCESS_KEY_ID>",
    aws_secret_access_key="<SECRET_ACCESS_KEY>",
    # aws_session_token="<SESSION_TOKEN>",  # only needed when using temporary credentials
    endpoint_url="https://<FOUNDRY_URL>/io/s3",
    region_name="foundry"
)

bucket = 'ri.foundry.main.dataset.<uuid>'
key = 'iris.csv'

# Read a CSV file from the dataset into a pandas DataFrame.
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'])
print(df)
```
Review the S3 section of Boto3's documentation ↗ for more information on connecting to S3-compatible sources using Boto3.
```python
from pyspark.sql import SparkSession

hostname = "https://<FOUNDRY_URL>/io/s3"
access_key_id = "<ACCESS_KEY_ID>"
secret_access_key = "<SECRET_ACCESS_KEY>"

# ensure the dataset RID can be parsed as a valid hostname
dataset_rid = "ri.foundry.main.dataset.<uuid>".replace('.', '-')

spark_session = (
    SparkSession.builder
    .config("fs.s3a.access.key", access_key_id)
    .config("fs.s3a.secret.key", secret_access_key)
    # .config("fs.s3a.session.token", session_token)  # only needed when using temporary credentials
    .config("fs.s3a.endpoint", hostname)
    .config("fs.s3a.endpoint.region", "foundry")
    .config("fs.s3a.path.style.access", "true")
    .getOrCreate()
)

df = spark_session.read.parquet(f"s3a://{dataset_rid}/*")
df.show()
```
Review the Spark documentation ↗ for more information on using Spark with S3.
There is a known issue using the Hadoop AWS client with bucket names that contain `.`; you may encounter an error message such as "bucket is null/empty". If this occurs, the dataset RID cannot be parsed as a valid hostname. As a workaround, you can substitute `.` in the dataset RID with `-`.
1. Download the following profile: Foundry S3.cyberduckprofile ↗
2. Double-click the profile file to open and register the profile in Cyberduck.
3. Set the following connection properties on the default bookmark that Cyberduck created for you:
   - Server: `https://<FOUNDRY_URL>`
   - Access Key ID: `<ACCESS_KEY_ID>`
   - Secret Access Key: `<SECRET_ACCESS_KEY>`
   - Path: `ri.foundry.main.dataset.<uuid>`
4. Close the bookmark settings window.
5. Double-click the bookmark to open a connection.
Review the Cyberduck documentation ↗ for more information on connecting to S3-compatible sources.
Google Cloud's Storage Transfer Service ↗ can treat Foundry as an S3-compatible source ↗. You can transfer data from a Foundry dataset to a Cloud Storage bucket by following the below steps.
1. Set up an agent pool and transfer agents by following Google Cloud's instructions ↗.
2. Create a transfer job, and select S3-compatible object storage as the Source type. Then select the agent pool you created in step (1) and ensure the following configuration is set:
   - Bucket or folder: `ri.foundry.main.dataset.<uuid>`
   - Endpoint: `https://<FOUNDRY_URL>/io/s3`
   - Signing region: `foundry`
   - Network protocol: `HTTPS`
   - Listing API version: `ListObjectsV2`
3. Complete the setup to configure the Cloud Storage bucket destination, schedule, and settings of your transfer job.

You can use Apache NiFi to read and write files inside a Foundry dataset. The following example shows how to configure the `PutS3Object` processor type for writing:
- Object Key: `path/to/file.csv`
- Bucket: `ri.foundry.main.dataset.<uuid>`
- Access Key ID: `<ACCESS_KEY_ID>`
- Secret Access Key: `<SECRET_ACCESS_KEY>`
- Region: `us-east-1`
- Use Path Style Access: `true`
- Endpoint Override URL: `https://<FOUNDRY_URL>/io/s3`
Refer to the Apache NiFi documentation ↗ for more information on the `PutS3Object` processor and other processors that support S3-compatible sources.
Airbyte ↗'s support for S3 destinations can be used to write files to Foundry datasets. Set the following destination settings:

- S3 Key ID: `<ACCESS_KEY_ID>`
- S3 Access Key: `<SECRET_ACCESS_KEY>`
- S3 Bucket Name: `ri.foundry.main.dataset.<uuid>`
- S3 Bucket Region: `us-east-1`
- S3 Endpoint: `https://<FOUNDRY_URL>/io/s3`
Refer to Airbyte's documentation for S3 destinations ↗ for more information and configuration options.
DuckDB ↗'s support for S3 can be used to query Foundry datasets. You can manage credentials using DuckDB secrets and query datasets using `s3://`-prefixed URLs.
```sql
CREATE SECRET foundry_secret (
    TYPE S3,
    KEY_ID '<ACCESS_KEY_ID>',
    SECRET '<SECRET_ACCESS_KEY>',
    REGION 'foundry',
    ENDPOINT '<FOUNDRY_URL>/io/s3',
    URL_STYLE 'path'
);

CREATE TABLE new_tbl AS SELECT * FROM 's3://ri.foundry.main.dataset.<uuid>/spark/*.parquet';
```
Refer to the DuckDB documentation ↗ for more information.
In the secret configuration above, the `ENDPOINT` configuration parameter should not include the `https://` scheme. The URL scheme is handled automatically by the `USE_SSL` parameter, which defaults to `true`.
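The same secret and query can also be run through DuckDB's Python client; a minimal sketch assuming the `httpfs` extension is available (the values mirror the SQL above):

```python
import duckdb

con = duckdb.connect()
# httpfs provides the S3 support; recent DuckDB versions can autoload it.
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("""
    CREATE SECRET foundry_secret (
        TYPE S3,
        KEY_ID '<ACCESS_KEY_ID>',
        SECRET '<SECRET_ACCESS_KEY>',
        REGION 'foundry',
        ENDPOINT '<FOUNDRY_URL>/io/s3',
        URL_STYLE 'path'
    );
""")
df = con.execute(
    "SELECT * FROM 's3://ri.foundry.main.dataset.<uuid>/spark/*.parquet'"
).fetchdf()
print(df)
```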
Polars' ↗ support for S3 can be used to query Foundry datasets.
```python
import polars as pl

hostname = "https://<FOUNDRY_URL>/io/s3"
access_key_id = "<ACCESS_KEY_ID>"
secret_access_key = "<SECRET_ACCESS_KEY>"
dataset_rid = "ri.foundry.main.dataset.<uuid>"

storage_options = {
    "aws_access_key_id": access_key_id,
    "aws_secret_access_key": secret_access_key,
    "aws_region": "foundry",
    "endpoint_url": hostname
}

df = pl.scan_parquet(f"s3://{dataset_rid}/spark/*.parquet", storage_options=storage_options)
```
Refer to the Polars documentation ↗ for more information.