Amazon S3

Connect Foundry to AWS S3 to read and sync data between S3 and Foundry.

Supported capabilities

  • Exploration: 🟢 Generally available
  • Bulk import: 🟢 Generally available
  • Incremental: 🟢 Generally available for supported file formats
  • Media sets: 🟢 Generally available
  • Virtual tables: 🟢 Generally available
  • File exports: 🟢 Generally available

Setup

  1. Open the Data Connection application and select + New Source in the upper right corner of the screen.
  2. Select S3 from the available connector types.
  3. Choose to use a direct connection over the Internet or to connect through an intermediary agent.
  4. Follow the additional configuration prompts to continue the setup of your connector using the information in the sections below.

Learn more about setting up a connector in Foundry.

Connection details

  • URL (required): The URL of the S3 bucket. Data Connection supports the s3a protocol. The URL should contain a trailing slash. See AWS's official documentation ↗ for more details.
    For example: s3://bucket-name/
  • Endpoint (required): The endpoint to use to access S3.
    For example: s3.amazonaws.com or s3.us-east-1.amazonaws.com
  • Region (optional): The AWS region to use when configuring AWS services. This is required when using STS roles. Warning: providing a region together with an S3 endpoint that also contains the region can cause failures.
    For example: us-east-1 or eu-central-1
  • Network connectivity (required for direct connections only):
    Step 1: Foundry egress policy. Attach a Foundry egress policy to the bucket to allow Foundry to egress to S3. The Data Connection application suggests appropriate egress policies based on the connection details provided.
    For example: bucket-name.s3.us-east-1.amazonaws.com (port 443)
    Step 2: AWS bucket policy. Additionally, you will need to allowlist the relevant Foundry IP and/or bucket details for access from S3. Your Foundry IP details can be found under Network Egress in the Control Panel application. See official AWS documentation ↗ for more details on how to configure bucket policies in S3. An example bucket policy follows this table.
    Note: Setting up access to an S3 bucket hosted in the same region as your Foundry enrollment requires additional configuration. Read more about these requirements in the network egress documentation.
  • Client certificates & private key (optional): Client certificates and private keys may be required by your source to secure the connection.
  • Server certificates (optional): Server certificates may be required by your source to secure the connection.
  • Credentials (required):
    Option 1: Access key and secret. Provide the access key ID and secret for connecting to S3. Credentials can be generated by creating a new IAM user for Foundry in your AWS account and granting that IAM user access to the S3 bucket.
    Option 2: OpenID Connect (OIDC). Follow the displayed source system configuration instructions to set up OIDC. See official AWS documentation ↗ for details on OpenID Connect and our documentation for details on how OIDC works with Foundry.
    See official AWS documentation ↗ for more details on creating an AWS IAM user. Review our documentation on permissions for S3 for details on which AWS permissions Foundry expects the user to have.
  • STS role (optional): The S3 connector can optionally assume a Security Token Service (STS) role ↗ when connecting to S3. See STS role configuration for more details.
  • Connection timeout (optional): The amount of time to wait (in milliseconds) when initially establishing a connection before giving up and timing out. Default: 50000.
  • Socket timeout (optional): The amount of time to wait (in milliseconds) for data to be transferred over an established, open connection before the connection times out and is closed. Default: 50000.
  • Max connections (optional): The maximum number of allowed open HTTP connections. Default: 50.
  • Max error retries (optional): The maximum number of retry attempts for failed retryable requests (for example, 5xx error responses from services). Default: 3.
  • Client KMS key (optional): A KMS key name or alias used to perform client-side data encryption with the AWS SDK. Using this option on an agent in PCloud requires proxy changes.
  • Client KMS region (optional): The AWS region to use for the KMS client. Only relevant if an AWS KMS key is provided.
  • Match subfolder exactly (optional): Match the path specified under subfolder as an exact subfolder in S3. If set to false, both s3://bucket-name/foo/bar/ and s3://bucket-name/foo/bar_baz/ will be matched with a subfolder setting of foo/bar/.
  • Proxy configurations (required for agent-based connections only): Configure proxy settings for S3.
    Note: This is required if (a) your Foundry enrollment is hosted in AWS, (b) you are connecting to an S3 bucket hosted in a different AWS region than your Foundry enrollment, and (c) you are connecting via a data connection agent. See S3 proxy configuration for more details.
  • Enable path style access (optional): Use path-style access URLs (for example, https://s3.region-code.amazonaws.com/bucket-name/key-name) instead of virtual-hosted-style access URLs (for example, https://bucket-name.s3.region-code.amazonaws.com/key-name). See official AWS documentation ↗ for more details.
  • Catalog (optional): Configure a catalog for tables stored in this S3 bucket. See Virtual tables for more details.
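
The bucket policy referenced in the network connectivity step is managed on the AWS side. As a rough sketch only, the policy below grants a hypothetical Foundry IAM principal (arn:aws:iam::123456789012:user/foundry-ingest) read access to the bucket and restricts requests to a placeholder IP range; substitute your own principal, bucket name, and the Foundry IP ranges listed under Network Egress in the Control Panel application.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowFoundryReadAccess",
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::123456789012:user/foundry-ingest" },
          "Action": ["s3:ListBucket", "s3:GetObject"],
          "Resource": [
            "arn:aws:s3:::bucket-name",
            "arn:aws:s3:::bucket-name/*"
          ],
          "Condition": {
            "IpAddress": { "aws:SourceIp": ["198.51.100.0/24"] }
          }
        }
      ]
    }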

Required read and sync permissions for S3

The following AWS permission is required for interactive exploration of the S3 bucket:

    {
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::path/to/bucket"],
      "Effect": "Allow"
    }

The following AWS permission is required for batch syncs, virtual tables and media syncs from S3:

    {
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::path/to/bucket/*"],
      "Effect": "Allow"
    }

See official AWS documentation on Policies and Permissions in Amazon S3 ↗ for more details on how to configure bucket policies in S3.
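
The statements above can be combined into a single IAM policy document attached to the Foundry IAM user or role. A minimal sketch, keeping the placeholder bucket ARN from the snippets above:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowBucketListing",
          "Effect": "Allow",
          "Action": ["s3:ListBucket"],
          "Resource": ["arn:aws:s3:::path/to/bucket"]
        },
        {
          "Sid": "AllowObjectReads",
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": ["arn:aws:s3:::path/to/bucket/*"]
        }
      ]
    }

If exports are also required, the s3:PutObject statement shown in the export section below can be added to the same policy.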

S3 proxy configuration (agent-based connections)

When connecting to S3 using a data connection agent, you can define proxy settings in two ways:

  • Source Config: Define each proxy setting directly in the source configuration, as outlined in the table below.
  • Agent's System Properties: As a fallback, you can configure the proxy settings within the agent's system properties. To achieve this, include the appropriate JVM arguments in the advanced configuration settings for the agent (for example, -Dhttps.proxyHost=example.proxy.com).
  • host (required): HTTP proxy host (no scheme).
  • port (required): Port for the HTTP proxy.
  • protocol (optional; default: HTTPS): The protocol to use, either HTTPS or HTTP.
  • nonProxyHosts (optional): List of host names (or wildcard domain names) that should not use the proxy. For example: *.s3-external-1.amazonaws.com
  • credentials (optional): Include this block if your proxy requires basic HTTP authentication (prompted by an HTTP 407 response ↗).
  • credentials.username (optional): Plaintext username for the HTTP proxy.
  • credentials.password (optional): Encrypted password for the HTTP proxy.
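
To show how the fields in the table relate to one another, the sketch below nests the credentials fields under a credentials block. This is only an illustration of the field structure, not the literal syntax of a Foundry source configuration, and the host, port, and username values are placeholders.

    {
      "host": "proxy.example.com",
      "port": 8080,
      "protocol": "HTTPS",
      "nonProxyHosts": ["*.s3-external-1.amazonaws.com"],
      "credentials": {
        "username": "proxy-user",
        "password": "<encrypted password>"
      }
    }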

STS role configuration

STS role configuration allows you to make use of AWS Security Token Service ↗ to assume a role when reading from S3.

  • roleArn (required): The ARN of the STS role to assume.
  • roleSessionName (required): The session name to use when assuming this role.
  • roleSessionDuration (optional; default: 3600 seconds): The session duration.
  • externalId (optional): An external ID to use when assuming a role.
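
For the role assumption to succeed, the trust policy on the role identified by roleArn must allow the principal behind the connector's credentials to call sts:AssumeRole. A minimal sketch, assuming a hypothetical IAM user arn:aws:iam::123456789012:user/foundry-ingest and an externalId condition (omit the condition if no external ID is configured):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::123456789012:user/foundry-ingest" },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": { "sts:ExternalId": "example-external-id" }
          }
        }
      ]
    }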

Cloud identity configuration

Cloud identity authentication allows Foundry to access resources in your AWS instance. Cloud identities are configured and managed at the enrollment level in Control Panel. Learn how to configure cloud identities.

[Image: S3 source configured with a cloud identity credential]

When using cloud identity authentication, the role ARN will be displayed in the credentials section. After selecting the Cloud identity credential option, you must also configure the following:

  1. Configure an Identity and Access Management (IAM) role in the target AWS account.
  2. Grant the IAM role access to the S3 bucket to which you wish to connect. You can generally do this with a bucket policy ↗.
  3. In the S3 source configuration details, add the IAM role under the Security Token Service (STS) role ↗ configuration. The cloud identity IAM role in Foundry will attempt to assume the AWS Account IAM role ↗ when accessing S3.
  4. Configure a corresponding trust policy ↗ to allow the cloud identity IAM role to assume the target AWS account IAM role.
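
To illustrate step 4, the trust policy on the target AWS account IAM role names the Foundry cloud identity IAM role as a trusted principal. The sketch below uses a placeholder role ARN; use the ARN displayed in the credentials section of the source configuration:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::123456789012:role/foundry-cloud-identity-role" },
          "Action": "sts:AssumeRole"
        }
      ]
    }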

Virtual tables

This section provides additional details around using virtual tables from an S3 source. This section is not applicable when syncing to Foundry datasets.

  • Source formats: 🟢 Generally available: Avro ↗, Delta ↗, Iceberg ↗, Parquet ↗
  • Manual registration: 🟢 Generally available
  • Automatic registration: 🔴 Not available
  • Pushdown compute: 🔴 Not available
  • Incremental pipeline support:
    🟢 Generally available for Delta tables: APPEND only (details)
    🟢 Generally available for Iceberg tables: APPEND only (details)
    🔴 Not available for Parquet tables

When registering virtual tables, remember the following source configuration requirements:

  • You must set up the source as a direct connection. Virtual tables do not support use of intermediary agents.
  • Ensure that bi-directional connectivity and allowlisting are established as described under the Network connectivity section in Connection details.
  • If using virtual tables in Code Repositories, refer to the Virtual Tables documentation for details on the additional source configuration required.
  • If your bucket name contains a period (.), you must enable path-style access and set up the appropriate egress policy.

See the Connection Details section above for more details.

Delta

To enable incremental support for pipelines backed by virtual tables, ensure that Change Data Feed ↗ is enabled on the source Delta table. The current and added read modes in Python Transforms are supported. The _change_type, _commit_version and _commit_timestamp columns will be made available in Python Transforms.

Iceberg

An Iceberg catalog is required to load virtual tables backed by an Apache Iceberg table. To learn more about Iceberg catalogs, see the Apache Iceberg documentation ↗. All Iceberg tables registered on a source must use the same Iceberg catalog.

By default, tables will be created using Iceberg metadata files in S3. A warehousePath indicating the location of these metadata files must be provided when registering a table.

AWS Glue ↗ can be used as an Iceberg catalog when tables are stored in S3. To learn more about this integration, see the AWS Glue documentation ↗. The credentials configured on the source must have access to your AWS Glue Data Catalog. AWS Glue can be configured in the Connection Details tab on the source. All Iceberg tables registered on this source will automatically use AWS Glue as the catalog. Tables should be registered using the database_name.table_name naming pattern.

Unity Catalog ↗ can be used as an Iceberg catalog when using Delta Universal Format (UniForm) in Databricks. To learn more about this integration, see the Databricks documentation ↗. As with AWS Glue, the catalog can be configured in the Connection Details tab on the source. You will need to provide the endpoint and a personal access token to connect to Unity Catalog. Tables should be registered using the catalog_name.schema_name.table_name naming pattern.

[Image: virtual tables catalog configuration on an S3 source]

Incremental support relies on Iceberg Incremental Reads ↗ and is currently append-only. The current and added read modes in Python Transforms are supported.

Parquet

Virtual tables using Parquet rely on schema inference. At most 100 files will be used to determine the schema.

Export data to S3

To export to S3, first enable exports for your S3 connector. Then, create a new export.

Required export permissions for S3

The following AWS permission is required to export data to S3:

    {
      "Action": ["s3:PutObject"],
      "Resource": ["arn:aws:s3:::path/to/bucket/*"],
      "Effect": "Allow"
    }

See official AWS documentation on Policies and Permissions in Amazon S3 ↗ for more details on how to configure bucket policies in S3.

Export configuration options

  • Path prefix (optional): The path prefix that should be used for exported files. The full path for an exported file is calculated as s3://<bucket-name>/<path-in-source-config>/<path-prefix>/<exported-file>
  • Canned ACL (optional): Set the AWS access control list (ACL) attached to the uploaded files, using one of the canned ACLs. See AWS documentation ↗ for a description of each ACL.