Connect Foundry to Google Cloud Storage to sync files between Foundry datasets and storage buckets.
Capability | Status |
---|---|
Exploration | 🟢 Generally available |
Bulk import | 🟢 Generally available |
Incremental | 🟢 Generally available |
Virtual tables | 🟢 Generally available |
Export tasks | 🟡 Sunset |
File exports | 🟢 Generally available |
The connector can transfer files of any type into Foundry datasets. File formats are preserved, and no schemas are applied during or after the transfer. Apply any necessary schema to the output dataset, or write a downstream transformation to access the data.
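For illustration, a minimal downstream transform could parse raw CSV files from the synced dataset and apply an explicit schema. This is a sketch assuming the Foundry Python Transforms API; the dataset paths and column names are hypothetical.

```python
# A minimal sketch, assuming Python Transforms and hypothetical dataset
# paths/columns; adapt to your Project structure and file format.
import csv

from pyspark.sql import types as T
from transforms.api import transform, Input, Output

# Explicit schema applied to the otherwise schema-less synced files.
SCHEMA = T.StructType([
    T.StructField("id", T.StringType()),
    T.StructField("value", T.DoubleType()),
])


@transform(
    raw=Input("/Project/datasets/gcs_raw_files"),    # hypothetical path
    parsed=Output("/Project/datasets/gcs_parsed"),   # hypothetical path
)
def parse_raw_files(ctx, raw, parsed):
    rows = []
    fs = raw.filesystem()
    for file_status in fs.ls(glob="*.csv"):
        with fs.open(file_status.path) as fh:
            for record in csv.DictReader(fh):
                rows.append((record["id"], float(record["value"])))
    parsed.write_dataframe(ctx.spark_session.createDataFrame(rows, SCHEMA))
```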
There is no limit to the size of transferable files. However, network issues can result in failures of large-scale transfers. In particular, direct cloud syncs that take more than two days to run will be interrupted. To avoid network issues, we recommend using smaller file sizes and limiting the number of files that are ingested in every execution of the sync. Syncs can be scheduled to run frequently.
Learn more about setting up a connector in Foundry.
You must have a Google Cloud IAM service account ↗ to proceed with Google Cloud Storage authentication and setup.
The following roles are required on the bucket being accessed:

- **Storage Object Viewer**: Read data.
- **Storage Object Creator**: Export data to Google Cloud Storage.
- **Storage Object Admin**: Delete files from Google Cloud Storage after importing them into Foundry.

Learn more about required roles in the Google Cloud documentation on access control ↗.
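To sanity-check the service account's permissions before configuring the connector, a short script using the `google-cloud-storage` client library can confirm read and write access. This is an optional sketch; the bucket name and key file path are placeholders.

```python
# Optional sketch using the google-cloud-storage client library to verify
# the service account can read and write objects. Bucket name and key file
# path below are placeholders.
from google.cloud import storage

client = storage.Client.from_service_account_json("key.json")  # placeholder path
bucket = client.bucket("my-foundry-sync-bucket")               # placeholder bucket

# Storage Object Viewer: listing and reading should succeed.
for blob in client.list_blobs(bucket, max_results=5):
    print("readable:", blob.name)

# Storage Object Creator: writing a test object should succeed.
bucket.blob("foundry-connectivity-check.txt").upload_from_string("ok")
```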
Choose from one of the available authentication methods:
- **GCP instance account**: Refer to the Google Cloud documentation ↗ for information on how to set up instance-based authentication.
- **Service account key file**: Refer to the Google Cloud documentation ↗ for information on how to set up service account key file authentication.
- **Workload Identity Federation (OIDC)**: Follow the displayed source system configuration instructions to set up OIDC. Refer to the Google Cloud documentation ↗ for details on Workload Identity Federation and our documentation for details on how OIDC works with Foundry.
The Google Cloud Storage connector requires network access to the following domains on port 443:
- `storage.googleapis.com`
- `oauth2.googleapis.com`
- `accounts.google.com`
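To confirm that egress to these domains is open, for example from an agent host, a short socket check can help. This is a sketch, not an official diagnostic.

```python
# Sketch: verify outbound TCP connectivity on port 443 to the domains the
# connector needs. Run from the host that executes the syncs.
import socket

DOMAINS = (
    "storage.googleapis.com",
    "oauth2.googleapis.com",
    "accounts.google.com",
)

for host in DOMAINS:
    try:
        with socket.create_connection((host, 443), timeout=5):
            print(f"{host}:443 reachable")
    except OSError as exc:
        print(f"{host}:443 unreachable: {exc}")
```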
The following configuration options are available for the Google Cloud Storage connector:
Option | Required? | Description |
---|---|---|
Project Id | Yes | The ID of the Google Cloud project containing the Cloud Storage bucket. |
Bucket name | Yes | The name of the bucket to read/write data to and from. |
Credentials settings | Yes | Configure using the Authentication guidance shown above. |
Proxy settings | No | Enable to use a proxy while connecting to Google Cloud Storage. |
The Google Cloud Storage connector uses the file-based sync interface. See documentation on configuring file-based syncs.
This section provides additional details about using virtual tables from a Google Cloud Storage source. This section is not applicable when syncing to Foundry datasets.
Virtual tables capability | Status |
---|---|
Source formats | 🟢 Generally available: Avro ↗, Delta ↗, Iceberg ↗, Parquet ↗ |
Manual registration | 🟢 Generally available |
Automatic registration | 🔴 Not available |
Pushdown compute | 🔴 Not available |
Incremental pipeline support | 🟢 Generally available for Delta tables: APPEND only (details)<br>🟢 Generally available for Iceberg tables: APPEND only (details)<br>🔴 Not available for Parquet tables |
When using virtual tables, remember the following source configuration requirements:

- Credentials must be configured using `JSON Credentials`, `PKC8 Auth`, or `Workload Identity Federation (OIDC)`. Other credential options are not supported when using virtual tables.
- To enable incremental support for pipelines backed by virtual tables, ensure that Change Data Feed ↗ is enabled on the source Delta table. The `current` and `added` read modes in Python Transforms are supported, and the `_change_type`, `_commit_version`, and `_commit_timestamp` columns will be made available in Python Transforms (see the sketch after this list).
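The sketch below illustrates that pattern, assuming an incremental Python transform reading a virtual Delta table with Change Data Feed enabled; the dataset paths are hypothetical.

```python
# Sketch: incremental Python transform over a virtual Delta table with
# Change Data Feed enabled. Dataset paths are hypothetical.
from transforms.api import incremental, transform, Input, Output


@incremental()
@transform(
    source=Input("/Project/virtual_tables/orders"),   # virtual Delta table
    out=Output("/Project/datasets/orders_clean"),
)
def process_appends(source, out):
    # "added" reads only the rows appended since the previous build.
    df = source.dataframe("added")
    # The CDF metadata columns are exposed alongside the data;
    # drop them before writing downstream.
    df = df.drop("_change_type", "_commit_version", "_commit_timestamp")
    out.write_dataframe(df)
```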
An Iceberg catalog is required to load virtual tables backed by an Apache Iceberg table. To learn more about Iceberg catalogs, see the Apache Iceberg documentation ↗. All Iceberg tables registered on a source must use the same Iceberg catalog.
Tables will be created using Iceberg metadata files in GCS. A `warehousePath` indicating the location of these metadata files must be provided when registering a table.
Incremental support relies on Iceberg Incremental Reads ↗ and is currently append-only. The `current` and `added` read modes in Python Transforms are supported.
Virtual tables using Parquet rely on schema inference. At most 100 files will be used to determine the schema.
The connector can copy files from a Foundry dataset to any location on the Google Cloud Storage bucket.
To begin exporting data, you must configure an export task. Navigate to the Project folder that contains the Google Cloud Storage connector to which you want to export. Right select on the connector name, then select **Create Data Connection Task**.
In the left panel of the Data Connection view:

- Verify that the **Source** name matches the connector you want to use.
- Choose an **Input** named `inputDataset`. The input dataset is the Foundry dataset being exported.
- Choose an **Output** named `outputDataset`. The output dataset is used to run, schedule, and monitor the task.

The labels for the connector and input dataset that appear in the left side panel do not reflect the names defined in the YAML.
Use the following options when creating the export task YAML:
Option | Required? | Description |
---|---|---|
directoryPath | Yes | The directory in Cloud Storage where files will be written. |
excludePaths | No | A list of regular expressions; files with names matching these expressions will not be exported. |
uploadConfirmation | No | When the value is `exportedFiles`, the output dataset will contain a list of files that were exported. |
retriesPerFile | No | If you experience network failures, increase this number to allow the export job to retry uploads to Cloud Storage before failing the entire job. |
createTransactionFolders | No | When enabled, data will be written to a subfolder within the specified `directoryPath`. Each subfolder has a unique name based on the time the transaction was committed in Foundry. |
threads | No | Set the number of threads used to upload files in parallel. Increase the number to use more resources. Ensure that exports running on agents have enough resources on the agent to handle increased parallelization. |
incrementalType | No | For datasets that are built incrementally, set to `incremental` to only export transactions that occurred since the previous export. |
Example task configuration:
```yaml
type: export-google-cloud-storage
directoryPath: directory/to/export/to
excludePaths:
  - ^_.*
  - ^spark/_.*
uploadConfirmation: exportedFiles
incrementalType: incremental
retriesPerFile: 0
createTransactionFolders: true
threads: 0
```
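In this example, the `excludePaths` patterns keep Spark metadata files (names beginning with an underscore, such as `_SUCCESS`) from being exported alongside the data files.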
After you configure the export task, select **Save** in the upper right corner.