Google Cloud Storage

Connect Foundry to Google Cloud Storage to sync files between Foundry datasets and storage buckets.

Supported capabilities

| Capability | Status |
| --- | --- |
| Exploration | 🟢 Generally available |
| Bulk import | 🟢 Generally available |
| Incremental | 🟢 Generally available |
| Virtual tables | 🟢 Generally available |
| Export tasks | 🟡 Sunset |
| File exports | 🟢 Generally available |

Data model

The connector can transfer files of any type into Foundry datasets. File formats are preserved, and no schemas are applied during or after the transfer. Apply any necessary schema to the output dataset, or write a downstream transformation to access the data.
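
For example, if a sync lands CSV files in the output dataset, a downstream Python transform can parse the raw files and apply an explicit schema. The following is a minimal sketch; the dataset paths, file glob, and schema are hypothetical placeholders.

```python
from pyspark.sql import types as T
from transforms.api import transform, Input, Output

# Placeholder schema for the CSV files landed by the sync.
CSV_SCHEMA = T.StructType([
    T.StructField("id", T.StringType()),
    T.StructField("amount", T.DoubleType()),
    T.StructField("updated_at", T.TimestampType()),
])

@transform(
    parsed=Output("/Example/datasets/gcs_files_parsed"),  # hypothetical output path
    raw=Input("/Example/datasets/gcs_files_raw"),         # hypothetical synced dataset
)
def apply_schema(ctx, parsed, raw):
    fs = raw.filesystem()
    # Resolve the files written by the sync into paths Spark can read.
    paths = [fs.hadoop_path + "/" + f.path for f in fs.ls(glob="**/*.csv")]
    df = ctx.spark_session.read.csv(paths, header=True, schema=CSV_SCHEMA)
    parsed.write_dataframe(df)
```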

Performance and limitations

There is no limit to the size of transferable files. However, network issues can result in failures of large-scale transfers. In particular, direct cloud syncs that take more than two days to run will be interrupted. To reduce the risk of network issues, we recommend using smaller files and limiting the number of files ingested in each execution of the sync; syncs can then be scheduled to run frequently.

Setup

  1. Open the Data Connection app and select + New Source in the upper right corner of the screen.
  2. Select Google Cloud Storage from the available connector types.
  3. Choose to use a direct connection over the internet or to connect through an intermediary agent.
  4. Follow the additional configuration prompts to continue setting up your connector using the information in the sections below.

Learn more about setting up a connector in Foundry.

You must have a Google Cloud IAM service account ↗ to proceed with Google Cloud Storage authentication and setup.

Authentication

The following roles are required on the bucket being accessed:

  • Storage Object Viewer: Read data from the bucket.
  • Storage Object Creator: Export data to Google Cloud Storage.
  • Storage Object Admin: Delete files from Google Cloud Storage after importing them into Foundry.

Learn more about required roles in the Google Cloud documentation on access control ↗.
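
As a quick sanity check that the service account's roles grant the expected access, you can list a few objects in the bucket with the Google Cloud Storage client library (shown here with a key file; the project, bucket, and key file names are placeholders).

```python
from google.cloud import storage
from google.oauth2 import service_account

# Hypothetical values; substitute your own project, bucket, and key file.
credentials = service_account.Credentials.from_service_account_file("sa-key.json")
client = storage.Client(project="my-gcp-project", credentials=credentials)

# Storage Object Viewer is sufficient to list and read objects in the bucket.
for blob in client.list_blobs("my-foundry-bucket", max_results=5):
    print(blob.name, blob.size)
```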

Choose from one of the available authentication methods:

  • GCP instance account: Refer to the Google Cloud documentation ↗ for information on how to set up instance-based authentication.

    • Note that GCP instance authentication only works for connectors operating through agents that run on appropriately configured instances in GCP.
  • Service account key file: Refer to the Google Cloud documentation ↗ for information on how to set up service account key file authentication.

  • Workload Identity Federation (OIDC): Follow the displayed source system configuration instructions to set up OIDC. Refer to the Google Cloud documentation ↗ for details on Workload Identity Federation and our documentation for details on how OIDC works with Foundry.

Networking

The Google Cloud Storage connector requires network access to the following domains on port 443:

  • storage.googleapis.com
  • oauth2.googleapis.com
  • accounts.google.com
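
If you need to verify that these endpoints are reachable from your network (for example, behind an egress proxy), a simple TLS handshake against each domain on port 443 is enough. This is a standalone sketch run from the relevant network, not part of the connector configuration.

```python
import socket
import ssl

DOMAINS = [
    "storage.googleapis.com",
    "oauth2.googleapis.com",
    "accounts.google.com",
]

context = ssl.create_default_context()
for host in DOMAINS:
    # Open a TCP connection and complete a TLS handshake on port 443.
    with socket.create_connection((host, 443), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            print(f"{host}: reachable ({tls.version()})")
```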

Configuration options

The following configuration options are available for the Google Cloud Storage connector:

| Option | Required? | Description |
| --- | --- | --- |
| Project Id | Yes | The ID of the Google Cloud project containing the Cloud Storage bucket. |
| Bucket name | Yes | The name of the bucket to read data from and write data to. |
| Credentials settings | Yes | Configure using the Authentication guidance above. |
| Proxy settings | No | Enable to use a proxy while connecting to Google Cloud Storage. |

Sync data from Google Cloud Storage

The Google Cloud Storage connector uses the file-based sync interface. See documentation on configuring file-based syncs.

Virtual tables

This section provides additional details about using virtual tables from a Google Cloud Storage source. It does not apply when syncing to Foundry datasets.

| Virtual tables capability | Status |
| --- | --- |
| Source formats | 🟢 Generally available: Avro ↗, Delta ↗, Iceberg ↗, Parquet ↗ |
| Manual registration | 🟢 Generally available |
| Automatic registration | 🔴 Not available |
| Pushdown compute | 🔴 Not available |
| Incremental pipeline support | 🟢 Generally available for Delta tables: APPEND only (details)<br>🟢 Generally available for Iceberg tables: APPEND only (details)<br>🔴 Not available for Parquet tables |

When using virtual tables, remember the following source configuration requirements:

  • You must set up the source as a direct connection. Virtual tables do not support the use of intermediary agents.
  • Ensure that bi-directional connectivity and allowlisting are established as described in the Networking section of this documentation.
  • If using virtual tables in Code Repositories, refer to the Virtual Tables documentation for details of additional source configuration required.
  • When setting up the source credentials, you must use one of JSON Credentials, PKCS8 Auth, or Workload Identity Federation (OIDC). Other credential options are not supported when using virtual tables.

Delta

To enable incremental support for pipelines backed by virtual tables, ensure that Change Data Feed ↗ is enabled on the source Delta table. The current and added read modes in Python Transforms are supported. The _change_type, _commit_version and _commit_timestamp columns will be made available in Python Transforms.
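
As a rough sketch of what that looks like (dataset paths are hypothetical), an incremental Python transform can read only the rows appended since its last build using the added read mode, with the Change Data Feed metadata columns available as ordinary columns:

```python
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output, incremental

@incremental()
@transform(
    out=Output("/Example/datasets/delta_appends"),            # hypothetical output
    source=Input("/Example/virtual_tables/my_delta_table"),   # hypothetical virtual table
)
def process_appends(source, out):
    # "added" returns only the rows appended since the previous build.
    new_rows = source.dataframe("added")
    # Change Data Feed metadata columns are exposed alongside the table columns.
    inserts = new_rows.filter(F.col("_change_type") == "insert")
    out.write_dataframe(inserts)
```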

Iceberg

An Iceberg catalog is required to load virtual tables backed by an Apache Iceberg table. To learn more about Iceberg catalogs, see the Apache Iceberg documentation ↗. All Iceberg tables registered on a source must use the same Iceberg catalog.

Tables will be created using Iceberg metadata files in GCS. A warehousePath indicating the location of these metadata files must be provided when registering a table.

Incremental support relies on Iceberg Incremental Reads ↗ and is currently append-only. The current and added read modes in Python Transforms are supported.

Parquet

Virtual tables using Parquet rely on schema inference. At most 100 files will be used to determine the schema.

Export data to Google Cloud Storage

The connector can copy files from a Foundry dataset to any location in the Google Cloud Storage bucket.

To begin exporting data, you must configure an export task. Navigate to the Project folder that contains the Google Cloud Storage connector to which you want to export. Right select on the connector name, then select Create Data Connection Task.

In the left panel of the Data Connection view:

  1. Verify the Source name matches the connector you want to use.
  2. Add an Input named inputDataset. The input dataset is the Foundry dataset being exported.
  3. Add an Output named outputDataset. The output dataset is used to run, schedule, and monitor the task.
  4. Finally, add a YAML block in the text field to define the task configuration.

The labels for the connector and input dataset that appear in the left side panel do not reflect the names defined in the YAML.

Use the following options when creating the export task YAML:

| Option | Required? | Description |
| --- | --- | --- |
| directoryPath | Yes | The directory in Cloud Storage where files will be written. |
| excludePaths | No | A list of regular expressions; files with names matching these expressions will not be exported. |
| uploadConfirmation | No | When the value is exportedFiles, the output dataset will contain a list of the files that were exported. |
| retriesPerFile | No | If you experience network failures, increase this number to allow the export job to retry uploads to Cloud Storage before failing the entire job. |
| createTransactionFolders | No | When enabled, data is written to a subfolder within the specified directoryPath. Each subfolder has a unique name based on the time the transaction was committed in Foundry. |
| threads | No | The number of threads used to upload files in parallel. Increase the number to use more resources; ensure that exports running on agents have enough resources on the agent to handle the increased parallelization. |
| incrementalType | No | For datasets that are built incrementally, set to incremental to export only the transactions that occurred since the previous export. |

Example task configuration:

```yaml
type: export-google-cloud-storage
directoryPath: directory/to/export/to
excludePaths:
  - ^_.*
  - ^spark/_.*
uploadConfirmation: exportedFiles
incrementalType: incremental
retriesPerFile: 0
createTransactionFolders: true
threads: 0
```

After you configure the export task, select Save in the upper right corner.