Connect Foundry to Google BigQuery to read and sync data between BigQuery tables and Foundry datasets.
Capability | Status |
---|---|
Exploration | 🟢 Generally available |
Bulk import | 🟢 Generally available |
Incremental | 🟢 Generally available |
Virtual tables | 🟢 Generally available |
Export tasks | 🟡 Sunset |
Tables from BigQuery are imported into Foundry, and the data is saved in Avro format. Columns of type `BIGNUMERIC` and `TIME` are not supported at the time of import.
When exporting data from Foundry to BigQuery, all column types except for `MAPS`, `STRUCTS`, and `ARRAYS` are supported.
When syncing more than 20GB of data from a non-standard BigQuery table, temporary storage tables must be enabled. A single sync can import data up to the size available on the disk. For syncs run through direct connections, this is typically limited to 600GB. Use incremental syncs to import larger tables.
Learn more about setting up a connector in Foundry.
You must have a Google Cloud IAM service account ↗ to proceed with BigQuery authentication and setup.
The following Identity and Access Management (IAM) roles are required to use a BigQuery connector:
Read BigQuery data:
- BigQuery Read Session User: Granted on BigQuery Project
- BigQuery Data Viewer: Granted on BigQuery data to read data and metadata
- BigQuery Job User (optional): Granted to ingest views and run custom queries

Export data to BigQuery from Foundry:
- BigQuery Data Editor: Granted on BigQuery dataset or Project
- BigQuery Job User: Granted on BigQuery Project
- Storage Object Admin: Granted on bucket if data is exported with Google Cloud Storage

Use temporary tables:
- BigQuery Data Editor: Granted for BigQuery Project if dataset is automatically created by the connector
- BigQuery Data Editor: Granted for the provided dataset as a place to store temporary tables

Learn more about required roles in the Google Cloud documentation on access control ↗.
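As an illustration only, dataset-level roles such as BigQuery Data Viewer can be granted to the connector's service account with BigQuery's SQL GRANT statement. The project, dataset, and service account names below are placeholders; project-level roles (such as BigQuery Read Session User or BigQuery Job User) still need to be granted through IAM in the Google Cloud console or CLI.

```sql
-- Hypothetical example: grant read access on a single dataset to the
-- service account used by the Foundry connector.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `my-project.my_dataset`
TO "serviceAccount:foundry-connector@my-project.iam.gserviceaccount.com";
```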
Choose from one of the available authentication methods:
GCP instance account: Refer to the Google Cloud documentation ↗ for information on how to set up instance-based authentication.
Service account key file: Refer to the Google Cloud documentation ↗ for information on how to set up service account key file authentication. The key file can be provided as JSON or PKCS8 credentials.
Workload Identity Federation (OIDC): Follow the displayed source system configuration instructions to set up OIDC. Refer to the Google Cloud Documentation ↗ for details on Workload Identity Federation and our documentation for details on how OIDC works with Foundry.
The BigQuery connector requires network access to the following domains on port 443:
bigquery.googleapis.com
bigquerystorage.googleapis.com
storage.googleapis.com
www.googleapis.com
Additional access may be required to the following domains:
oauth2.googleapis.com
accounts.google.com
If you are establishing a direct connection between Foundry on Google Cloud Platform (GCP) and BigQuery on GCP, you must also enable the connection through the relevant VPC service controls. If this connection is required for your setup, contact Palantir Support for additional guidance.
The following configuration options are available for the BigQuery connector:
Option | Required? | Description |
---|---|---|
Project ID | Yes | The ID of the BigQuery Project; this Project will be charged for BigQuery compute usage regardless of what data is synced. |
Credentials settings | Yes | Configure using the Authentication guidance shown above. |
Cloud Storage bucket | No | Add a name of a Cloud Storage bucket to be used as a staging location for writing data to BigQuery. |
Proxy settings | No | Enable to allow a proxy connection to BigQuery. |
Settings for temporary tables | No* | Enable to use temporary tables. |
gRPC Settings | No | Advanced settings used to configure gRPC channels. |
Additional projects | No | Add the IDs of any additional Projects that must be accessed by the same connection; the Google Cloud account used as credentials for this connector will need to have access to these Projects. The connector Project ID will be charged for any BigQuery data access or compute usage. |
* Temporary tables must be enabled when registering BigQuery views ↗ via virtual tables.
To set up a BigQuery sync, select Explore and create syncs in the upper right of the source Overview screen. Next, select the tables you want to sync into Foundry. When you are ready to sync, select Create sync for x datasets.
Learn more about source exploration in Foundry.
When syncing more than 20GB of data from a non-standard BigQuery table, temporary tables must be enabled. Temporary tables allow BigQuery to import results from large query outputs, import data from views and other non-standard tables, and allow for large incremental imports and ingests.
To use temporary tables, enable Settings for temporary tables in the connector configuration, then choose one of the following options:
- Automatically create dataset: The connector creates a `Palantir_temporary_tables` dataset to store temporary tables. This option requires the BigQuery account to have the BigQuery Data Editor role on the Project.
- Provide the `Project ID` and `Dataset name` of the dataset you want to use to store temporary tables. This option requires the BigQuery account to have the BigQuery Data Editor role on the dataset provided.

Note that temporary tables must be enabled when using BigQuery views ↗ with virtual tables.
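If you provide your own dataset for temporary tables, it can be created ahead of time with standard BigQuery DDL. The following is only a sketch with placeholder project, dataset, and location values; the location should match the region of the data you plan to sync (see the troubleshooting notes at the end of this page).

```sql
-- Hypothetical example: create a dedicated dataset for the connector's
-- temporary tables in the same location as the data being synced.
CREATE SCHEMA IF NOT EXISTS `my-project.foundry_temporary_tables`
OPTIONS (
  location = 'US',
  description = 'Staging dataset for Foundry BigQuery connector temporary tables'
);
```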
The BigQuery connector allows for advanced sync configurations for large data syncs and custom queries.
After exploring your available syncs and adding them to your connector, navigate to Edit syncs. From the Syncs panel to the left, find the sync you want to configure and select > to the right.
Choose what data will be synced from BigQuery into Foundry.
Enter the following information to sync entire tables to Foundry:
Option | Required? | Description |
---|---|---|
BigQuery Project ID | No | The ID of the Project to which the table belongs. |
BigQuery dataset | Yes | The name of the dataset to which the table belongs. |
BigQuery table | No | The name of the table being synced into Foundry. |
Any arbitrary query can be run, and the results will be saved in Foundry. The query output must be smaller than 20GB (the maximum BigQuery table size), or temporary table usage must be enabled. Queries must start with the keyword `select` or `with`. For example: `SELECT * from table_name limit 100;`.
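For instance, a custom query sync could aggregate data before it lands in Foundry. The following is only a sketch; the project, dataset, table, and column names are invented for illustration.

```sql
WITH recent_orders AS (
  -- Hypothetical source table; restrict to the last 30 days of data.
  SELECT order_id, customer_id, order_total
  FROM `my-project.sales.orders`
  WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
)
-- Aggregate per customer; the result of this query becomes the Foundry dataset.
SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(order_total) AS total_spend
FROM recent_orders
GROUP BY customer_id;
```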
Typically, syncs will import all matching rows from the target table, regardless of whether data changed between syncs. Incremental syncs, by contrast, maintain state about the most recent sync and only ingest new matching rows from the target.
Incremental syncs can be used when ingesting large tables from BigQuery. To use incremental syncs, the table must contain a column that is strictly monotonically increasing. Additionally, the table or query being read from must contain a column with one of the following data types:
INT64
FLOAT64
NUMERIC
BIGNUMERIC
STRING
TIMESTAMP
DATE
TIME
DATETIME
Example: A 5 TB table in BigQuery contains billions of rows that we want to sync into Foundry. The table has a monotonically increasing column called `id`. The sync can be configured to ingest 50 million rows at a time using the `id` column as the incremental column, with an initial value of -1 and a configured limit of 50 million rows.
When the sync is initially run, the first 50 million rows (ascending based on `id`) containing an `id` value greater than -1 will be ingested into Foundry. If this sync is run several times and the largest `id` ingested during the most recent run was `19384004822`, the next sync will ingest the next 50 million rows, starting with the first `id` greater than `19384004822`, and so on.
Incremental syncs require the following configurations:
Option | Required? | Description |
---|---|---|
Column | Yes | Select the column that will be used for incremental ingests. The dropdown will be empty if the table does not contain any supported column types. |
Initial value | Yes | The value from which to start syncing data. |
Limit | No | The number of records to download in a single sync. |
To enable incremental queries for custom query syncs, the query must be updated to match the following format:
SELECT * from table_name where incremental_column_name > @value order by incremental_column_name asc limit @limit
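Using the worked example above, a concrete incremental custom query might look like the following. The project, dataset, and table names are placeholders; `@value` and `@limit` are filled in from the incremental configuration (the largest previously ingested `id`, initially -1, and the configured row limit).

```sql
SELECT *
FROM `my-project.my_dataset.large_table`  -- hypothetical table with the monotonically increasing id column
WHERE id > @value                         -- @value: largest id ingested by the previous run (initially -1)
ORDER BY id ASC
LIMIT @limit                              -- @limit: configured row limit, for example 50 million
```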
The connector can export to BigQuery in two ways: by writing through the BigQuery APIs, or by staging files in a Google Cloud Storage bucket.
Writing through APIs is suitable for a data scale of several million rows. The expected performance for this mode is to export one million rows in approximately two minutes. If your data reaches billions of rows, use Google Cloud Storage instead.
To begin exporting data, you must configure an export task. Navigate to the Project folder that contains the connector to which you want to export. Right select on the connector name, then select Create Data Connection Task.
In the left side panel of the Data Connection view, verify that the `Source` name matches the connector you want to use. Then, add an input dataset to the `Input` field. The input dataset must be called `inputDataset`, and it is the Foundry dataset being exported. The export task also requires one `Output` to be configured. The output dataset must be called `outputDataset` and point to a Foundry dataset; it is used to run, schedule, and monitor the task.
Note that the labels for the connector and input dataset that appear in the left side panel do not reflect the names defined in the YAML.
Use the following options when creating the export task YAML:
Option | Required? | Default | Description |
---|---|---|---|
project | No | The Project ID of the connector | The ID of the Project that the destination table belongs to. |
dataset | Yes | | The name of the dataset to which the table belongs. |
table | Yes | | The name of the table to which the data will be exported. |
incrementalType | Yes | SNAPSHOT | The value can either be SNAPSHOT or REQUIRE_INCREMENTAL. • Export in snapshot mode: Contents of the destination table will be replaced. • Export in incremental mode: Contents will be appended to the existing table. |
To set up incremental exports, run the first export as `SNAPSHOT`, then change `incrementalType` to `REQUIRE_INCREMENTAL`.
Example task configuration:
```yaml
type: magritte-bigquery-export-task
config:
  table:
    dataset: datasetInBigQuery
    table: tableInBigQuery
    project: projectId # Optional: Do not specify unless the Project for export differs from the Project configured for the connector.
  incrementalType: SNAPSHOT | REQUIRE_INCREMENTAL
```
Only datasets containing rows stored in Parquet format are supported for export to BigQuery.
After you configure the export task, select Save in the upper right corner.
If you need to append rows to the destination table rather than replacing the dataset, you can use `REQUIRE_INCREMENTAL`.
Incremental exports require that rows are only appended to the input dataset and that the destination table in BigQuery matches the schema of the input dataset in Foundry.
To export via Cloud Storage, a Cloud Storage bucket must be configured in the connector configuration settings. Additionally, the Cloud Storage bucket must only be used as a staging location for this connector so that any data temporarily written to the bucket is accessible to as few users as possible.
We recommend exporting to BigQuery via Cloud Storage rather than BigQuery APIs; Cloud Storage operates better at scale and does not create temporary tables in BigQuery.
The export job creates a temporary table alongside the destination table; this temporary table will not have extra access restrictions applied. Additionally, the `SNAPSHOT` export drops and recreates the table, meaning the extra access restrictions will also be dropped.
During an export via BigQuery APIs, data is exported to the temporary table `datasetName.tableName_temp_$timestamp`. Once the export is complete, rows are automatically transferred from the temporary table to the destination table.
Hive table partitions are not supported for export through BigQuery APIs. For datasets partitioned with Hive tables, export through Cloud Storage instead.
You can share the destination table if the export is run in `REQUIRE_INCREMENTAL` mode. Running in `SNAPSHOT` mode recreates the table on each run, and the sharing would need to be reapplied.
To successfully export via BigQuery APIs, do not apply BigQuery row-level or column-level permissions to the exported table.
This section provides additional details around using virtual tables with a BigQuery source. This section is not applicable when syncing to Foundry datasets.
Virtual tables capability | Status |
---|---|
Source formats | 🟢 Generally available: tables, views, and materialized views |
Manual registration | 🟢 Generally available |
Automatic registration | 🟢 Generally available |
Pushdown compute | 🟢 Generally available; available via the BigQuery Spark connector ↗ |
Incremental pipeline support | 🟢 Generally available: APPEND only [1] |
It is critical to ensure that incremental pipelines backed by virtual tables are built on `APPEND`-only source tables, as BigQuery does not provide `UPDATE` or `DELETE` change information. See the official BigQuery documentation ↗ for more details.
When using virtual tables, remember the source configuration requirements described in the Connection details and Temporary tables sections above.
[1] To enable incremental support for pipelines backed by BigQuery virtual tables, ensure that Time Travel ↗ is enabled with the appropriate retention period. This functionality relies on Change History ↗ and is currently append-only. The `current` and `added` read modes in Python Transforms are supported. The `_CHANGE_TYPE` and `_CHANGE_TIMESTAMP` columns will be made available in Python Transforms.
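As a hedged sketch of what the prerequisite BigQuery-side configuration can look like (the exact requirements are defined in the linked Google Cloud and Foundry documentation, and the project, dataset, and table names below are placeholders):

```sql
-- Extend the dataset's time travel window to the maximum of 7 days (168 hours).
ALTER SCHEMA `my-project.my_dataset`
SET OPTIONS (max_time_travel_hours = 168);

-- Make sure change history is enabled on the source table so that appended
-- rows can be read incrementally.
ALTER TABLE `my-project.my_dataset.events`
SET OPTIONS (enable_change_history = TRUE);
```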
`<dataset>` was not found in location `<location>`
BigQuery determines the location where a given query will be run based on either the inputs used in the query or where the results of the query are being stored. When temporary tables are used, the output is set to a temporary table in the temporary tables dataset; the location of this dataset will determine where the query is run. Ensure that all inputs of the sync and the temporary tables dataset are in the same region. If the Automatically create dataset setting is enabled, use the Google Cloud console or Google's SDKs/APIs to determine the location of the dataset called `Palantir_temporary_tables`.
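If you prefer SQL over the console, the dataset location can also be read from `INFORMATION_SCHEMA.SCHEMATA`. This is only a sketch with a placeholder project and region qualifier; `INFORMATION_SCHEMA` views are regional, so you may need to repeat the query against each region you use.

```sql
-- Look up the location of the automatically created temporary tables dataset.
-- `region-us` is a placeholder region qualifier.
SELECT schema_name, location
FROM `my-project`.`region-us`.INFORMATION_SCHEMA.SCHEMATA
WHERE schema_name = 'Palantir_temporary_tables';
```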
If syncing data with large content such as JSON columns, the transfer may fail with a message size error. Adjust BigQuery's `Maximum inbound message size` in the gRPC Settings on the source to increase the amount of data transferred in a single API call. Remember, a single call fetches multiple rows, so setting it to the largest row size may not be enough.
You can find this configuration option by navigating to Connection Settings > Connection Details, then scrolling to More options and selecting gRPC Settings.