After creating a file-based sync using exploration, you can update the configuration in the Configurations tab of the sync page.
Configuration options for file-based syncs include the following:
Parameter | Required? | Default | Description
---|---|---|---
Subfolder | Yes | | Specify the location of files within the connector that will be synced into Foundry.
Filters | No | | Apply filters to limit the files synced into Foundry.
Transformers | No | | Apply transformers to data before it is synced into Foundry.
Completion strategies | No | | Enable to delete files and/or empty parent directories after a successful sync. Requires write permission on the source filesystem.
Syncs will include all nested files and folders from the specified subfolder.
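As a purely illustrative sketch of what "all nested files and folders" means, the snippet below recursively lists every file under a hypothetical subfolder; the path is made up and this is not the Data Connection implementation.

```python
from pathlib import Path

# Hypothetical subfolder; every nested file below it would be in scope for the sync.
SUBFOLDER = Path("/mnt/source/exports")

def nested_files(subfolder: Path) -> list[Path]:
    """Recursively collect every file under the subfolder, including files in
    nested directories, mirroring what a file-based sync would pick up."""
    return sorted(p for p in subfolder.rglob("*") if p.is_file())

if __name__ == "__main__":
    for path in nested_files(SUBFOLDER):
        print(path)
```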
Filters allow you to restrict which source files are imported into Foundry; for example, you can exclude files that were already synced or limit the number of files imported in a single sync.
Transformers allow you to perform basic file transformations (compression or decryption, for example) before uploading to Foundry. During a sync, the files chosen for ingest will be modified per the chosen transformer.
Rather than using Data Connection transformers, we recommend performing data transformations in Foundry with Pipeline Builder and Code Repositories to benefit from provenance and branching.
Data Connection supports a number of built-in transformers for these basic operations.
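To illustrate the kind of basic transformation a transformer performs, here is a minimal sketch of a gzip decompression step applied to a file before it would be handed off for upload. The paths and function name are hypothetical, and this is not the Data Connection implementation.

```python
import gzip
import shutil
from pathlib import Path

def decompress_before_upload(source: Path, staging_dir: Path) -> Path:
    """Write a decompressed copy of a .gz file into a staging directory,
    analogous to a decompression transformer running before upload."""
    target = staging_dir / source.with_suffix("").name  # e.g. data.csv.gz -> data.csv
    with gzip.open(source, "rb") as compressed, open(target, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return target

# Hypothetical usage with made-up paths:
# staged = decompress_before_upload(Path("/tmp/incoming/data.csv.gz"), Path("/tmp/staging"))
```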
Completion strategies provide a method of deleting files and empty parent directories after a successful batch sync of those files into a Foundry dataset. This may be useful when data is synced by writing to an intermediate S3 bucket or other file storage system that Foundry reads from. If the data read by Foundry is already a short-lived copy, it is generally safe to delete it once it has been read and successfully written to Foundry.
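Conceptually, a completion strategy removes the synced files and then cleans up any parent directories left empty. The sketch below shows that behavior with hypothetical paths; it is an illustration only, not the Data Connection implementation.

```python
import os

def delete_with_empty_parents(file_path: str, stop_at: str) -> None:
    """Delete a synced file, then remove any parent directories that are now
    empty, walking upward but never past the sync's root folder."""
    os.remove(file_path)
    parent = os.path.dirname(file_path)
    while os.path.normpath(parent) != os.path.normpath(stop_at) and not os.listdir(parent):
        os.rmdir(parent)
        parent = os.path.dirname(parent)

# Hypothetical usage: delete a file and prune empty folders up to the subfolder root.
# delete_with_empty_parents("/mnt/source/exports/2024/jan/data.csv", "/mnt/source/exports")
```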
Completion strategies are subject to several important limitations and caveats. These limitations and potential mitigations or alternatives are described below.
Completion strategies are only supported when using an agent worker runtime. When using a direct connection or agent proxy runtime, we recommend implementing the functionality provided by completion strategies as a downstream external transform instead.
As an example, assume you have a direct connection to an S3 bucket containing the files `foo.txt` and `bar.txt`. You want to use a file batch sync to copy them to a dataset, and then delete the files from S3. The recommended way to achieve this does not use completion strategies; instead, run the batch sync first and delete the source files in a downstream step (such as an external transform) only after the data has been committed to Foundry, as sketched below.
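The sketch below shows only the deletion half of that downstream step, using boto3 against a hypothetical bucket name. It is not the Foundry external transforms API; it assumes the list of successfully synced keys is available (for example, from the file listing of the dataset produced by the batch sync) and that credentials for the bucket are configured.

```python
import boto3

def delete_synced_files(bucket: str, synced_keys: list[str]) -> None:
    """Delete objects from S3 only after their contents have been committed to
    the Foundry dataset; safe to retry, since delete_object on an absent key
    simply succeeds."""
    s3 = boto3.client("s3")
    for key in synced_keys:
        s3.delete_object(Bucket=bucket, Key=key)

# Hypothetical usage: keys come from the synced dataset's file listing,
# not from re-listing the bucket.
# delete_synced_files("my-intermediate-bucket", ["foo.txt", "bar.txt"])
```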
Note that this approach is retryable if any deletion calls fail, and guarantees that data is successfully committed to Foundry before attempting to perform any deletions. This approach is also compatible with incremental file batch syncs.
Completion strategies are best effort, meaning they do not guarantee that files will actually be removed from the source. The following are some situations that may cause completion strategies to fail:
In general, we recommend using an alternative to completion strategies wherever possible. Custom completion strategies are no longer supported.
This guide is recommended for users setting up a new sync or troubleshooting a slow or unreliable sync. If your sync is already working reliably, you do not need to take any action.
Syncing a large number of files into a single dataset can be challenging for many reasons.
Consider a sync intended to upload a million files. After crawling the source system and uploading all but one file, a network issue causes the entire sync to fail. All of the work done up to that point would be lost because syncs are transactional; if the sync fails, the entire transaction also fails.
Network issues are one of several common causes of sync failure, resulting in hours of lost work and unnecessary load on source systems and agents. Even without network issues or errors, syncing a large number of files can take a long time.
If the dataset grows over time, the time to sync the data as a `SNAPSHOT` increases, because `SNAPSHOT` transactions re-sync all of the source data into Foundry on every run. Instead, use syncs that are configured with transaction type `APPEND` to import your data incrementally. Since you will be syncing smaller, discrete chunks of data, you create an effective checkpoint: a sync failure results in a minimal amount of duplicated work rather than requiring a complete re-run. Additionally, your dataset syncs will run faster, as you no longer need to upload all of your data on every sync.
`APPEND` syncs

`APPEND` transactions require additional configuration to run successfully.
By default, files synced into Foundry are not filtered. However, `APPEND` syncs require filters to prevent the same files from being imported again on every run. We recommend using the `Exclude files already synced` and `Limit number of files` filters to control how many files are imported into Foundry in a single sync. Additionally, add a completion strategy to delete files once the sync has successfully completed. Finally, schedule your sync to remain up to date with your source system.
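Putting the pieces together, the sketch below imitates one scheduled run of an incremental `APPEND`-style sync: skip files that were already imported, cap the batch size, and record what was imported so the next run can exclude it. The state file, limit, and paths are hypothetical; this illustrates the pattern rather than the Data Connection implementation.

```python
import json
from pathlib import Path

SUBFOLDER = Path("/mnt/source/exports")      # hypothetical source subfolder
STATE_FILE = Path("/tmp/synced_files.json")  # hypothetical record of prior runs
LIMIT = 500                                  # hypothetical per-run file cap

def run_append_sync() -> list[Path]:
    """One incremental run: exclude already-synced files, limit the batch,
    and persist the new state so later runs skip these files."""
    already_synced = set(json.loads(STATE_FILE.read_text())) if STATE_FILE.exists() else set()

    batch = []
    for path in sorted(p for p in SUBFOLDER.rglob("*") if p.is_file()):
        if str(path) in already_synced:
            continue  # analogous to "Exclude files already synced"
        batch.append(path)
        if len(batch) >= LIMIT:
            break     # analogous to "Limit number of files"

    # ... upload `batch` as an APPEND transaction here ...

    already_synced.update(str(p) for p in batch)
    STATE_FILE.write_text(json.dumps(sorted(already_synced)))
    return batch
```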