File-based syncs

After creating a file-based sync using exploration, you can update the configuration in the Configurations tab of the sync page.

Configure file-based syncs

Configuration options for file-based syncs include the following:

Parameter | Required? | Default | Description
Subfolder | Yes | | Specify the location of files within the connector that will be synced into Foundry.
Filters | No | | Apply filters to limit the files synced into Foundry.
Transformers | No | | Apply transformers to data before it is synced into Foundry.
Completion strategies | No | | Enable to delete files and/or empty parent directories after a successful sync. Requires write permission on the source filesystem.

Syncs will include all nested files and folders from the specified subfolder.

Filters

Filters allow you to limit which source files are imported into Foundry. The supported filter types are:

  • Exclude files already synced: Only sync files that are new, or whose size or modification date has changed, since the last sync.
  • Path matches: Only sync files with a path (relative to the root of the connector) that matches the regular expression.
  • Path does not match: Only sync files with a path (relative to the root of the connector) that does not match the regular expression.
  • Last modified after: Only sync files that have been modified after a specified date and time.
  • File size is between: Only sync files with a size between the specified minimum and maximum byte value.
  • Any file has path matching: If any file has a relative path matching the regular expression, sync all files in the subfolder that are not otherwise filtered.
  • At least N files: Sync all filtered files only if there are at least N files remaining.
  • Limit number of files: Limit the number of files to keep per transaction. This option can increase the reliability of incremental syncs.
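
For example, a Path matches filter evaluates a regular expression against each file's path relative to the connector root. The sketch below mimics that behavior locally; the paths and pattern are illustrative, and the exact matching semantics (for example, full match versus partial match) depend on your connector configuration.

```python
import re

# Illustrative relative paths, as a "Path matches" filter would see them
# (relative to the root of the connector).
relative_paths = [
    "incoming/2024-01-01/orders.csv",
    "incoming/2024-01-01/orders.csv.tmp",
    "archive/old_orders.csv",
]

# Hypothetical pattern: only CSV files under incoming/
pattern = re.compile(r"incoming/.*\.csv")

matched = [p for p in relative_paths if pattern.fullmatch(p)]
print(matched)  # ['incoming/2024-01-01/orders.csv']
```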

Transformers

Transformers allow you to perform basic file transformations (compression or decryption, for example) before uploading to Foundry. During a sync, the files selected for ingest are modified according to the configured transformer.

Rather than using Data Connection transformers, we recommend performing data transformations in Foundry with Pipeline Builder and Code Repositories to benefit from provenance and branching.

The following transformers are supported in Data Connection:

  • Compress with Gzip
  • Concatenate multiple files
    • Join multiple files into a single file.
  • Rename files
    • Replace all occurrences of a given filename substring with a new substring.
    • Drop the directory path from the filename by replacing ^(.*/) with /.
  • Decrypt with PGP
    • Decrypt files that have been encrypted with PGP encryption.
    • Requires that the agent system has PGP keys configured.
    • Unavailable for syncs running on direct connections.
  • Append timestamp to filenames
    • Add a timestamp in a custom format to the filename of each file ingested.
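
As an illustration of the Rename files transformer described above, the following sketch uses Python's re.sub to show the effect of replacing ^(.*/) with /. The filenames are hypothetical, and the transformer itself is configured in Data Connection rather than written as code.

```python
import re

# Hypothetical filenames as they might appear during ingest.
filenames = [
    "reports/2024/summary.csv",
    "reports/2024/details.csv",
]

# Replacing ^(.*/) with / drops the directory path, leaving only a leading
# slash and the base filename, as described above.
renamed = [re.sub(r"^(.*/)", "/", name) for name in filenames]
print(renamed)  # ['/summary.csv', '/details.csv']
```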

Completion strategies

Completion strategies provide a method of deleting files and empty parent directories after a successful batch sync of those files into a Foundry dataset. This may be useful when data is synced by writing to an intermediate S3 bucket or other file storage system that Foundry reads from. If the data read by Foundry is already a short-lived copy, it is generally safe to delete once the data has been read and successfully written to Foundry.

Limitations of completion strategies and alternatives

Completion strategies are subject to several important limitations and caveats. These limitations and potential mitigations or alternatives are described below.

Completion strategy support

Completion strategies are only supported when using an agent worker runtime. When using a direct connection or agent proxy runtime, we recommend implementing the functionality provided by completion strategies as a downstream external transform instead.

As an example, assume you have a direct connection to an S3 bucket containing the files foo.txt and bar.txt. You want to use a file batch sync to copy them into a dataset and then delete the files from S3. The recommended approach does not use completion strategies; instead, do the following:

  • Configure a batch sync without any completion strategies and schedule it to run.
  • Write a downstream external transform job which is scheduled to run when the sync output dataset is updated, taking the synced data as an input.
  • In that external transform, write Python transforms code that iterates through the files that have appeared in the synced dataset and makes calls to S3 to delete those files from the bucket, as sketched below.
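
A minimal sketch of that external transform is shown below, assuming a transforms-python repository with egress to S3 configured and boto3 available. The dataset paths, bucket name, and credential handling are placeholders; in practice, credentials should come from your external transforms source configuration rather than the environment defaults used here.

```python
import boto3  # assumes boto3 is available and egress to S3 is configured
from pyspark.sql.types import StringType, StructField, StructType
from transforms.api import Input, Output, transform

SOURCE_BUCKET = "my-intermediate-bucket"  # hypothetical bucket name


@transform(
    synced_files=Input("/Project/datasets/synced_files"),   # output of the batch sync (hypothetical path)
    deletion_log=Output("/Project/datasets/deletion_log"),  # audit trail of deleted keys (hypothetical path)
)
def delete_synced_files(ctx, synced_files, deletion_log):
    # Credentials are assumed to be supplied by your egress/source configuration;
    # the exact mechanism depends on your setup.
    s3 = boto3.client("s3")

    deleted = []
    for file_status in synced_files.filesystem().ls():
        # Assumes each file's path within the dataset mirrors its key in the bucket.
        key = file_status.path
        s3.delete_object(Bucket=SOURCE_BUCKET, Key=key)
        deleted.append((key,))

    schema = StructType([StructField("deleted_key", StringType())])
    deletion_log.write_dataframe(ctx.spark_session.createDataFrame(deleted, schema))
```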

Note that this approach is retryable if any deletion calls fail, and guarantees that data is successfully committed to Foundry before attempting to perform any deletions. This approach is also compatible with incremental file batch syncs.

Completion strategies are best effort

Completion strategies are best effort, meaning that they do not guarantee that source files will actually be removed. The following are some situations that may cause completion strategies to fail:

  1. Completion strategies will not be retried if the agent worker runtime crashes or is restarted after the batch sync commits data to Foundry, but before the completion strategies run.
  2. If the credentials used to connect do not have write permissions, the batch sync may successfully read data and commit to Foundry, but fail to perform the deletions specified by the completion strategy.

In general, we recommend using an alternative to completion strategies wherever possible. Custom completion strategies are no longer supported.

Optimize file-based syncs

Warning

This guide is recommended for users setting up a new sync or troubleshooting a slow or unreliable sync. If your sync is already working reliably, you do not need to take any action.

Syncing a large number of files into a single dataset can be challenging for many reasons.

Consider a sync intended to upload a million files. After crawling the source system and uploading all but one file, a network issue causes the entire sync to fail. All of the work done up to that point would be lost because syncs are transactional; if the sync fails, the entire transaction also fails.

Network issues are one of several common causes of sync failure, resulting in hours of lost work and unnecessary load on source systems and agents. Even without network issues or errors, syncing a large number of files can take a long time.

If the dataset grows over time, the time to sync the data as a SNAPSHOT increases. This is because SNAPSHOT transactions sync all of the data from the dataset into Foundry. Instead, use syncs that are configured with transaction type APPEND to import your data incrementally. Since you will be syncing smaller, discrete chunks of data, you will create an effective checkpoint; a sync failure will result in a minimal amount of duplicated work rather than requiring a complete re-run. Additionally, your dataset syncs will run faster as you no longer need to upload all of your data for every sync.
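
As a rough conceptual model (not Foundry API code), the difference between the two transaction types can be sketched as follows:

```python
# Conceptual model only: a SNAPSHOT sync re-uploads everything each run,
# while an APPEND sync uploads only the files not yet synced.
already_synced = {"day1/a.csv", "day1/b.csv"}
on_source = {"day1/a.csv", "day1/b.csv", "day2/c.csv"}

snapshot_upload = on_source                 # all 3 files, every run
append_upload = on_source - already_synced  # only {'day2/c.csv'}

print(len(snapshot_upload), len(append_upload))  # 3 1
```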

Configure incremental APPEND syncs

APPEND transactions require additional configuration to run successfully.

By default, files synced into Foundry are not filtered. However, APPEND syncs require filters to prevent the same files from being imported on every run. We recommend using the Exclude files already synced and Limit number of files filters to control how many files get imported into Foundry in a single sync. Additionally, add a completion strategy to delete files once the sync has successfully completed. Finally, schedule your sync to remain up to date with your source system.