Connect Foundry to the Hadoop Distributed File System (HDFS) to read and sync data from HDFS to Foundry datasets.
Capability | Status |
---|---|
Exploration | 🟢 Generally available |
Bulk import | 🟢 Generally available |
The connector can transfer files of any type into Foundry datasets. File formats are preserved, and no schemas are applied during or after the transfer. Apply any necessary schema to the output dataset, or write a downstream transformation to access the data.
There is no limit on the size of files that can be transferred. However, network issues can cause large-scale transfers to fail; in particular, direct cloud syncs that take more than two days to run will be interrupted. To avoid these issues, we recommend using smaller files and limiting the number of files ingested in each execution of the sync; syncs can be scheduled to run frequently to compensate.
Generally, agent-based runtimes are required to connect to HDFS sources unless the cluster is accessible over the internet.
Learn more about setting up a connector in Foundry.
We recommend using the HDFS scheme ↗ where available, as its native RPC transport offers better performance. Alternatively, WebHDFS ↗ is an HTTP REST API that supports the complete FileSystem interface for HDFS. Example URLs for both schemes are shown below.
The required network ports will differ based on the selected scheme. For the HDFS scheme, these ports are typically 8020/9000 on the NameNode server and 1019, 50010, and 50020 on the DataNode. For the WebHDFS scheme, the required port is typically 9820.
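As a rough illustration of the two schemes, the sketch below builds a Hadoop `FileSystem` client from an `hdfs://` URL and from a `webhdfs://` URL. The host name and the `/data` directory are hypothetical placeholders, and the ports are the typical defaults mentioned above; in Data Connection you enter the corresponding URL in the connector configuration rather than writing this code.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSchemeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Native HDFS RPC scheme: hypothetical NameNode host, typical RPC port 8020.
        FileSystem rpcFs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);

        // WebHDFS scheme: same hypothetical host, typical WebHDFS port 9820.
        FileSystem webFs = FileSystem.get(URI.create("webhdfs://namenode.example.com:9820"), conf);

        // Both schemes expose the same FileSystem interface, for example listing
        // a hypothetical root data directory.
        for (FileStatus status : rpcFs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath());
        }

        webFs.close();
        rpcFs.close();
    }
}
```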
SSL connections validate server certificates. Normally, SSL validation happens through a certificate chain; by default, both agent and direct connection runtimes trust most industry-standard certificate chains. If the server to which you are connecting uses a self-signed certificate, or if TLS interception occurs on the connection, the connector must be configured to trust that certificate. Learn more about using certificates in Data Connection.
The following configuration options are available for the HDFS connector:
Option | Required? | Description |
---|---|---|
URL | Yes | The HDFS URL of the root data directory. |
Extra properties | No | A properties map that is passed to the Hadoop Configuration ↗. Each entry is a name/value pair corresponding to a single property, avoiding the need to specify the configuration on disk via `configurationResources`. |
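To make the Extra properties option more concrete, each entry behaves like a single `set` call on the Hadoop Configuration, as in the sketch below. The property names shown are standard Hadoop client settings used purely as examples; the properties your cluster actually needs will differ.

```java
import org.apache.hadoop.conf.Configuration;

public class ExtraPropertiesExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Each "Extra properties" entry is a name/value pair applied to the Hadoop
        // Configuration, equivalent to one set() call below (example properties only).
        conf.set("dfs.client.use.datanode.hostname", "true"); // connect to DataNodes by hostname
        conf.set("ipc.client.connect.timeout", "20000");      // RPC connect timeout in milliseconds

        System.out.println(conf.get("dfs.client.use.datanode.hostname"));
    }
}
```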
The following advanced options are available for the HDFS connector:
Option | Required? | Description |
---|---|---|
User | No | The HDFS user (defaults to the currently logged-in user for agent runtimes). The user parameter overrides Data Connection's global Kerberos settings; leave it blank if you are using Kerberos. |
File change timeout | No | The amount of time (as an ISO-8601 ↗ duration) a file must remain unchanged before it is considered for upload. If possible, use the more efficient `lastModifiedBefore` processor instead. |
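For reference, the File change timeout value is an ISO-8601 duration string. The sketch below uses hypothetical values and shows how such strings are interpreted with `java.time`:

```java
import java.time.Duration;

public class FileChangeTimeoutExample {
    public static void main(String[] args) {
        // Hypothetical ISO-8601 duration values for the "File change timeout" option.
        Duration tenMinutes = Duration.parse("PT10M"); // file must be unchanged for 10 minutes
        Duration oneDay = Duration.parse("P1D");       // or for a full day

        System.out.println(tenMinutes.toMinutes() + " minutes / " + oneDay.toHours() + " hours");
    }
}
```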
Visit the Explore tab to interactively explore the data available in the configured HDFS instance. Select New Sync to regularly pull data from HDFS into a specified dataset in Foundry.