HDFS

Connect Foundry to the Hadoop Distributed File System (HDFS) to read and sync data from HDFS to Foundry datasets.

Supported capabilities

| Capability | Status |
| --- | --- |
| Exploration | 🟢 Generally available |
| Bulk import | 🟢 Generally available |

Data model

The connector can transfer files of any type into Foundry datasets. File formats are preserved, and no schemas are applied during or after the transfer. Apply any necessary schema to the output dataset, or write a downstream transformation to access the data.
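
For example, a downstream transformation can parse the raw synced files and write a schema-ed output. The sketch below is a minimal illustration using Foundry's Python transforms API; the dataset paths and the CSV file name are hypothetical, and the parsing logic will depend on your file format.

```python
import pandas as pd
from transforms.api import transform, Input, Output


@transform(
    parsed=Output("/Project/datasets/hdfs_parsed"),  # hypothetical output path
    raw=Input("/Project/datasets/hdfs_raw_sync"),    # hypothetical synced dataset
)
def parse_raw_files(ctx, parsed, raw):
    # The synced dataset has no schema, so read its files directly.
    fs = raw.filesystem()
    with fs.open("exports/data.csv") as f:  # hypothetical file within the dataset
        pdf = pd.read_csv(f)
    # Writing a Spark DataFrame applies a schema to the output dataset.
    parsed.write_dataframe(ctx.spark_session.createDataFrame(pdf))
```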

Performance and limitations

There is no limit on the size of transferable files. However, network issues can cause large-scale transfers to fail; in particular, direct cloud syncs that take more than two days to run will be interrupted. To avoid network issues, we recommend using smaller files and limiting the number of files ingested in each execution of the sync. Syncs can then be scheduled to run frequently.

Generally, agent-based runtimes are required to connect to HDFS sources unless the cluster is accessible over the internet.

Setup

  1. Open the Data Connection application and select + New Source in the upper right corner of the screen.
  2. Select HDFS from the available connector types.
  3. Choose to use a direct connection over the internet or to connect through an intermediary agent.
  4. Follow the additional configuration prompts to continue the setup of your connector using the information in the sections below.

Learn more about setting up a connector in Foundry.

Networking

We recommend using the HDFS scheme ↗ where available, as it offers faster RPC performance. Alternatively, WebHDFS ↗ is an HTTP REST API that supports the complete FileSystem interface for HDFS. Example URLs include:

  • hdfs://myhost.example.com:1234/path/to/root/directory
  • webhdfs://example.com/path
  • swebhdfs://example.com/path

The required network ports will differ based on the selected scheme. For the HDFS scheme, these ports are typically 8020/9000 on the NameNode server and 1019, 50010, and 50020 on the DataNode. For the WebHDFS scheme, the required port is typically 9820.
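
Because WebHDFS is a plain HTTP REST API, you can verify network reachability from the runtime host before configuring the connector. The sketch below is a minimal connectivity check, assuming Python with the requests library; the host, port, and path are placeholders for your cluster.

```python
import requests

# Placeholders: substitute your NameNode host, the port serving WebHDFS,
# and a directory the connecting user is allowed to read.
NAMENODE = "myhost.example.com"
PORT = 9820
PATH = "/path/to/root/directory"

# LISTSTATUS is a standard WebHDFS operation that lists a directory.
resp = requests.get(
    f"http://{NAMENODE}:{PORT}/webhdfs/v1{PATH}",
    params={"op": "LISTSTATUS"},
    timeout=10,
)
resp.raise_for_status()
for status in resp.json()["FileStatuses"]["FileStatus"]:
    print(status["type"], status["pathSuffix"])
```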

Certificates and private keys

SSL connections validate server certificates. Normally, SSL validation happens through a certificate chain; by default, both agent and direct connection runtimes trust most industry-standard certificate chains. If the server you are connecting to presents a self-signed certificate, or if TLS interception occurs between the connector and the server, the connector must be configured to trust that certificate. Learn more about using certificates in Data Connection.

Configuration options

The following configuration options are available for the HDFS connector:

| Option | Required? | Description |
| --- | --- | --- |
| URL | Yes | The HDFS URL to the root data directory. |
| Extra properties | No | A map of properties passed to the Hadoop Configuration ↗. Each entry is a name/value pair corresponding to a single property, avoiding the need to specify the configuration on disk via configurationResources. |
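
For example, the following hypothetical entries could be supplied as extra properties instead of shipping an hdfs-site.xml to the runtime (the property names are standard Hadoop configuration keys; the values shown are illustrative):

  • dfs.client.use.datanode.hostname: true
  • dfs.nameservices: mycluster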

Advanced options

The following advanced options are available for the HDFS connector:

| Option | Required? | Description |
| --- | --- | --- |
| User | No | The HDFS user (defaults to the currently logged-in user for agent runtimes). This parameter overrides Data Connection's global Kerberos settings; leave it blank if you are using Kerberos. |
| File change timeout | No | The amount of time (an ISO-8601 ↗ duration, such as PT5M for five minutes) a file must remain unchanged before it is considered for upload. Where possible, use the more efficient lastModifiedBefore processor instead. |

Sync data from HDFS

Visit the Explore tab to interactively explore data available in the configured HDFS instance. Select New Sync to regularly pull data from HDFS to a specified dataset in Foundry.