R Filesystem API

R TransformInput object

The interface for low-level operations on a Foundry dataset.

spark.df()

data.frame()

fileSystem()

  • Returns a FileSystem object for direct FoundryFS access.
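
As a rough sketch of how these accessors might be called, the helper below takes a TransformInput and reports basic facts about the dataset. It assumes that data.frame() returns the dataset as an ordinary R data.frame, which this page does not state. The helper name describe_input and the argument name input are hypothetical.

```r
# Hypothetical helper: `input` is assumed to be an R TransformInput object.
describe_input <- function(input) {
  # Assumption: data.frame() materialises the dataset as an R data.frame.
  rdf <- input$data.frame()

  list(
    n_rows  = nrow(rdf),
    columns = names(rdf)
  )
}
```

For large datasets, collecting everything into an R data.frame may not be practical; spark.df() or fileSystem() would be the alternatives to explore.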

R TransformOutput object

The interface for low-level write operations on a Foundry dataset.

write.spark.df(df, partition_cols=NULL, bucket_cols=NULL, bucket_count=NULL, sort_by=NULL)

  • Write the given DataFrame to the output dataset.

    Parameters
    • df (pyspark.sql.DataFrame) – The PySpark dataframe to write.
    • partition_cols (List[str], optional) – The columns by which to partition the data when writing.
    • bucket_cols (List[str], optional) – The columns by which to bucket the data. Must be specified if bucket_count is given.
    • bucket_count (int, optional) – The number of buckets. Must be specified if bucket_cols is given.
    • sort_by (List[str], optional) – The columns by which to sort the bucketed data.
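
The pairing rule between bucket_cols and bucket_count can be checked before the write is attempted. The wrapper below is a minimal sketch: write_bucketed is a hypothetical name, output is assumed to be an R TransformOutput, and df is assumed to be a Spark DataFrame. Passing column lists as R character vectors is also an assumption, since this page only gives the type as List[str].

```r
# Hypothetical wrapper around write.spark.df().
write_bucketed <- function(output, df, bucket_cols = NULL, bucket_count = NULL,
                           sort_by = NULL) {
  # bucket_cols and bucket_count must be given together, per the parameter notes.
  if (is.null(bucket_cols) != is.null(bucket_count)) {
    stop("bucket_cols and bucket_count must be specified together")
  }

  output$write.spark.df(
    df,
    bucket_cols  = bucket_cols,
    bucket_count = bucket_count,
    sort_by      = sort_by
  )
}
```

A call might look like write_bucketed(output, df, bucket_cols = c("customer_id"), bucket_count = 8L), where the column name is purely illustrative.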

write.data.frame(rdf)

fileSystem()

  • Returns a FileSystem object for direct FoundryFS access.

R FileSystem object

ls(glob=NULL, regex='.*', show_hidden=FALSE)

  • Lists all files matching the given pattern (either glob or regex), with respect to the root directory of the dataset.

    Parameters
    • glob (str, optional) – A Unix file matching pattern. Also supports globstar.
    • regex (str, optional) – A regex pattern against which to match filenames.
    • show_hidden (bool, optional) – Include hidden files, those prefixed with ‘.’ or ‘_’.
    Returns
    • R array of FileStatus named tuples (path, size, modified) – The logical path, the file size in bytes, and the modified timestamp in milliseconds since January 1, 1970 UTC.
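
A listing can be flattened into a data.frame for easier inspection. The sketch below assumes that each returned FileStatus exposes its fields by name through `$`, which this page does not confirm. The helper name list_csv_files is hypothetical, and fs is assumed to be an R FileSystem object.

```r
# Hypothetical helper: list the CSV files in a dataset with their sizes.
list_csv_files <- function(fs) {
  statuses <- fs$ls(glob = "*.csv")

  # Assumption: each FileStatus exposes `path` and `size` via `$`.
  data.frame(
    path = vapply(statuses, function(s) s$path, character(1)),
    size = vapply(statuses, function(s) as.numeric(s$size), numeric(1)),
    stringsAsFactors = FALSE
  )
}
```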

open(path, open='r', disk_optimal=FALSE, encoding=default)

  • Open a FoundryFS file in the given mode.

    Parameters
    • path (str) – The logical path of the file in the dataset. (Remote path)
    • open (str) – A description of the mode in which to open the connection.
    • disk_optimal (bool, optional) – Controls how FoundryFileSystem handles file I/O.
    • encoding (str, optional) – Defaults to the R language default (UTF-8).
    Returns
    • An R connection object
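
Because open() returns an R connection, it can be handed to base R readers that accept connections. The following is a minimal sketch with a hypothetical helper name (read_csv_from_dataset). It assumes fs is an R FileSystem object and that the file at path is a CSV with a header row.

```r
# Hypothetical helper: read a CSV file from a dataset through a connection.
read_csv_from_dataset <- function(fs, path) {
  conn <- fs$open(path, "r")
  # Close the connection when the function exits, even on error.
  on.exit(close(conn), add = TRUE)

  read.csv(conn, stringsAsFactors = FALSE)
}
```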

get_path(path, open='r', disk_optimal=FALSE, encoding=default)

  • For a given FoundryFS (remote) path, returns the local temporary path.

    Parameters
    • path (str) – The logical path of the file in the dataset. (Remote path)
    • open (str) – A description of the mode in which to open the connection.
    • disk_optimal (bool, optional) – Controls how FoundryFileSystem handles file I/O.
    • encoding (str, optional) – Defaults to the R language default (UTF-8).
    Returns
    • str
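
A local path is useful for readers that expect a file name rather than a connection. The sketch below uses readRDS() as one such reader. The helper name read_rds_from_dataset is hypothetical, fs is assumed to be an R FileSystem object, and the file is assumed to have been saved with saveRDS().

```r
# Hypothetical helper: load a serialized R object stored in a dataset.
read_rds_from_dataset <- function(fs, path) {
  # disk_optimal is left at its default (FALSE), so the file is downloaded
  # before the local path is read.
  local_path <- fs$get_path(path, "r")

  readRDS(local_path)
}
```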

upload(local_path, remote_path)

  • Upload the file from the local to the remote path. Write only.

    Parameters
    • local_path (str) – The local path of the file to upload.
    • remote_path (str) – The logical path of the file in the dataset.
    Returns
    • None
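
One possible pattern is to write a file to a local temporary location and then upload it. In the sketch below, upload_data_frame is a hypothetical helper, fs is assumed to be the FileSystem object of a TransformOutput, and the CSV format is chosen only for illustration.

```r
# Hypothetical helper: save a data.frame as CSV and upload it to a dataset.
upload_data_frame <- function(fs, rdf, remote_path) {
  local_path <- tempfile(fileext = ".csv")
  # Remove the temporary file once the upload call has returned.
  on.exit(unlink(local_path), add = TRUE)

  write.csv(rdf, local_path, row.names = FALSE)
  fs$upload(local_path, remote_path)

  invisible(remote_path)
}
```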

Advanced topic: disk_optimal setting

In the FileSystem methods open() and get_path(), the disk_optimal argument controls how file input and output (I/O) is handled.

By default, disk_optimal is set to FALSE in both open() and get_path(). In this mode, files are guaranteed to be downloaded before they are accessed.

If you set disk_optimal to TRUE, files are downloaded while the code executes rather than beforehand. The temporary local path must be opened via fifo() in order to be read correctly. Note that not all libraries support reading this type of file.

You may choose to set disk_optimal to TRUE when the file you are reading is very large.

For example, imagine we have a very large txt file and only want to read the first 10 lines. The code below prints only the first 10 lines, without reading the entire file.

```r
disk_optimal_example <- function(large_txt_file) {
  fs <- large_txt_file$fileSystem()

  ## Open a connection with fifo()
  ## The text file is titled large_txt_file.txt
  conn <- fs$open("large_txt_file.txt", "r", disk_optimal = TRUE)
  A <- readLines(conn, n = 10)
  print(A)
  return(NULL)
}
```

If you want to use R TransformOutput to write a file and then read it back, disk_optimal must be set to FALSE.
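
A write-then-read round trip might look like the sketch below. The helper name write_then_read is hypothetical, output is assumed to be an R TransformOutput, and writing through open() in "w" mode is an assumption based on the method's mode argument rather than something this page demonstrates.

```r
# Hypothetical helper: write lines to an output dataset, then read them back.
write_then_read <- function(output, path, lines) {
  fs <- output$fileSystem()

  # Assumption: open() in "w" mode yields a writable connection.
  out_conn <- fs$open(path, "w")
  writeLines(lines, out_conn)
  close(out_conn)

  # disk_optimal = FALSE is the default; it is spelled out here because
  # reading back a file that was just written requires it.
  in_conn <- fs$open(path, "r", disk_optimal = FALSE)
  on.exit(close(in_conn), add = TRUE)

  readLines(in_conn)
}
```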