In addition to operating over Foundry datasets that have a defined tabular schema, Code Workbook supports accessing unstructured files in a dataset. This can be useful for analyzing and transforming unstructured data such as images and other types of media, semi-structured formats such as XML or JSON, compressed formats such as GZ or ZIP files, or R data formats like RDA and RDS.
You can read files in a Python transform by reading an upstream dataset as a Python transform input. This API exposes a FileSystem object that allows file access based on the path of a file within the Foundry dataset, abstracting away the underlying storage. Learn more about the FileSystem API. Other information, including the branch and RID (as detailed in the transform input documentation), is also exposed.
Change the type of your input using the input helper bar, or in the inputs tab.
Only imported datasets and persisted datasets can be read as Python transform inputs; transforms that are not saved as a dataset cannot.
By default, datasets without a schema are automatically set to the Python transform input type.
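For instance, here is a minimal sketch of generic file access: list the files in an input and read the first one as text. The input name `raw_dataset` and the helper `read_first_file` are hypothetical, and the text read mode is an assumption.

```python
# A minimal sketch, assuming a schemaless input `raw_dataset` read in as a
# Python transform input. Names and the 'r' text mode are illustrative.
def read_first_file(raw_dataset):
    fs = raw_dataset.filesystem()        # FileSystem object for the dataset
    first = fs.files().collect()[0]      # fs.files() lists file metadata, including paths
    with fs.open(first.path, 'r') as f:  # open a file by its path within the dataset
        return f.read()
```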
For example, the following code reads the CSVs inside a ZIP file and returns their contents as a single dataframe.
```python
import tempfile
import zipfile
import shutil
import io
from pyspark.sql import Row

# datasetOfZippedFiles is a dataset with a single zipped file
# that contains 3 CSVs with the same schema: ["id", "name"].
def sample(datasetOfZippedFiles):
    df = datasetOfZippedFiles
    fs = df.filesystem()  # This is the FileSystem object.
    MyRow = Row("id", "name")

    def process_file(file_status):
        with fs.open(file_status.path, 'rb') as f:
            with tempfile.NamedTemporaryFile() as tmp:
                shutil.copyfileobj(f, tmp)
                tmp.flush()
                with zipfile.ZipFile(tmp) as archive:
                    for filename in archive.namelist():
                        with archive.open(filename) as f2:
                            br = io.BufferedReader(f2)
                            tw = io.TextIOWrapper(br)
                            tw.readline()  # Skip the first line of each CSV
                            for line in tw:
                                yield MyRow(*line.split(","))

    rdd = fs.files().rdd
    rdd = rdd.flatMap(process_file)
    df = rdd.toDF()
    return df
```
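A similar pattern works for the GZ files mentioned above; the sketch below swaps in Python's standard gzip module, and the dataset and column names are hypothetical.

```python
import gzip
from pyspark.sql import Row

# A hypothetical input containing gzipped CSVs with the columns ["id", "name"].
def read_gzipped_csvs(datasetOfGzippedFiles):
    fs = datasetOfGzippedFiles.filesystem()
    MyRow = Row("id", "name")

    def process_file(file_status):
        with fs.open(file_status.path, 'rb') as f:
            with gzip.open(f, 'rt') as tw:  # gzip.open accepts an open file object
                tw.readline()  # Skip the header line of each CSV
                for line in tw:
                    yield MyRow(*line.strip().split(","))

    return fs.files().rdd.flatMap(process_file).toDF()
```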
It is possible to write to an output FileSystem. This can be useful for writing non-tabular data formats such as images, PDFs, and text files.
Call Transforms.get_output() to instantiate a TransformOutput. Learn more about the TransformOutput API.
You can only write files using TransformOutput in nodes that are saved as datasets. You cannot write files using TransformOutput in the console.
Once you have instantiated a TransformOutput and used it by calling filesystem() or another of its methods, any return value other than the TransformOutput object will be ignored.
The following code is an example of how to write a text file:
```python
def write_text_file():
    output = Transforms.get_output()
    output_fs = output.filesystem()
    with output_fs.open('my text file.txt', 'w') as f:
        f.write("Hello world")  # the with block closes the file automatically
```
The following code is an example of how to write a dataset and specify partitioning and output format.
```python
def write_dataset(input_dataset):
    output = Transforms.get_output()
    output.write_dataframe(input_dataset, partition_cols=["colA", "colB"], output_format='csv')
```
You can read files in an R transform by reading an upstream dataset as an R transform input. The TransformInput object is exposed, which allows file access based on the path of a file within the Foundry dataset. Learn more about the FileSystem API.
Change the type of your input using the input helper bar, or in the inputs tab.
Only imported datasets and persisted datasets can be read as R transform inputs; transforms that are not saved as a dataset cannot.
By default, datasets without a schema are automatically set to the R transform input type.
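As a minimal sketch of generic file access in R (using the fs$ls() and fs$get_path() methods that also appear in the examples below; the function and input names are hypothetical), you can list the files in an input and read the first one:

```r
# A minimal sketch, assuming a schemaless input `raw_dataset` read in as an
# R transform input. The function name and file contents are illustrative.
first_file_reader <- function(raw_dataset) {
    fs <- raw_dataset$fileSystem()
    ## Each entry returned by fs$ls() has a `path` field (the file's name)
    first_name <- fs$ls()[[1]]$path
    ## Resolve a local, readable path for that file
    local_path <- fs$get_path(first_name, 'r')
    return(readLines(local_path))
}
```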
Use the code below to load an RDS file from an imported dataset. The RDS contains an R data.frame.
```r
RDS_reader <- function(RDS_dataset) {
    fs <- RDS_dataset$fileSystem()
    ## The name of the file is test_loading_RDS.rds
    path <- fs$get_path("test_loading_RDS.rds", 'r')
    rds <- readRDS(path)
    return(rds)
}
```
Use the code below to rbind the contents of a set of zipped CSVs.
```r
result <- function(zip_file_with_csvs) {
    fs <- zip_file_with_csvs$fileSystem()
    ## Get the remote path (name) of the zipfile
    zipfile_name <- fs$ls()[[1]]$path
    ## Get the local path of the zipfile
    path <- fs$get_path(zipfile_name, 'r')
    # List the zipped files
    zipped_files <- as.list(unzip(path, list = TRUE)$Name)
    # For every element on the list, return a dataframe
    list_of_data_frames <- lapply(zipped_files, function(x) {
        read.csv(unz(path, x), header = TRUE, sep = ",")
    })
    # Bind all of the dataframes together
    rbind_df <- do.call(rbind, list_of_data_frames)
    return(rbind_df)
}
```
It is possible to write to an output FileSystem. This can be useful for writing non-tabular data formats such as images, PDFs, and text files.
Call new.output() to instantiate a TransformOutput. Learn more about the FileSystem API.
You can only write files using TransformOutput in nodes that are saved as datasets. You cannot write files using TransformOutput in the console.
Use the code below to save an R data.frame to an RDS file.
```r
write_rds_file <- function(r_dataframe) {
    output <- new.output()
    output_fs <- output$fileSystem()
    saveRDS(r_dataframe, output_fs$get_path("my_RDS_file.rds", 'w'))
}
```
Use the code below to save a plot to a PDF.
```r
plot_pdf <- function() {
    library(ggplot2)
    theme_set(theme_bw())  # pre-set the bw theme
    data("midwest", package = "ggplot2")

    # Scatterplot
    gg <- ggplot(midwest, aes(x = area, y = poptotal)) +
        geom_point(aes(col = state, size = popdensity)) +
        geom_smooth(method = "loess", se = FALSE) +
        xlim(c(0, 0.1)) +
        ylim(c(0, 500000)) +
        labs(subtitle = "Area Vs Population",
             y = "Population",
             x = "Area",
             title = "Scatterplot",
             caption = "Source: midwest")

    output <- new.output()
    output_fs <- output$fileSystem()
    pdf(output_fs$get_path("my pdf example.pdf", 'w'))
    plot(gg)
    dev.off()  # close the device so the PDF is fully written
}
```
Use the code below to write a TXT file using a connection.
```r
write_txt_file <- function() {
    output <- new.output()
    output_fs <- output$fileSystem()
    conn <- output_fs$open("my file.txt", 'w')
    writeLines(c("Hello", "world"), conn)
    close(conn)  # flush and close the connection
}
```
Use the code below to take the text file at the local path output.txt and upload it to the remote path output_test.txt. In the saved dataset, you will see one file named output_test.txt.
```r
upload <- function() {
    output <- new.output()
    output_fs <- output$fileSystem()
    fileConn <- file("output.txt")
    writeLines(c("Header 1"), fileConn)
    close(fileConn)
    output_fs$upload("output.txt", "output_test.txt")
}
```
Use the code below to write a Spark dataframe that is partitioned by columns A and B.
```r
write_partitioned_df <- function(spark_df) {
    output <- new.output()
    # partition on colA and colB
    output$write.spark.df(spark_df, partition_cols = list("colA", "colB"))
}
```