The following export process is an advanced workflow; perform it only if you are unable to download data directly from the Foundry interface, either through the Actions menu in Code Repositories or by exporting from another Foundry application.
This guide explains how to prepare a CSV for download using transforms in Code Repositories. In some cases, it may be necessary to download a sample of data in CSV format without using the Actions menu in the Foundry interface and selecting Download as CSV. In those cases, we recommend preparing the export file during a build instead of using the Actions menu.
The first step in preparing the CSV for download is to create a filtered and cleaned set of data. We recommend filtering the data down to the rows of interest, sampling it to a manageable size, and selecting only the columns you need. Since the CSV format lacks a schema (column types and labels are not enforced), we also recommend casting all columns to string; this is especially important for timestamp columns. The following sample code, based on the New York taxi dataset, may help you prepare the data for download:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType


def prepare_input(my_input_df):
    # Keep only the rows of interest
    filter_column = "vendor_id"
    filter_value = "CMT"
    df_filtered = my_input_df.filter(F.col(filter_column) == filter_value)

    # Sample down to roughly 1,000 rows
    approx_number_of_rows = 1000
    sample_percent = float(approx_number_of_rows) / df_filtered.count()
    df_sampled = df_filtered.sample(False, sample_percent, seed=0)

    # Select the needed columns and cast them to string
    important_columns = ["medallion", "tip_amount"]
    return df_sampled.select([F.col(c).cast(StringType()).alias(c) for c in important_columns])
Using similar logic and Spark concepts, you can also implement the preparation in other Spark APIs like SQL or Java.
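As a sketch of what that preparation might look like in Spark SQL (the paths are hypothetical, and TABLESAMPLE is used here as an approximate substitute for the exact row-count sampling in the Python example):

```sql
-- Hypothetical paths; adjust to your datasets.
-- Take an approximate sample, filter to the rows of interest,
-- and cast the needed columns to string.
CREATE TABLE `/path/to/prepared_input` AS
SELECT
    CAST(medallion AS STRING) AS medallion,
    CAST(tip_amount AS STRING) AS tip_amount
FROM `/path/to/input` TABLESAMPLE (1 PERCENT)
WHERE vendor_id = 'CMT'
```

Note that TABLESAMPLE draws from the table before the WHERE clause is applied, so the resulting row count will differ from the filter-then-sample order used in the Python example.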
Once the data is prepared for export, set the output format to CSV so that the underlying data is saved as CSV files in Foundry. (You can also set the output format to JSON, ORC, Parquet, or text.) Finally, to store the result in a single CSV file, coalesce the data to a single partition before writing.
The following example code shows how to coalesce the data in Python:
from transforms.api import transform, Input, Output


@transform(
    output=Output("/path/to/python_csv"),
    my_input=Input("/path/to/input"),
)
def my_compute_function(output, my_input):
    output.write_dataframe(
        my_input.dataframe().coalesce(1),
        output_format="csv",
        options={"header": "true"},
    )
The following example code shows how to coalesce the data in SQL:
CREATE TABLE `/path/to/sql_csv` USING CSV AS SELECT /*+ COALESCE(1) */ * FROM `/path/to/input`
Review the official Spark documentation ↗ for additional CSV generation options.
Once the dataset is built, navigate to the Details tab of the dataset page. The CSV should appear as available for download.