Prepare datasets for download

The following export process is an advanced workflow and should only be performed if you cannot download data directly from the Foundry interface, either through the Actions menu in Code Repositories or by exporting from another Foundry application.

This guide explains how to prepare a CSV for download using transforms in Code Repositories. In some cases, it may be necessary to download a sample of data in CSV format from Foundry without using the Actions menu in the Foundry interface and selecting Download as CSV. In these instances, we recommend preparing the export file during a build as an alternative to using the Actions menu.

Prepare data

The first step in preparing the CSV for download is to create a filtered and cleaned set of data. We recommend performing the following steps:

  1. Ensure the data sample can be exported and follows data export control rules. Specifically, verify that the export adheres to the data governance policies of your Organization.
  2. Filter the data down to the smallest set that meets your objectives. Generally, the uncompressed size of the data in CSV format should be less than the default HDFS block size (128 MB). To achieve this, select only the necessary columns and minimize the number of rows, either by filtering for specific values or by taking a random sample with an arbitrary number of rows (such as 1000).
  3. Cast all columns to string. Since the CSV format lacks an enforced schema (column types and labels are not guaranteed), we recommend casting every column to string. This is especially important for timestamp columns.
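The row-sampling step above converts a target row count into a fraction, since Spark's `DataFrame.sample` expects a fraction between 0 and 1 rather than an absolute number of rows. A minimal plain-Python sketch of that calculation (the `sample_fraction` helper is a hypothetical name, not part of any API), capping the result at 1.0 for the case where the filtered dataset is already smaller than the target:

```python
def sample_fraction(target_rows, total_rows):
    """Return a sampling fraction in [0, 1] suitable for DataFrame.sample()."""
    if total_rows <= 0:
        return 0.0
    # Cap at 1.0: if the dataset has fewer rows than the target,
    # sample everything rather than passing an invalid fraction.
    return min(1.0, target_rows / total_rows)
```

For example, targeting 1,000 rows from a 4,000-row dataset gives a fraction of 0.25, while targeting 1,000 rows from a 500-row dataset gives 1.0.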

The following sample code, from the New York taxi dataset, may help you prepare the data for download:

def prepare_input(my_input_df):
    from pyspark.sql import functions as F

    # Keep only the rows matching the filter value.
    filter_column = "vendor_id"
    filter_value = "CMT"
    df_filtered = my_input_df.filter(F.col(filter_column) == filter_value)

    # Convert the target row count into a sampling fraction, capped at 1.0.
    approx_number_of_rows = 1000
    sample_percent = min(1.0, float(approx_number_of_rows) / df_filtered.count())

    df_sampled = df_filtered.sample(False, sample_percent, seed=0)

    # Select only the necessary columns and cast each one to string.
    important_columns = ["medallion", "tip_amount"]
    return df_sampled.select([F.col(c).cast("string").alias(c) for c in important_columns])

Using similar logic and Spark concepts, you can also implement the preparation in other Spark APIs like SQL or Java.

Set output format and coalesce partitions

Once the data is prepared for export, you can set the output format to CSV. This stores the underlying files of the dataset as CSV files in Foundry; JSON, ORC, Parquet, and text are also supported output formats. Finally, to store the result in a single CSV file, coalesce the data to a single partition before writing.
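As a mental model, Spark writes one part file per partition of the output dataset, so a dataset with N partitions becomes N CSV files on disk; coalescing to a single partition therefore yields a single file. A plain-Python sketch of that relationship (the naming below only mirrors Spark's `part-00000`-style convention; real file names also include job-specific suffixes):

```python
def expected_part_files(num_partitions):
    """One output file per partition, named like Spark's part files."""
    return ["part-%05d.csv" % i for i in range(num_partitions)]
```

With `num_partitions=1`, a single `part-00000.csv` is produced, which is why the examples below call `coalesce(1)` before writing.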

Python

The following example code shows how to coalesce the data in Python:

from transforms.api import transform, Input, Output

@transform(
    output=Output("/path/to/python_csv"),
    my_input=Input("/path/to/input")
)
def my_compute_function(output, my_input):
    output.write_dataframe(
        my_input.dataframe().coalesce(1),
        output_format="csv",
        options={"header": "true"},
    )

SQL

The following example code shows how to coalesce the data in SQL:

CREATE TABLE `/path/to/sql_csv` USING CSV AS SELECT /*+ COALESCE(1) */ * FROM `/path/to/input`

Review the official Spark documentation ↗ for additional CSV generation options.
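The `options` dictionary shown in the Python example above accepts standard Spark CSV writer options as string key-value pairs. A few commonly useful ones (this selection is illustrative, not exhaustive):

```python
# Standard Spark CSV writer options; values are passed as strings.
csv_options = {
    "header": "true",  # write column names as the first row
    "sep": ",",        # field delimiter
    "quote": '"',      # quoting character for fields containing the delimiter
    "nullValue": "",   # how null values are rendered in the output
}
```

Such a dictionary could be passed as the `options` argument to `write_dataframe`; which keys you need depends on your data and on how the downstream consumer parses the file.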

Access the file for download

Once the dataset is built, navigate to the Details tab of the dataset page. The CSV should appear as available for download.

[Screenshot: CSV available for download]