Interact with data

Code Workspaces allows you to read, explore, analyze, transform, and write back to Foundry datasets.

Code Workspaces requires you to select a dataset alias for every Foundry dataset referenced in the workspace. The dataset alias acts as a reference to a specific dataset branch, allowing you to read from and write to it within your code.

When you register a dataset in the Data tab, Code Workspaces creates a mapping between the chosen dataset alias and the Foundry dataset's unique identifier. This mapping is stored in a hidden file under the /home/user/repo/.foundry folder of the workspace.
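Conceptually, the stored mapping is a small alias-to-identifier lookup. The snippet below is a purely hypothetical illustration of that concept, not the actual format of the hidden file:

```python
# Purely hypothetical illustration of the alias -> dataset mapping;
# the real hidden file's layout is not documented here.
dataset_aliases = {
    "kittens": {
        "rid": "ri.foundry.main.dataset.<uuid>",  # dataset's unique identifier
        "branch": "master",
    },
}
```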

Dataset branching

By default, Code Workspaces loads a dataset's data from the same branch as the workspace itself, falling back to the master branch otherwise. For example, a code workspace currently on branch my-branch will try to read the my-branch version of the dataset, and will fall back to the master branch if my-branch does not exist on the dataset.
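The fallback logic can be summarized by the sketch below; resolve_branch is a hypothetical helper shown for illustration only, not part of the SDK:

```python
def resolve_branch(dataset_branches: set, workspace_branch: str) -> str:
    """Illustrative only: use the workspace's branch when the dataset
    has it, otherwise fall back to master."""
    return workspace_branch if workspace_branch in dataset_branches else "master"

# A workspace on my-branch reads my-branch if the dataset has it...
assert resolve_branch({"master", "my-branch"}, "my-branch") == "my-branch"
# ...and falls back to master otherwise.
assert resolve_branch({"master"}, "my-branch") == "master"
```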

Import data

Code Workspaces allows you to load Foundry tabular and non-tabular datasets.

A new dataset can be registered in a code workspace by using the Import Dataset button in the Data tab.

After selecting the desired dataset, you will be prompted to choose between a tabular or non-tabular dataset, and can optionally modify the dataset's alias in the workspace. Code Workspaces uses these settings to generate a code snippet that loads the dataset accordingly.

  • These settings can be modified at any time even after registering the dataset.
  • By default, Code Workspaces will suggest a dataset alias with the same name as the dataset itself.

Once your dataset alias is set, select Copy and register dataset to register the dataset in the workspace, which will also save the code snippet to your clipboard.

Finally, paste the code snippet in your workspace and execute the code to have the dataset loaded into memory.

Example code snippets

This section provides examples for both tabular and non-tabular datasets.

Tabular datasets

The following snippets are generated for a Cats tabular dataset with a dataset alias kittens. Notice that "Cats" is not referenced anywhere in the code snippet; Code Workspaces implicitly registers it under your chosen dataset alias after selecting Copy and register dataset.

In Jupyter®:

```python
from foundry.transforms import Dataset

kittens = Dataset.get("kittens")\
    .read_table(format="arrow")\
    .to_pandas()
```

Note that the read_table method shown above supports the following values for its format argument:

  • arrow (recommended): Converts the dataset to an Apache Arrow table on which efficient filtering can be performed. You can then convert this table to a pandas dataframe using .to_pandas().
  • pandas or dataframe: Converts the dataset to a pandas dataframe.
  • polars: Converts the dataset to a Polars dataframe. You can then convert it to a pandas dataframe using .to_pandas().
  • lazy-polars: The lazy variant of a Polars dataframe. Filters cannot be executed on lazy Polars dataframes.
  • path: Outputs the local path under which the dataset is stored.

For a given format to be available as part of a read_table operation, the corresponding packages pyarrow, pandas, and polars must be present in the environment. These are included automatically in the default Code Workspaces environment, but may need to be manually added to your custom environment.
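For illustration, here is how the same kittens dataset could be read with each documented format value; this is a sketch assuming the kittens alias registered above:

```python
from foundry.transforms import Dataset

# Apache Arrow table (recommended); convert with .to_pandas() if needed
cats_arrow = Dataset.get("kittens").read_table(format="arrow")

# pandas dataframe directly ("dataframe" is an equivalent value)
cats_pandas = Dataset.get("kittens").read_table(format="pandas")

# Polars dataframe; .to_pandas() converts it if needed
cats_polars = Dataset.get("kittens").read_table(format="polars")

# lazy Polars dataframe (filters cannot be executed in this mode)
cats_lazy = Dataset.get("kittens").read_table(format="lazy-polars")

# local path under which the dataset files are stored
cats_path = Dataset.get("kittens").read_table(format="path")
```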

In RStudio®:

```r
kittens <- datasets.read_table("kittens")
```

The syntax above loads the dataset and automatically collects the data into an R dataframe.

If the data exceeds the workspace's memory capacity, you can apply push-down filters to only load a subset of rows or columns using the following syntax:

```r
library(dplyr) # should be imported by default in .Rprofile

kittens_df <- Dataset.get("kittens") %>%
  # (optional) apply other transformations before collecting
  collect()
```

Non-tabular datasets

The following snippets are generated for a Dogs non-tabular dataset with dataset alias puppies. Notice that Dogs is not referenced anywhere in the code snippet; Code Workspaces implicitly registers it under your chosen alias after selecting Copy and register dataset. Unlike tabular datasets, this gives you access to the dataset's files through a puppies_files variable instead of loading values into a dataframe.

In Jupyter®:

```python
from foundry.transforms import Dataset

puppies = Dataset.get("puppies")
puppies_files = puppies.files().download()
```

In RStudio®:

```r
puppies_files <- datasets.list_files("puppies")
puppies_local_files <- datasets.download_files("puppies", puppies_files)
```

Security

Data security markings applied to datasets are respected in Code Workspaces. Furthermore, a workspace will inherit the markings of all loaded datasets. This means that in order to have access to a code workspace, a user must also have the required permissions to all of the datasets and other inputs contained in the workspace. If you lose access to a single input of the workspace, you will lose access to the entire workspace.

Dataset size limitations

Unlike other tools in Foundry, Code Workspaces does not run by default on Spark. Code Workspaces was designed for quick exploratory analyses rather than for handling very large datasets. Consider using filters to reduce the size of datasets before loading them to prevent issues relating to the memory or CPU limitations of your configured workspace.

Filter dataset files

Code Workspaces enables you to download the files backing any dataset, whether they are tabular (i.e., they have a schema) or non-tabular. It is possible to select a subset of files to download, either by name, or by applying filtering logic on the file metadata (path, size in bytes, transaction RID, and updated time).

In Jupyter®:

```python
from foundry.transforms import Dataset

# Download all files in the dataset
downloaded_files = Dataset.get("my_alias").files().download()
local_file = downloaded_files["file.pdf"]

# Download a single file
local_file = Dataset.get("my_alias").files().get("file.pdf").download()

# Download all PDF files of less than 1MB in the dataset
downloaded_files = Dataset.get("my_alias").files()\
    .filter(lambda f: f.path.endswith(".pdf") and f.size_bytes < 1024 * 1024)\
    .download()
```

When downloading multiple files, we recommend using the filter syntax rather than downloading files individually by name, to leverage parallel downloads.

In RStudio®:

```r
# Download all files in the dataset
all_files <- datasets.list_files("my_alias")
downloaded_files <- datasets.download_files("my_alias", all_files)
local_file <- downloaded_files$`file.pdf`

# Download files by name
downloaded_files <- datasets.download_files("my_alias", c("file1.pdf", "file2.pdf"))

# Download all PDF files of less than 1MB in the dataset
all_files <- datasets.list_files("my_alias")
pdf_files <- all_files[sapply(all_files, function(f) endsWith(f$path, ".pdf") && f$sizeBytes < 1024 * 1024)]
downloaded_files <- datasets.download_files("my_alias", pdf_files)
```

In both cases, downloaded_files will be a map from the file name as defined in the current dataset view (which may contain slashes) to the local path where the file was downloaded. Note that this local path may change, so we recommend relying on the map keys.
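For example, in Jupyter® you can iterate over the returned map and rely only on its keys; this sketch assumes downloaded_files behaves like a standard Python dictionary:

```python
# Keys are file names from the dataset view; values are local paths
for file_name, local_path in downloaded_files.items():
    # Rely on file_name, since local_path may change between downloads
    with open(local_path, "rb") as f:
        header = f.read(16)
    print(f"{file_name}: first bytes {header!r}")
```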

Filter tabular datasets

Code Workspaces enables you to apply filters to datasets prior to loading them into memory. This effectively reduces the memory consumption of the dataframes imported into the workspace, enabling you to focus on the pertinent data subset for your analysis. Code Workspaces provides the flexibility to work with a selection of columns, rows, or both.

When to use filters for tabular datasets

As described in dataset size limitations, Code Workspaces is best suited for rapid exploration and iteration cycles rather than for handling extremely large datasets. We recommend the following guidelines when deciding on how to approach loading data into a workspace:

  • For uncompressed data that fits within the workspace's memory, we recommend loading the dataset without filters and applying data transformations in-memory for maximum efficiency.

  • If uncompressed data exceeds the workspace's memory capacity, we recommend using other Foundry tools like Pipeline Builder to preprocess the data, or loading a limited number of rows using a row limit. It is also possible to use column and row filters to load a subset of the data into memory. These push-down filters are applied before data is loaded into the workspace's memory, reducing the memory footprint of the imported data. However, these filters may take a long time to run depending on the scale and partitioning of the data.

  • If you frequently load the same data subset or use it across multiple workspaces, we recommend persisting that subset as its own dataset so the subset can be loaded without filters (see the sketch after this list). Dataset Parquet files are only downloaded once, while filtering overhead occurs each time data is loaded into the workspace.

  • For more complex data loading requirements, you can download the files backing a dataset using non-tabular dataset syntax and use native Python or R packages to process the file contents.
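As a sketch of the persistence guideline above, the snippet below filters a hypothetical dogs dataset and writes the result to a separate output dataset; it assumes a dogs_subset output alias has been registered as described in the Write data section below:

```python
from foundry.transforms import Dataset, Column

# Push-down filter: only rows with age > 2 are loaded into memory
subset_df = Dataset.get("dogs")\
    .where(Column.get("age") > 2)\
    .read_table(format="pandas")

# Persist the subset so future loads can skip the filtering overhead
# ("dogs_subset" is a hypothetical output dataset alias; see Write data)
Dataset.get("dogs_subset").write_table(subset_df)
```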

Row limit

It is possible to only load a limited number of rows from a dataset.

In Jupyter®:

```python
from foundry.transforms import Dataset

dogs_subset = Dataset.get("dogs")\
    .limit(1000)\
    .read_table(format="pandas")
```

In RStudio®:

```r
library(dplyr) # should be imported by default in .Rprofile

dogs_subset <- Dataset.get("dogs") %>%
  head(1000) %>%
  collect()
```

Column filters

All tabular datasets can be loaded into a workspace with a subset of columns. Consider the following dataset as an example:

| name  | weight | color | age | breed           |
| ----- | ------ | ----- | --- | --------------- |
| Bella | 60     | Brown | 4   | Labrador        |
| Max   | 75     | Black | 7   | German Shepherd |
| Daisy | 30     | White | 2   | Poodle          |

You may want to load this dataset with only the breed and age columns using the syntax below, assuming a dogs dataset was correctly registered in the workspace:

In Jupyter®:

```python
from foundry.transforms import Dataset, Column

# Only load the "breed" and "age" columns
columns_to_load = ["breed", "age"]
breed_and_age_only = Dataset.get("dogs")\
    .select(*columns_to_load)\
    .read_table(format="pandas")

# Only load the "weight" and "color" columns
weight_and_color_only = Dataset.get("dogs")\
    .select("weight", "color")\
    .read_table(format="pandas")
```

In RStudio®:

```r
library(dplyr) # should be imported by default in .Rprofile

# Only load the "weight" and "color" columns
weight_and_color_only <- Dataset.get("dogs") %>%
  select(weight, color) %>%
  collect()
```

Row filters

Tabular datasets can also be loaded into a workspace with a subset of rows that meet certain conditions.

Row filters are applicable only to datasets in Parquet format. Other formats, such as CSV, allow for column filters but not row filters. With the help of a Foundry transform, most tabular datasets can be easily converted to Parquet format.
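For instance, a minimal Foundry Python transform along the lines of this sketch can rewrite a CSV-backed dataset, since dataframe outputs are written as Parquet by default; the dataset paths below are placeholders:

```python
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/Project/folder/dogs_parquet"),    # placeholder output path
    source=Input("/Project/folder/dogs_csv"),  # placeholder input path
)
def convert_to_parquet(source):
    # Returning the dataframe unchanged writes it back as Parquet
    return source
```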

Recall the dogs dataset mentioned earlier:

| name  | weight | color  | age | breed           |
| ----- | ------ | ------ | --- | --------------- |
| Bella | 60     | Brown  | 4   | Labrador        |
| Max   | 75     | Black  | 7   | German Shepherd |
| Daisy | 30     | White  | 2   | Poodle          |
| Buddy | 65     | Yellow | 3   | Labrador        |
| Gizmo | 18     | Brown  | 1   | Pug             |

Row filter syntax in Jupyter®

The syntax below can be used to filter datasets in Jupyter® at the row level.

You may load only brown-colored dogs from the dogs dataset using the following syntax:

```python
from foundry.transforms import Dataset, Column

# Only load dogs of color "Brown"
brown_dogs = Dataset.get("dogs")\
    .where(Column.get("color") == "Brown")\
    .read_table(format="pandas")
```

Notice the use of .where, .select, or .limit to pre-filter the dataset before it is loaded into the workspace. These statements can be chained to apply several conditions at once:

```python
# Only load dogs of color "Brown" and of breed "Labrador"
brown_labradors = Dataset.get("dogs")\
    .where(Column.get("color") == "Brown")\
    .where(Column.get("breed") == "Labrador")\
    .read_table(format="pandas")
```

Below, you can find more examples of acceptable row filtering syntax supported in Jupyter® Code Workspaces:

```python
# only retain rows equal to a certain value
.where(Column.get("column_name") == value)

# only retain rows not equal to a certain value
.where(Column.get("column_name") != value)

# negation using the ~ operator
.where(~(Column.get("column_name") == value))

# only retain rows whose value is comparable to another value
.where(Column.get("column_name") > value)
.where(Column.get("column_name") >= value)
.where(Column.get("column_name") < value)
.where(Column.get("column_name") <= value)

# OR / AND operators
.where((Column.get("column_name") == value1) | (Column.get("column_name") == value2))
.where((Column.get("column_name1") == value1) & (Column.get("column_name2") == value2))

# only retain rows whose value is not null
.where(~Column.get("column_name").isnull())

# only retain rows whose value is part of a given list
.where(Column.get("column_name").isin([value1, value2, value3]))

# only retain rows whose date is between two given inclusive bounds
.where(Column.get("date_column_name").between('lower_bound_incl', 'upper_bound_incl'))

# only retain the first N rows, where N is a number; this is applied before other filters
.limit(N)

# select a subset of columns
.select("column_name1", "column_name2", "column_name3")
```

Row filter syntax in RStudio®

The syntax below can be used to filter datasets in RStudio® at the row level.

RStudio® filters are implemented with the dplyr library and support the standard verbs filter, select, and head. These filters are pushed down, which means they are applied before the data is loaded into the memory of the workspace.

You may load only brown-colored dogs from the dogs dataset using the following syntax:

```r
library(dplyr) # should be imported by default in .Rprofile

# Only load dogs of color "Brown"
brown_dogs <- Dataset.get("dogs") %>%
  foundry::filter(color == "Brown") %>%
  collect()
```

Notice the use of foundry::filter to pre-filter the dataset before it is loaded into the workspace. Technically, the foundry:: prefix is not required, but we recommend using it to avoid potential conflicts with similarly named filter functions from other packages in your environment. These filter statements can be chained to apply several conditions at once using the %>% operator from the dplyr library. This library should be imported by default in the .Rprofile file of your RStudio workspace.

```r
library(dplyr) # should be imported by default in .Rprofile

# Only load dogs of color "Brown" and of breed "Labrador"
brown_labradors <- Dataset.get("dogs") %>%
  foundry::filter(color == "Brown") %>%
  foundry::filter(breed == "Labrador") %>%
  collect()
```

Below, you can find more examples of acceptable row filtering syntax supported in RStudio® Code Workspaces. Column names must be passed to the foundry::filter function without wrapping them with quotation marks.

```r
# only retain rows equal to a certain value
foundry::filter(column_name == "string_value") %>%
foundry::filter(integer_column_name == 4) %>%

# only retain rows not equal to a certain value
foundry::filter(column_name != value) %>%

# negation using the ! operator
foundry::filter(!(column_name == value)) %>%

# only retain rows whose value is comparable to another value
foundry::filter(column_name > value) %>%
foundry::filter(column_name >= value) %>%
foundry::filter(column_name < value) %>%
foundry::filter(column_name <= value) %>%

# OR / AND operators
foundry::filter(column_name == value1 | column_name == value2) %>%
foundry::filter(column_name == value1 & column_name == value2) %>%

# only retain rows whose value is part of a given list
foundry::filter(column_name %in% c("value1", "value2")) %>%

# only retain rows whose value is not null
foundry::filter(!is.na(column_name)) %>%

# only retain rows whose value is between two given inclusive bounds
foundry::filter(between(age, 2, 4)) %>%

# select a subset of columns; if set, must include all columns used in `filter` clauses
select(column_name1, column_name2) %>%

# only retain the first N rows, where N is a number; this is applied before other filters
head(N) %>%
```

Additionally, you may perform advanced data transformations, such as group_by, by temporarily collecting the data as an Arrow table:

```r
library(dplyr) # should be imported by default in .Rprofile

grouped_dogs <- Dataset.get("alias") %>%
  # Simple filters can be pushed down
  foundry::filter(age > 2) %>%
  # Temporarily collect the data as an Arrow table
  collect(as_data_frame = FALSE) %>%
  # Advanced transformations need to be applied on the Arrow table
  group_by(breed) %>%
  collect()
```

Column and row filters together

Column filters and row filters can be used together in order to load in a dataset that has both a subset of its columns and a subset of its rows. Using the dogs dataset mentioned earlier:

| name  | weight | color  | age | breed           |
| ----- | ------ | ------ | --- | --------------- |
| Bella | 60     | Brown  | 4   | Labrador        |
| Max   | 75     | Black  | 7   | German Shepherd |
| Daisy | 30     | White  | 2   | Poodle          |
| Buddy | 65     | Yellow | 3   | Labrador        |
| Gizmo | 18     | Brown  | 1   | Pug             |

The syntax below can be used to get a dataset with the name, breed, and color of brown dogs that exceed a given weight.

In Jupyter®:

```python
from foundry.transforms import Dataset, Column

# Only load dogs whose color is "Brown" and whose weight is above 62
# Only load the columns "name", "breed", and "color"
heavy_brown_dogs = Dataset.get("dogs")\
    .where(Column.get("weight") > 62)\
    .where(Column.get("color") == "Brown")\
    .select("name", "breed", "color")\
    .read_table(format="arrow")\
    .to_pandas()
```

In RStudio®:

```r
library(dplyr) # should be imported by default in .Rprofile

# Only load dogs whose color is "Brown" and whose weight is above 62
# Only load the columns "name", "breed", and "color"
heavy_brown_dogs <- Dataset.get("dogs") %>%
  foundry::filter(weight > 62) %>%
  foundry::filter(color == "Brown") %>%
  select("name", "breed", "color") %>%
  collect()
```

Write data

To schedule your data transformation, view the data lineage, or write incrementally, you can convert your code to a Jupyter® or RStudio® transform.

You can interactively write Foundry datasets with Code Workspaces by following the steps below.

  1. Create a target output dataset by opening the Data tab and selecting the Save to dataset option, found to the right of Import Dataset.
  2. Select a name for the output dataset as well as a location to save the dataset.
  3. Select Save.
  4. A new dataset will appear in the Data tab. By default, the Save to dataset option will be selected; leave it selected for output datasets.
  5. You will also be prompted to specify a dataset alias, which will become the name of the output dataset within the workspace, similar to how aliases work when importing data.
  • For tabular output datasets, you will also be prompted to specify the dataframe variable that will populate the dataset.
  • For non-tabular output datasets, you must instead specify a local file or folder path to upload to the dataset.
  6. Once the dataset type, the dataset alias, and the dataframe variable are set, select Copy and register dataset to register the dataset in the workspace, which will also save the code snippet to your clipboard.
  7. Paste the code snippet in your workspace, replacing the variable as necessary, and execute the code to write to the output dataset.

Transaction types

When writing back interactively, each SDK function call will correspond to one transaction:

  • A SNAPSHOT transaction will be created when writing back tabular data (output_dataset_tabular.write_table(df_variable) in Python or datasets.write_table(df_variable, "output_dataset_tabular") in R).
  • An UPDATE transaction will be created when writing back files (output_dataset_non_tabular.write_file(path_to_file_variable) in Python or datasets.upload_files(path_to_file_variable, "output_dataset_non_tabular") in R).

Once the script has been registered as a transform, interactive calls will start writing to a branch prefixed by code-workspace-sandbox/, while the current branch will be updated when the transform runs. In this case, a single transaction will be created for the full script execution, even if there are multiple SDK function calls:

  • By default, the transaction will be of type SNAPSHOT.
  • If incremental settings have been configured, the transaction will be of type APPEND.

Example code snippets

Following the instructions above, assume that two datasets named output_dataset_tabular and output_dataset_non_tabular were created with variables of the same name, and registered in the workspace. Code Workspaces will generate the following code snippets for each dataset based on your chosen variables:

In Jupyter®:

```python
# tabular snippet
from foundry.transforms import Dataset

output_dataset_tabular = Dataset.get("output_dataset_tabular")
output_dataset_tabular.write_table(df_variable)

# non-tabular snippet
from foundry.transforms import Dataset

output_dataset_non_tabular = Dataset.get("output_dataset_non_tabular")
output_dataset_non_tabular.write_file(path_to_file_variable)
```
In RStudio®:

```r
# tabular snippet
datasets.write_table(df_variable, "output_dataset_tabular")

# non-tabular snippet
datasets.upload_files(path_to_file_variable, "output_dataset_non_tabular")
```