Code Workspaces allows you to read, explore, analyze, transform, and write back to Foundry datasets.
Code Workspaces requires you to select a dataset alias for every Foundry dataset referenced in the workspace. The dataset alias acts as a reference to a specific dataset branch, allowing you to read from and write to it within your code.
When registering a dataset in the Data tab, Code Workspaces creates a mapping between the chosen dataset alias and the Foundry dataset's unique identifier in a hidden file located under the /home/user/repo/.foundry folder of the workspace.
By default, Code Workspaces will load the data of the dataset from the same branch as the workspace itself, and will otherwise fall back to the master branch. For example, a code workspace currently on branch my-branch will try to read the my-branch version of the dataset, and will fall back to the master branch if my-branch does not exist on the dataset.
To pin a dataset branch for a specific imported dataset, select the options ••• icon next to the dataset in the Data tab of the left sidebar. Then, select Pin dataset branch for reads and choose the desired dataset branch for use within your code workspace.
Code Workspaces allows you to load Foundry tabular and non-tabular datasets.
A new dataset can be registered in a code workspace by using the Import Dataset button in the Data tab.
After selecting the desired dataset, you will be prompted to choose between a tabular or non-tabular dataset, as well as optionally modify the dataset alias of the dataset in the workspace. Code Workspaces uses these settings to generate a code snippet that loads the dataset accordingly.
Once your dataset alias is set, select Copy and register dataset to register the dataset in the workspace, which will also save the code snippet to your clipboard.
Finally, paste the code snippet in your workspace and execute the code to have the dataset loaded into memory.
This page provides examples for tabular datasets and non-tabular datasets.
The following snippets are generated for a Cats tabular dataset with a dataset alias kittens. Notice that "Cats" is not referenced anywhere in the code snippet; Code Workspaces implicitly registers it under your chosen dataset alias after you select Copy and register dataset.
In Jupyter®:
from foundry.transforms import Dataset

kittens = Dataset.get("kittens")\
    .read_table(format="arrow")\
    .to_pandas()
Note that the read_table method shown above supports the following format values:

- arrow (recommended): Converts the dataset to an Apache Arrow table on which efficient filtering can be performed. You can then convert this table to a pandas dataframe using .to_pandas().
- pandas or dataframe: Converts the dataset to a pandas dataframe.
- polars: Converts the dataset to a Polars dataframe. You can then convert it to a pandas dataframe using .to_pandas().
- lazy-polars: The lazy variant of a Polars dataframe. Filters cannot be executed on lazy Polars dataframes.
- path: Outputs the local path under which the dataset is stored.

For a given format to be available as part of a read_table operation, the corresponding packages pyarrow, pandas, and polars must be present in the environment. These are included automatically in the default Code Workspaces environment, but may need to be manually added to your custom environment.
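For instance, a minimal sketch reading the kittens dataset registered above with a few of these formats, assuming pyarrow, pandas, and polars are present in the environment:

from foundry.transforms import Dataset

# Arrow table (recommended): filter efficiently, then convert to pandas
arrow_table = Dataset.get("kittens").read_table(format="arrow")
kittens_from_arrow = arrow_table.to_pandas()

# pandas dataframe, loaded directly
kittens_pandas = Dataset.get("kittens").read_table(format="pandas")

# Polars dataframe, convertible to pandas with .to_pandas()
kittens_polars = Dataset.get("kittens").read_table(format="polars")

# Local path under which the dataset files are stored
kittens_path = Dataset.get("kittens").read_table(format="path")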
In RStudio®:
kittens <- datasets.read_table("kittens")
The syntax above loads the dataset and automatically collects the data into an R dataframe.
If the data exceeds the workspace's memory capacity, you can apply push-down filters to only load a subset of rows or columns using the following syntax:
library(dplyr) # should be imported by default in .Rprofile

kittens_df <- Dataset.get("kittens") %>%
    # (optional) apply other transformations before collecting
    collect()
The following snippets are generated for a Dogs non-tabular dataset with dataset alias puppies. Notice that Dogs is not referenced anywhere in the code snippet; Code Workspaces implicitly registers it under your chosen name after you select Copy and register dataset. Contrary to tabular datasets, this gives you access to the files from the dataset in a puppies_files variable instead of inserting values inside a dataframe.
In Jupyter®:
from foundry.transforms import Dataset

puppies = Dataset.get("puppies")
puppies_files = puppies.files().download()
In RStudio®:
puppies_files <- datasets.list_files("puppies")
puppies_local_files <- datasets.download_files("puppies", puppies_files)
Data security markings applied to datasets are respected in Code Workspaces. Furthermore, a workspace will inherit the markings of all loaded datasets. This means that in order to have access to a code workspace, a user must also have the required permissions to all of the datasets and other inputs contained in the workspace. If you lose access to a single input of the workspace, you will lose access to the entire workspace.
Unlike other tools in Foundry, Code Workspaces does not run by default on Spark. Code Workspaces was designed for quick exploratory analyses rather than for handling very large datasets. Consider using filters to reduce the size of datasets before loading them to prevent issues relating to the memory or CPU limitations of your configured workspace.
Code Workspaces enables you to download the files backing any dataset, whether they are tabular (i.e., they have a schema) or non-tabular. It is possible to select a subset of files to download, either by name, or by applying filtering logic on the file metadata (path, size in bytes, transaction RID, and updated time).
In Jupyter®:
from foundry.transforms import Dataset

# Download all files in the dataset
downloaded_files = Dataset.get("my_alias").files().download()
local_file = downloaded_files["file.pdf"]

# Download a single file
local_file = Dataset.get("my_alias").files().get("file.pdf").download()

# Download all PDF files of less than 1MB in the dataset
downloaded_files = Dataset.get("my_alias").files()\
    .filter(lambda f: f.path.endswith(".pdf") and f.size_bytes < 1024 * 1024)\
    .download()
When downloading multiple files, we recommend using the filter syntax rather than downloading files individually by name, to leverage parallel downloads.
In RStudio®:
# Download all files in the dataset
all_files <- datasets.list_files("my_alias")
downloaded_files <- datasets.download_files("my_alias", all_files)
local_file <- downloaded_files$`file.pdf`

# Download files by name
downloaded_files <- datasets.download_files("my_alias", c("file1.pdf", "file2.pdf"))

# Download all PDF files of less than 1MB in the dataset
# (assumes each listed file exposes $path and $sizeBytes fields)
all_files <- datasets.list_files("my_alias")
pdf_files <- all_files[sapply(all_files, function(f) endsWith(f$path, ".pdf") && f$sizeBytes < 1024 * 1024)]
downloaded_files <- datasets.download_files("my_alias", pdf_files)
In both cases, downloaded_files will be a map from the file name as defined in the current dataset view (which may contain slashes) to the local path where the file was downloaded. Note that this local path may change, so it is recommended to rely on the map keys.
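For instance, a minimal sketch in Python, assuming the returned map behaves like a standard dictionary (the my_alias alias is hypothetical):

from foundry.transforms import Dataset

downloaded_files = Dataset.get("my_alias").files().download()

# Keys are the file names from the dataset view; values are local paths
for dataset_file_name, local_path in downloaded_files.items():
    print(f"{dataset_file_name} was downloaded to {local_path}")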
Code Workspaces enables you to apply filters to datasets prior to loading them into memory. This effectively reduces the memory consumption of the dataframes imported into the workspace, enabling you to focus on the pertinent data subset for your analysis. Code Workspaces provides the flexibility to work with a selection of columns, rows, or both.
As described in dataset size limitations, Code Workspaces is best suited for rapid exploration and iteration cycles rather than for handling extremely large datasets. We recommend the following guidelines when deciding how to approach loading data into a workspace:

- For uncompressed data that fits within the workspace's memory, we recommend loading the dataset without filters and applying data transformations in-memory for maximum efficiency.
- If uncompressed data exceeds the workspace's memory capacity, we recommend using other Foundry tools like Pipeline Builder to preprocess the data, or loading a limited number of rows using a row limit. It is also possible to use column and row filters to load a subset of the data into memory. These push-down filters are applied before data is loaded into the workspace's memory, reducing the memory footprint of the imported data. However, these filters may take a long time to run depending on the scale and partitioning of the data.
- If you frequently load the same data subset or use it across multiple workspaces, we recommend persisting that subset as its own dataset so the subset can be loaded without filters. Dataset parquet files are only downloaded once, while filtering overhead occurs each time data is loaded into the workspace.

For more complex data loading requirements, you can download the files backing a dataset using non-tabular dataset syntax and use native Python or R packages to process the file contents, as sketched below.
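For example, a minimal sketch of that approach in Python, using a hypothetical my_alias dataset alias:

from foundry.transforms import Dataset

# Download the files backing the dataset registered under "my_alias"
downloaded_files = Dataset.get("my_alias").files().download()

# Process each file's contents with native Python
for dataset_file_name, local_path in downloaded_files.items():
    with open(local_path, "rb") as f:
        first_bytes = f.read(16)  # e.g., inspect a file signature
    print(dataset_file_name, first_bytes)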
It is possible to only load a limited number of rows from a dataset.
In Jupyter®:
from foundry.transforms import Dataset

dogs_subset = Dataset.get("dogs")\
    .limit(1000)\
    .read_table(format="pandas")
In RStudio®:
library(dplyr) # should be imported by default in .Rprofile

dogs_subset <- Dataset.get("dogs") %>%
    head(1000) %>%
    collect()
All tabular datasets can be loaded into a workspace with a subset of columns. Consider the following dataset as an example:
name | weight | color | age | breed |
---|---|---|---|---|
Bella | 60 | Brown | 4 | Labrador |
Max | 75 | Black | 7 | German Shepherd |
Daisy | 30 | White | 2 | Poodle |
You may want to load this dataset only with the breed and age columns using the syntax below, assuming a dogs dataset was correctly registered into the workspace:
In Jupyter®:
from foundry.transforms import Dataset, Column

# Only load the "breed" and "age" columns
columns_to_load = ["breed", "age"]
breed_and_age_only = Dataset.get("dogs")\
    .select(*columns_to_load)\
    .read_table(format="pandas")

# Only load the "weight" and "color" columns
weight_and_color_only = Dataset.get("dogs")\
    .select("weight", "color")\
    .read_table(format="pandas")
In RStudio®:
library(dplyr) # should be imported by default in .Rprofile

# Only load the "weight" and "color" columns
weight_and_color_only <- Dataset.get("dogs") %>%
    select(weight, color) %>%
    collect()
Tabular datasets can also be loaded into a workspace with a subset of rows that meet certain conditions.
Row filters are applicable only to datasets in Parquet format. Other formats, such as CSV, allow for column filters but not row filters. With the help of a Foundry transform, most tabular datasets can be easily converted to Parquet format.
Recall the dogs dataset mentioned earlier:
name | weight | color | age | breed |
---|---|---|---|---|
Bella | 60 | Brown | 4 | Labrador |
Max | 75 | Black | 7 | German Shepherd |
Daisy | 30 | White | 2 | Poodle |
Buddy | 65 | Yellow | 3 | Labrador |
Gizmo | 18 | Brown | 1 | Pug |
The syntax below can be used to filter datasets in Jupyter® at the row level.
You may only load brown-colored dogs from the dogs dataset using the following syntax:
from foundry.transforms import Dataset, Column

# Only load dogs of color "Brown"
brown_dogs = Dataset.get("dogs")\
    .where(Column.get("color") == "Brown")\
    .read_table(format="pandas")
Notice the use of .where, .select, or .limit to pre-filter the dataset before it gets loaded into the workspace. These statements can be chained to apply several conditions at once:
# Only load dogs of color "Brown" and of breed "Labrador"
brown_labradors = Dataset.get("dogs")\
    .where(Column.get("color") == "Brown")\
    .where(Column.get("breed") == "Labrador")\
    .read_table(format="pandas")
Below, you can find more examples of acceptable row filtering syntax supported in Jupyter® Code Workspaces:
# only retain rows equal to a certain value
.where(Column.get("column_name") == value)

# only retain rows not equal to a certain value
.where(Column.get("column_name") != value)

# inequality using the ~ operator
.where(~(Column.get("column_name") == value))

# only retain rows whose value is comparable to another value
.where(Column.get("column_name") > value)
.where(Column.get("column_name") >= value)
.where(Column.get("column_name") < value)
.where(Column.get("column_name") <= value)

# OR / AND operators
.where((Column.get("column_name") == value1) | (Column.get("column_name") == value2))
.where((Column.get("column_name1") == value1) & (Column.get("column_name2") == value2))

# only retain rows whose value is not null
.where(~Column.get("column_name").isnull())

# only retain rows whose value is part of a given list
.where(Column.get("column_name").isin([value1, value2, value3]))

# only retain rows whose date is between two given inclusive bounds
.where(Column.get("date_column_name").between('lower_bound_incl', 'upper_bound_incl'))

# only retain the first N rows, where N is a number; this is applied before other filters
.limit(N)

# select a subset of columns
.select("column_name1", "column_name2", "column_name3")
The syntax below can be used to filter datasets in RStudio® at the row level.
RStudio® filters are implemented through the dplyr library and support the standard filter, select, and head methods. These filters are pushed down, which means they are applied before the data gets loaded into the memory of the workspace.
You may load only brown-colored dogs from the dogs dataset using the following syntax:
library(dplyr) # should be imported by default in .Rprofile

# Only load dogs of color "Brown"
brown_dogs <- Dataset.get("dogs") %>%
    foundry::filter(color == "Brown") %>%
    collect()
Notice the use of foundry::filter to pre-filter the dataset before it gets loaded into the workspace. Technically, the foundry:: prefix is not required, but we recommend using it to avoid potential conflicts with similarly named filter functions from other packages in your environment. These filter statements can be chained to apply several conditions at once using the %>% operator from the dplyr library. This library should be imported by default in the .Rprofile file of your RStudio® workspace.
library(dplyr) # should be imported by default in .Rprofile

# Only load dogs of color "Brown" and of breed "Labrador"
brown_labradors <- Dataset.get("dogs") %>%
    foundry::filter(color == "Brown") %>%
    foundry::filter(breed == "Labrador") %>%
    collect()
Below, you can find more examples of acceptable row filtering syntax supported in RStudio® Code Workspaces. Column names must be passed to the foundry::filter function without wrapping them in quotation marks.
# only retain rows equal to a certain value
foundry::filter(column_name == "string_value") %>%
foundry::filter(integer_column_name == 4) %>%

# only retain rows not equal to a certain value
foundry::filter(column_name != value) %>%

# inequality using the ! operator
foundry::filter(!(column_name == value)) %>%

# only retain rows whose value is comparable to another value
foundry::filter(column_name > value) %>%
foundry::filter(column_name >= value) %>%
foundry::filter(column_name < value) %>%
foundry::filter(column_name <= value) %>%

# OR / AND operators
foundry::filter(column_name == value1 | column_name == value2) %>%
foundry::filter(column_name == value1 & column_name == value2) %>%

# only retain rows whose value is part of a given list
foundry::filter(column_name %in% c("value1", "value2")) %>%

# only retain rows whose value is not null
foundry::filter(!is.na(column_name)) %>%

# only retain rows whose value is between two given inclusive bounds
foundry::filter(between(age, 2, 4)) %>%

# select a subset of columns
select(column_name1, column_name2) %>% # if set, must include all columns used in `filter` clauses

# only retain the first N rows, where N is a number; this is applied before other filters
head(N) %>%
Additionally, you may perform advanced data transformations, such as group_by, by temporarily collecting the data as an Arrow table:
library(dplyr) # should be imported by default in .Rprofile

grouped_dogs <- Dataset.get("alias") %>%
    # Simple filters can be pushed down
    foundry::filter(age > 2) %>%
    # Temporarily collect the data as an Arrow table
    collect(as_data_frame = FALSE) %>%
    # Advanced transformations need to be applied on the Arrow table
    group_by(breed) %>%
    collect()
Column filters and row filters can be used together to load a dataset with both a subset of its columns and a subset of its rows. Using the dogs dataset mentioned earlier:
name | weight | color | age | breed |
---|---|---|---|---|
Bella | 60 | Brown | 4 | Labrador |
Max | 75 | Black | 7 | German Shepherd |
Daisy | 30 | White | 2 | Poodle |
Buddy | 65 | Yellow | 3 | Labrador |
Gizmo | 18 | Brown | 1 | Pug |
The syntax below can be used to get a dataset with the name, breed, and color of brown dogs that exceed a given weight.
In Jupyter®:
from foundry.transforms import Dataset, Column

# Only load dogs whose color is "Brown" and whose weight is above 62
# Only load the columns "name", "breed", and "color"
heavy_brown_dogs = Dataset.get("dogs")\
    .where(Column.get("weight") > 62)\
    .where(Column.get("color") == "Brown")\
    .select("name", "breed", "color")\
    .read_table(format="arrow")\
    .to_pandas()
In RStudio®:
library(dplyr) # should be imported by default in .Rprofile

# Only load dogs whose color is "Brown" and whose weight is above 62
# Only load the columns "name", "breed", and "color"
heavy_brown_dogs <- Dataset.get("dogs") %>%
    foundry::filter(weight > 62) %>%
    foundry::filter(color == "Brown") %>%
    select("name", "breed", "color") %>%
    collect()
To schedule your data transformation, view the data lineage, or write incrementally, you can convert your code to a Jupyter® or RStudio® transform.
You can interactively write Foundry datasets with Code Workspaces by following the steps below.
When writing back interactively, each SDK function call will correspond to one transaction by default:

- A SNAPSHOT transaction will be created when writing back tabular data (output_dataset_tabular.write_table(df_variable) in Python or datasets.write_table(df_variable, "output_dataset_tabular") in R).
- An UPDATE transaction will be created when writing back files (output_dataset_non_tabular.upload_directory(path_to_file_variable) in Python or datasets.upload_files(path_to_file_variable, "output_dataset_non_tabular") in R).

Once the script has been registered as a transform, interactive calls will start writing to a branch prefixed by code-workspace-sandbox/, while the current branch will be updated when the transform runs. In this case, a single transaction will be created for the full script execution, even if there are multiple SDK function calls:

- Tabular data will be written back in a single SNAPSHOT transaction.
- Files will be written back in a single APPEND transaction.

Following the instructions above, assume that two datasets named output_dataset_tabular and output_dataset_non_tabular were created with variables of the same name and registered in the workspace. Code Workspaces will generate the following code snippets for each dataset based on your chosen variables:
# tabular snippet
from foundry.transforms import Dataset

output_dataset_tabular = Dataset.get("output_dataset_tabular")
output_dataset_tabular.write_table(df_variable)

# non-tabular snippet
from foundry.transforms import Dataset

output_dataset_non_tabular = Dataset.get("output_dataset_non_tabular")
output_dataset_non_tabular.upload_directory(path_to_file_variable)
And in R:
# tabular snippet
datasets.write_table(df_variable, "output_dataset_tabular")

# non-tabular snippet
datasets.upload_files(path_to_file_variable, "output_dataset_non_tabular")
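As an illustration, a minimal end-to-end sketch in Python, reusing the dogs dataset from earlier and assuming the output_dataset_tabular output was registered as above (the unit conversion is purely illustrative):

from foundry.transforms import Dataset, Column

# Read a filtered subset of the input dataset into pandas
dogs = Dataset.get("dogs")\
    .where(Column.get("age") >= 2)\
    .read_table(format="pandas")

# Transform the data in memory (illustrative: convert pounds to kilograms)
dogs["weight_kg"] = dogs["weight"] * 0.4536

# Write the result back interactively; by default this single
# write_table call creates one SNAPSHOT transaction
Dataset.get("output_dataset_tabular").write_table(dogs)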