Interact with data

Jupyter® and RStudio® Code Workspaces allow you to read, analyze, transform, and write back to Foundry datasets. They also allow you to read and analyze restricted views and time series properties.

Security

Code Workspaces respects Foundry markings applied to datasets, and a workspace will inherit the markings of all datasets loaded into it. This is referred to as the workspace lineage. This means that to access a workspace, you must also have the required permissions for all of the datasets and other inputs contained in the workspace. If you lose access to a single input of the workspace, you will lose access to the entire workspace.

Resource considerations

Jupyter® and RStudio® workspaces are designed for interactive, analytical workflows and application development. They run on a single virtual machine and do not process datasets with a distributed Spark environment. As a result, the data you work with must fit into your machine's resources, which default to a maximum of 8 CPUs and 64 GB of memory. Using filters or SQL to reduce the amount of data you need before loading large datasets can help you avoid exceeding these limits. Filtering dataset files is another strategy to stay within your workspace's limits.

To develop large-scale data pipelines, use Python transforms with Spark or Pipeline Builder.

Dataset branching

By default, Code Workspaces will load the data of the dataset from the same branch as the workspace itself, and will otherwise fall back to the master branch. For example, a code workspace currently on branch my-branch will try to read the my-branch version of the dataset, and will fall back to the master branch if my-branch does not exist on the dataset.

To pin a dataset branch for a specific imported dataset, select the options ••• icon next to the dataset in the Data tab of the left sidebar. Then, select Pin dataset branch for reads and choose the desired dataset branch for use within your code workspace.

Read data

Code Workspaces requires you to select an alias for every Foundry dataset or restricted view that you import into a workspace. The alias acts as a reference that allows you to read from a dataset or restricted view and write to a dataset within your code. When registering a dataset in the Data tab, Code Workspaces creates a mapping between the chosen dataset alias and the Foundry resource's unique identifier in a hidden file located under the /home/user/repo/.foundry folder of the workspace.

Code Workspaces allows you to load tabular datasets, non-tabular datasets, and restricted views. You can add a new dataset or restricted view to your workspace by using the Add > Read data button in the Data tab.

After selecting a dataset to add to your workspace, you must define an alias for that resource to serve as a unique identifier and reference in your code. Choosing a read strategy for the dataset (such as pandas DataFrame, Polars LazyFrame, or raw file access) generates a code snippet that loads the dataset as the specified input type.

  • You can modify this read strategy and generate a new snippet at any time, even after registering the dataset.
  • By default, code workspaces will suggest a dataset alias with the same name as the dataset itself.

Selecting Done will complete the dataset registration process and allow you to use the data in your code with the Run snippet or Copy to clipboard options.

This page provides examples for tabular datasets, non-tabular datasets, and restricted views.

Tabular datasets

The following snippets are generated for a Cats tabular dataset with a dataset alias kittens. Notice that Cats is not referenced anywhere in the code snippet; Code Workspaces implicitly registers it under your chosen alias.

In Jupyter®:

from foundry.transforms import Dataset

kittens = Dataset.get("kittens")\
    .read_table(format="arrow")\
    .to_pandas()

Note that the format argument of the read_table method shown above supports the following values:

  • arrow (recommended): Converts the dataset to an Apache Arrow table on which efficient filtering can be performed. You can then convert this table to a pandas dataframe using .to_pandas().
  • pandas or dataframe: Converts the dataset to a pandas dataframe.
  • polars: Converts the dataset to a Polars dataframe. You can then convert it to a pandas dataframe using .to_pandas().
  • lazy-polars: The lazy variant of a Polars dataframe. Filters cannot be executed on lazy Polars dataframes.
  • path: Outputs the local path under which the dataset is stored.

For a given format to be available in a read_table operation, the corresponding packages (pyarrow, pandas, and polars) must be present in the environment. These are included automatically in the default Code Workspaces environment, but may need to be manually added to your custom environment.
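
As a sketch, assuming the kittens alias registered above and the relevant packages installed in your environment, the other formats follow the same pattern:

from foundry.transforms import Dataset

# Load as a Polars dataframe, then convert to pandas if needed
kittens_polars = Dataset.get("kittens").read_table(format="polars")
kittens_pandas = kittens_polars.to_pandas()

# Get the local path under which the dataset files are stored
kittens_path = Dataset.get("kittens").read_table(format="path")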

In RStudio®:

kittens <- datasets.read_table("kittens")

The syntax above loads the dataset and automatically collects the data into an R dataframe.

If the data exceeds the workspace's memory capacity, you can apply push-down filters to only load a subset of rows or columns using the following syntax:

library(dplyr) # should be imported by default in .Rprofile

kittens_df <- Dataset.get("kittens") %>%
    # (optional) apply other transformations before collecting
    collect()

Query tabular datasets with SQL

You can use SQL in Python to query any tabular dataset that supports SQL queries.

The following snippets are generated for a Ducks tabular dataset with a dataset alias ducklings. Notice that Ducks is not referenced anywhere in the code snippet; Code Workspaces implicitly registers it under your chosen alias.

You may only query one dataset per SELECT Spark SQL statement using the approaches below. Queries return the results as a PyArrow table ↗ that you can then convert to another format, such as a pandas DataFrame.

  • Note that the default limit for SQL queries is one million rows. If you need to query more than one million rows at a time, you can parallelize separate queries and combine their results, as shown in the sketch after the Jupyter® executor example below. Ensure that your data can fit within your workspace’s memory.

In Jupyter®:

After installing the containers_sql package, you may use the FoundrySdkSqlExecutor to write Spark SQL:

from containers_sql import FoundrySdkSqlExecutor

sql = FoundrySdkSqlExecutor()
query = sql.execute("SELECT * from `ducklings`")
df = query.fetch_results()  # optionally specify timeout arg: query.fetch_results(timeout_in_seconds=60)
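
As a rough sketch of the parallelization approach mentioned above, the following splits a query over disjoint ranges of a numeric column and concatenates the PyArrow results. The duck_id column and the range bounds are hypothetical, and whether the executor supports concurrent calls is an assumption; the same pattern also works sequentially:

import pyarrow as pa
from concurrent.futures import ThreadPoolExecutor
from containers_sql import FoundrySdkSqlExecutor

sql = FoundrySdkSqlExecutor()

# Hypothetical disjoint ranges over a numeric duck_id column; adjust to your data
ranges = [(0, 1_000_000), (1_000_000, 2_000_000), (2_000_000, 3_000_000)]

def fetch_range(bounds):
    lower, upper = bounds
    query = sql.execute(
        f"SELECT * FROM `ducklings` WHERE duck_id >= {lower} AND duck_id < {upper}"
    )
    return query.fetch_results()

# Run the bounded queries in parallel and combine the PyArrow tables;
# the combined result must still fit within the workspace's memory
with ThreadPoolExecutor(max_workers=3) as pool:
    tables = list(pool.map(fetch_range, ranges))

ducklings_df = pa.concat_tables(tables).to_pandas()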

You may also use a magic command ↗ by first running the following at the top of your notebook:

import containers_sql
%reload_ext containers_sql.foundry_sdk.magics

Then, use either the %sql magic command to run a single-line query or %%sql to run a multi-line query. Using %%sql -o df will assign the result of your SQL query to the Python object df as a PyArrow table:

ducklings_df = %sql SELECT hatch_date, bill_color, * from `ducklings`
ducklings_pandas_df = ducklings_df.to_pandas()

%%sql -o ducklings_df
SELECT hatch_date,
       bill_color,
       *
FROM `ducklings`
LIMIT 10;

In RStudio®:

First install containers-sql, foundry-platform-sdk, pyarrow, and reticulate. Then use the reticulate package with the Python SQL executor described above to query a dataset:

library(reticulate)

FoundrySdkSqlExecutor <- import("containers_sql")$FoundrySdkSqlExecutor
sql <- FoundrySdkSqlExecutor()
query <- sql$execute("SELECT * from `ducklings`")
results_df <- query$fetch_results()

# For smaller data, creating an intermediate R object is acceptable.
r_list <- py_to_r(results_df$to_pydict())
tibble <- as_tibble(r_list)

# For larger data, you may want to write your data to disk
# to avoid exceeding your container's memory limits.
library(arrow)
pa <- import("pyarrow")
pa$feather$write_feather(results_df, "tmp.arrow")
tibble <- read_feather("tmp.arrow") |> as_tibble()

Non-tabular datasets

The following snippets are generated for a Dogs non-tabular dataset with dataset alias puppies. Notice that Dogs is not referenced anywhere in the code snippet; Code Workspaces implicitly registers it under your chosen alias. Unlike reading tabular datasets, this approach gives you access to the dataset's files through a puppies_files variable rather than loading values into a dataframe.

In Jupyter®:

from foundry.transforms import Dataset

puppies = Dataset.get("puppies")
puppies_files = puppies.files().download()

In RStudio®:

puppies_files <- datasets.list_files("puppies")
puppies_local_files <- datasets.download_files("puppies", puppies_files)

See Filter dataset files for details on how to target a certain subset of files to download into your workspace.

Restricted views

Querying restricted views is only supported in Jupyter® workspaces.

You can query restricted views from Python using the SQL approach described above. To do this, import a restricted view into your workspace and query it by its alias, as with any other tabular dataset.

Note that importing a restricted view into your code workspace requires enabling restricted outputs mode to enforce a higher level of data security. With restricted outputs mode enabled, you can perform analysis on data contained in a restricted view but cannot publish outputs from your workspace that use the data.

When adding a restricted view to your workspace for the first time, a prompt will instruct you to install specific Python dependencies. You can then define an alias (as with a dataset) and add the restricted view to your workspace. You must enable restricted outputs mode and restart your workspace to query the restricted view.

A restricted view's policy determines what data a user can query. For example, if you import a restricted view into a workspace and another user opens that workspace, they may see different data than you. If your access changes while a restricted view is loaded in a workspace, you will lose access to that workspace and will need to restart the workspace.
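
For example, assuming a restricted view registered under the hypothetical alias kittens_rv and restricted outputs mode enabled, the SQL executor pattern shown earlier applies unchanged:

from containers_sql import FoundrySdkSqlExecutor

sql = FoundrySdkSqlExecutor()
# The rows returned depend on the restricted view's policy for your user
query = sql.execute("SELECT * FROM `kittens_rv`")
kittens_rv_df = query.fetch_results().to_pandas()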

Filter dataset files

Code Workspaces enables you to download the files backing any dataset, whether they are tabular (i.e., they have a schema) or non-tabular. It is possible to select a subset of files to download, either by name, or by applying filtering logic on the file metadata (path, size in bytes, transaction RID, and updated time).

In Jupyter®:

from foundry.transforms import Dataset

# Download all files in the dataset
downloaded_files = Dataset.get("my_alias").files().download()
local_file = downloaded_files["file.pdf"]

# Download a single file
local_file = Dataset.get("my_alias").files().get("file.pdf").download()

# Download all PDF files of less than 1MB in the dataset
downloaded_files = Dataset.get("my_alias").files()\
    .filter(lambda f: f.path.endswith(".pdf") and f.size_bytes < 1024 * 1024)\
    .download()

When downloading multiple files, you should use the filter syntax rather than downloading files individually by name to leverage parallel downloads.
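
For example, to fetch a known set of files in one parallel download rather than one by one (file names are hypothetical):

from foundry.transforms import Dataset

# One filtered download call fetches all matching files in parallel
wanted = {"reports/a.pdf", "reports/b.pdf"}  # hypothetical file names
downloaded_files = Dataset.get("my_alias").files()\
    .filter(lambda f: f.path in wanted)\
    .download()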

In RStudio®:

# Download all files in the dataset
all_files <- datasets.list_files("my_alias")
downloaded_files <- datasets.download_files("my_alias", all_files)
local_file <- downloaded_files$`file.pdf`

# Download files by name
downloaded_files <- datasets.download_files("my_alias", c("file1.pdf", "file2.pdf"))

# Download all PDF files of less than 1MB in the dataset
all_files <- datasets.list_files("my_alias")
pdf_files <- all_files[sapply(all_files, function(f) endsWith(f$path, ".pdf") && f$sizeBytes < 1024 * 1024)]
downloaded_files <- datasets.download_files("my_alias", pdf_files)

In both cases, downloaded_files will be a map from file name as defined in the current dataset view (which may contain slashes) to the local path where the file was downloaded. Note that this local path may change, so it is recommended to rely on the map keys.
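
For instance, a minimal Python sketch of iterating over that map, assuming it supports standard dict-style iteration and lookup (an assumption):

# Iterate over dataset file names (the stable keys) and their local download paths
for file_name in downloaded_files:
    local_path = downloaded_files[file_name]
    print(file_name, "->", local_path)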

Filter tabular datasets

Code Workspaces enables you to apply filters to datasets prior to loading them into memory. This reduces the memory consumption of the dataframes imported into the workspace, enabling you to focus on the subset of data relevant for your analysis. Code Workspaces provides the flexibility to work with a selection of columns, rows, or both.

When to use filters for tabular datasets

To ensure that you stay within your workspace's available resources, we recommend the following:

  • When your uncompressed data fits within your workspace's memory, you can load datasets without filters and apply transformations in-memory for maximum efficiency.

  • When your uncompressed data exceeds your workspace's memory, you can use column and row filters to load a subset of the data into memory. These push-down filters are applied before data is loaded into the workspace, reducing the memory footprint of the imported data. The speed of these filter operations depends on the scale and partitioning of the data. You can also use SQL with Python to query tabular datasets of any size as long as the resulting data fits within your machine's memory limits.

    • If this approach does not fit your use case, you can also use a Spark-based application to process your data and produce a smaller dataset that you can work with in Code Workspaces. This often speeds up interactive workflows that depend on using the same subset of data frequently or across multiple workspaces, because Parquet files from datasets are downloaded only once while filtering occurs each time data is loaded into the workspace.
  • For more complex data loading requirements, you can download the files backing a dataset using non-tabular dataset syntax and use native Python or R packages to process the file contents.

Row limit

It is possible to only load a limited number of rows from a dataset.

In Jupyter®:

from foundry.transforms import Dataset

dogs_subset = Dataset.get("dogs")\
    .limit(1000)\
    .read_table(format="pandas")

In RStudio®:

library(dplyr) # should be imported by default in .Rprofile

dogs_subset <- Dataset.get("dogs") %>%
    head(1000) %>%
    collect()

Column filters

All tabular datasets can be loaded into a workspace with a subset of columns. Consider the following dataset as an example:

name    weight    color    age    breed
Bella   60        Brown    4      Labrador
Max     75        Black    7      German Shepherd
Daisy   30        White    2      Poodle

You may want to load this dataset only with the breed and age columns using the syntax below, assuming a dogs dataset was correctly registered into the workspace:

In Jupyter®:

from foundry.transforms import Dataset, Column

# Only load the "breed" and "age" columns
columns_to_load = ["breed", "age"]
breed_and_age_only = Dataset.get("dogs")\
    .select(*columns_to_load)\
    .read_table(format="pandas")

# Only load the "weight" and "color" columns
weight_and_color_only = Dataset.get("dogs")\
    .select("weight", "color")\
    .read_table(format="pandas")

In RStudio®:

library(dplyr) # should be imported by default in .Rprofile

# Only load the "weight" and "color" columns
weight_and_color_only <- Dataset.get("dogs") %>%
    select(weight, color) %>%
    collect()

Row filters

Tabular datasets can also be loaded into a workspace with a subset of rows that meet certain conditions.

Row filters are applicable only to datasets in Parquet format. Other formats, such as CSV, allow for column filters but not row filters. With the help of a Foundry transform, most tabular datasets can be easily converted to Parquet format.
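
As a hedged sketch of such a conversion using the Python transforms API (the dataset paths are hypothetical, and this code runs in a transforms repository rather than in Code Workspaces):

from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/Project/clean/dogs_parquet"),     # hypothetical output dataset path
    source_df=Input("/Project/raw/dogs_csv"),  # hypothetical CSV-backed input dataset
)
def convert_to_parquet(source_df):
    # Writing the returned DataFrame to the output stores it in Foundry's default Parquet format
    return source_df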

Recall the dogs dataset mentioned earlier:

name    weight    color    age    breed
Bella   60        Brown    4      Labrador
Max     75        Black    7      German Shepherd
Daisy   30        White    2      Poodle
Buddy   65        Yellow   3      Labrador
Gizmo   18        Brown    1      Pug

Row filter syntax in Jupyter®

The syntax below can be used to filter datasets in Jupyter® at the row level.

You may only load brown-colored dogs from the dogs dataset using the following syntax:

from foundry.transforms import Dataset, Column

# Only load dogs of color "Brown"
brown_dogs = Dataset.get("dogs")\
    .where(Column.get("color") == "Brown")\
    .read_table(format="pandas")

Notice the use of .where, .select, or .limit to pre-filter the dataset before it gets loaded into the workspace. These statements can be chained to apply several conditions at once:

# Only load dogs of color "Brown" and of breed "Labrador"
brown_labradors = Dataset.get("dogs")\
    .where(Column.get("color") == "Brown")\
    .where(Column.get("breed") == "Labrador")\
    .read_table(format="pandas")

Below, you can find more examples of acceptable row filtering syntax supported in Jupyter® Code Workspaces:

# only retain rows equal to a certain value
.where(Column.get("column_name") == value)

# only retain rows not equal to a certain value
.where(Column.get("column_name") != value)
# inequality using the ~ operator
.where(~(Column.get("column_name") == value))

# only retain rows whose value is comparable to another value
.where(Column.get("column_name") > value)
.where(Column.get("column_name") >= value)
.where(Column.get("column_name") < value)
.where(Column.get("column_name") <= value)

# OR / AND operators
.where((Column.get("column_name") == value1) | (Column.get("column_name") == value2))
.where((Column.get("column_name1") == value1) & (Column.get("column_name2") == value2))

# only retain rows whose value is not null
.where(~Column.get("column_name").isnull())

# only retain rows whose value is part of a given list
.where(Column.get("column_name").isin([value1, value2, value3]))

# only retain rows whose date is between two given inclusive bounds
.where(Column.get("date_column_name").between('lower_bound_incl', 'upper_bound_incl'))

# only retain the first N rows, where N is a number. This will be applied before other filters
.limit(N)

# select a subset of columns
.select("column_name1", "column_name2", "column_name3")

Row filter syntax in RStudio®

The syntax below can be used to filter datasets in RStudio® at the row level.

RStudio® filters are implemented with the dplyr library and support the standard methods filter, select, and head. These filters are pushed down, which means they are applied before the data gets loaded into the memory of the workspace.

You may load only brown-colored dogs from the dogs dataset using the following syntax:

library(dplyr) # should be imported by default in .Rprofile

# Only load dogs of color "Brown"
brown_dogs <- Dataset.get("dogs") %>%
    foundry::filter(color == "Brown") %>%
    collect()

Notice the use of foundry::filter to pre-filter the dataset before it gets loaded into the workspace. Technically, the foundry:: prefix is not required, but we recommend using it to avoid potential conflicts with similarly named filter functions from other packages in your environment. These filter statements can be chained to apply several conditions at once using the %>% operator from the dplyr library. This library should be imported by default in the .Rprofile file of your RStudio workspace.

library(dplyr) # should be imported by default in .Rprofile

# Only load dogs of color "Brown" and of breed "Labrador"
brown_labradors <- Dataset.get("dogs") %>%
    foundry::filter(color == "Brown") %>%
    foundry::filter(breed == "Labrador") %>%
    collect()

Below, you can find more examples of acceptable row filtering syntax supported in RStudio® Code Workspaces. Column names must be passed to the foundry::filter function without wrapping them with quotation marks.

# only retain rows equal to a certain value
foundry::filter(column_name == "string_value") %>%
foundry::filter(integer_column_name == 4) %>%

# only retain rows not equal to a certain value
foundry::filter(column_name != value) %>%
# inequality using the ! operator
foundry::filter(!(column_name == value)) %>%

# only retain rows whose value is comparable to another value
foundry::filter(column_name > value) %>%
foundry::filter(column_name >= value) %>%
foundry::filter(column_name < value) %>%
foundry::filter(column_name <= value) %>%

# OR / AND operators
foundry::filter(column_name == value1 | column_name == value2) %>%
foundry::filter(column_name == value1 & column_name == value2) %>%

# only retain rows whose value is part of a given list
foundry::filter(column_name %in% c("value1", "value2")) %>%

# only retain rows whose value is not null
foundry::filter(!is.na(column_name)) %>%

# only retain rows whose value is between two given inclusive bounds
foundry::filter(between(age, 2, 4)) %>%

# select a subset of columns
select(column_name1, column_name2) %>% # if set, must include all columns used in `filter` clauses

# only retain the first N rows, where N is a number. This will be applied before other filters
head(N) %>%

Additionally, you may perform advanced data transformations, such as group_by, by temporarily collecting the data as an Arrow table:

library(dplyr) # should be imported by default in .Rprofile

grouped_dogs <- Dataset.get("alias") %>%
    # Simple filters can be pushed down
    foundry::filter(age > 2) %>%
    collect(as_data_frame = FALSE) %>% # temporarily collect the data as an Arrow table
    # Advanced transformations need to be applied on the arrow Table
    group_by(breed) %>%
    collect()

Column and row filters together

Column filters and row filters can be used together in order to load in a dataset that has both a subset of its columns and a subset of its rows. Using the dogs dataset mentioned earlier:

name    weight    color    age    breed
Bella   60        Brown    4      Labrador
Max     75        Black    7      German Shepherd
Daisy   30        White    2      Poodle
Buddy   65        Yellow   3      Labrador
Gizmo   18        Brown    1      Pug

The syntax below can be used to get a dataset with the name, breed, and color of brown dogs that exceed a given weight.

In Jupyter®:

# Only load dogs whose color is "Brown" and whose weight is above 62
# Only load the columns "name", "breed", and "color"
heavy_brown_dogs = Dataset.get("dogs")\
    .where(Column.get("weight") > 62)\
    .where(Column.get("color") == "Brown")\
    .select("name", "breed", "color")\
    .read_table(format="arrow")\
    .to_pandas()

In RStudio®:

library(dplyr) # should be imported by default in .Rprofile

# Only load dogs whose color is "Brown" and whose weight is above 62
# Only load the columns "name", "breed", and "color"
heavy_brown_dogs <- Dataset.get("dogs") %>%
    foundry::filter(weight > 62) %>%
    foundry::filter(color == "Brown") %>%
    select("name", "breed", "color") %>%
    collect()

Write data

To schedule your data transformation, view the data lineage, or write incrementally, you can convert your code to a Jupyter® or RStudio® transform.

You can interactively write Foundry datasets with Code Workspaces by following the steps below.

  1. Create a target output Dataset by opening the Data tab and selecting the Save to dataset option, which can be found to the right of Import Dataset.
  2. Select a name for the output dataset as well as a location to save the dataset.
  3. Select Save.
  4. A new dataset will appear in the Data tab. By default, the Save to dataset option will be selected, which should be left as such for output datasets.
  5. You will also be prompted to specify a dataset alias, which will become the name of the output dataset within the workspace, similarly to how aliases work when importing data.
  • For tabular output datasets, you will also be prompted to specify the dataframe variable which will populate the dataset.
  • For non-tabular datasets, you instead need to specify a local file or folder path to upload to the dataset.
  6. Once the dataset type, the dataset alias, and the dataframe variable are set, select Copy and register dataset to register the dataset in the workspace, which will also save the code snippet to your clipboard.
  7. Paste the code snippet in your workspace, replacing the variable as necessary, and execute the code to write to the output dataset.

Transaction types

When writing back interactively, each SDK function call will correspond to one transaction by default:

  • A SNAPSHOT transaction will be created when writing back tabular data (output_dataset_tabular.write_table(df_variable) in Python or datasets.write_table(df_variable, "output_dataset_tabular") in R).
  • An UPDATE transaction will be created when writing back files (output_dataset_non_tabular.upload_directory(path_to_file_variable) in Python or datasets.upload_files(path_to_file_variable, "output_dataset_non_tabular") in R).

Once the script has been registered as a transform, interactive calls will start writing to a branch prefixed by code-workspace-sandbox/, while the current branch will be updated when the transform runs. In this case, a single transaction will be created for the full script execution, even if there are multiple SDK function calls:

  • By default, the transaction will be of type SNAPSHOT.
  • If incremental settings have been configured, the transaction will be of type APPEND.

Example code snippets

Following the instructions above, assume that two datasets named output_dataset_tabular and output_dataset_non_tabular were created with variables of the same name, and registered in the workspace. Code Workspaces will generate the following code snippets for each dataset based on your chosen variables:

# tabular snippet
from foundry.transforms import Dataset

output_dataset_tabular = Dataset.get("output_dataset_tabular")
output_dataset_tabular.write_table(df_variable)

# non-tabular snippet
from foundry.transforms import Dataset

output_dataset_non_tabular = Dataset.get("output_dataset_non_tabular")
output_dataset_non_tabular.upload_directory(path_to_file_variable)

And in R:

# tabular snippet
datasets.write_table(df_variable, "output_dataset_tabular")

# non-tabular snippet
datasets.upload_files(path_to_file_variable, "output_dataset_non_tabular")