
foundry.transforms.Dataset

class foundry.transforms.Dataset(alias)

A class representing the files backing a Foundry dataset view.

Prefer using the static Dataset.get() factory method instead of calling the constructor directly.

static method get(alias)

Create a new Dataset instance for the given alias.

  • Parameters: alias (str) – The alias of the dataset.
  • Returns: A new Dataset instance.
  • Return type: Dataset
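
A minimal usage sketch (the alias "my_dataset" is a placeholder for a dataset alias registered in your environment):

from foundry.transforms import Dataset

ds = Dataset.get("my_dataset")
ds.alias  # "my_dataset"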

property alias

The alias of the dataset.

property schema

The Foundry field schema of the dataset.

  • Type: FoundryFieldSchema

property write_table_path

The path on disk for the dataset files to be used with write_table.

property lazy_write_table_path

An object store path to a bucket that will be mapped into the output transaction.

read_table(columns=None, row_limit=None, format='dataframe', mode='current', force_dataset_download=False, schema=None)

Read a tabular Foundry dataset as a pandas DataFrame, Polars DataFrame, Arrow Table, or raw file path.

  • Parameters:
    • columns (List[str], optional) – The subset of columns to read.
    • row_limit (int, optional) – The maximum number of rows to read.
    • format (str, optional) – The output type. One of "arrow", "pandas", "dataframe" (alias for pandas, default), "polars", "lazy-polars", or "path". When set to "path", a path pointing to the raw dataset files is returned.
    • mode (str, optional) – The read mode, one of "current", "previous", or "added". Defaults to "current".
    • force_dataset_download (bool, optional) – Whether the dataset must be re-downloaded even if present in local content. Defaults to False.
    • schema (FoundryFieldSchema, optional) – The schema to apply if reading an empty incremental output.
  • Returns: The dataset contents in the requested format.
  • Return type: pyarrow.Table | pandas.DataFrame | polars.DataFrame | polars.LazyFrame | str

When columns or row_limit are set, or when filters have been applied via the where() method, the output format must be one of "arrow", "dataframe", "pandas", or "polars", and mode must be "current".
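
A sketch of the common read paths (the alias "my_dataset" and the column names are placeholders):

from foundry.transforms import Dataset

ds = Dataset.get("my_dataset")

# Read the full dataset as a pandas DataFrame (the default format).
df = ds.read_table()

# Read a column subset with a row cap, returned as an Arrow table.
table = ds.read_table(columns=["id", "age"], row_limit=1000, format="arrow")

# Return a local path to the raw dataset files instead of loading them.
path = ds.read_table(format="path")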

write_table(df, column_descriptions=None)

Upload tabular data to a Foundry dataset. This uploads the data, infers a schema, and updates column description metadata.

Accepts a pandas DataFrame, Arrow Table, Polars DataFrame, DuckDB PyRelation, or a path (string or pathlib.Path) pointing to a raw dataset.

  • Parameters:
    • df – The data to upload. Accepts pandas.DataFrame, pyarrow.Table, polars.DataFrame, DuckDB PyRelation, or a path matching write_table_path.
    • column_descriptions (Dict[str, str], optional) – Map of column names to their string descriptions. This map is intersected with the columns of the DataFrame, and each description must be no longer than 800 characters.
  • Returns: None
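
A minimal write sketch, assuming a pandas DataFrame and a placeholder output alias "my_output":

import pandas as pd

from foundry.transforms import Dataset

df = pd.DataFrame({"id": [1, 2], "score": [0.4, 0.9]})
Dataset.get("my_output").write_table(
    df,
    column_descriptions={"score": "Model confidence between 0 and 1"},
)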

put_metadata(column_descriptions=None)

Finalize a dataset after uploading raw Parquet files. This infers a Foundry schema from the uploaded Parquet and updates column description metadata on the dataset.

You must call this method after one or more Parquet files have been uploaded to the output dataset so that a schema can be inferred. The method raises an error if it is called before a successful file upload.

  • Parameters: column_descriptions (Dict[str, str], optional) – Map of column names to their string descriptions. This map is intersected with the columns of the dataset, and each description must be no longer than 800 characters.
  • Returns: None
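
A sketch of the raw-Parquet flow this method finalizes (the alias and file name are placeholders):

from foundry.transforms import Dataset

out = Dataset.get("my_output")
out.upload_file("part-0000.parquet")  # upload raw Parquet first
out.put_metadata(column_descriptions={"id": "Primary key"})  # then infer the schema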

set_write_mode(mode)

Set the write mode of the dataset.

  • Parameters: mode (str) – The write mode, one of "replace", "modify", or "append". In replace mode, anything written replaces the dataset. In modify mode, anything written is appended to the dataset and may also overwrite existing files. In append mode, anything written is appended to the dataset and will not overwrite existing files.
  • Returns: None

The write mode cannot be changed after data has been written.
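
A sketch of an append-style write; note that the mode is set before any data is written (the alias and data are placeholders):

import pandas as pd

from foundry.transforms import Dataset

out = Dataset.get("my_output")
out.set_write_mode("modify")  # must be called before the first write
out.write_table(pd.DataFrame({"id": [3]}))  # appended to the existing dataset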

files(mode='current', show_hidden_files=False)

List files in a Foundry dataset.

  • Parameters:
    • mode (str, optional) – The read mode, one of "current", "previous", or "added". Defaults to "current".
    • show_hidden_files (bool, optional) – Whether to list hidden files. Defaults to False.
  • Returns: The collection of files in the dataset.
  • Return type: FileCollection
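
A minimal sketch; how individual entries of the returned FileCollection are consumed is an assumption here, shown as plain iteration:

from foundry.transforms import Dataset

files = Dataset.get("my_dataset").files()  # FileCollection for the current view
for f in files:  # assumed: FileCollection is iterable over its file entries
    print(f)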

upload_file(path, logical_path=None)

Upload a local file to a Foundry dataset.

  • Parameters:
    • path (str) – The path to the local file to upload.
    • logical_path (str, optional) – The destination path in the Foundry dataset. If not provided, the file is uploaded to the root with the same name as the local file.
  • Returns: The name of the uploaded Foundry dataset file.
  • Return type: str
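
A sketch of a single-file upload with placeholder paths:

from foundry.transforms import Dataset

out = Dataset.get("my_output")
# Upload ./report.csv to the dataset root as "report.csv".
name = out.upload_file("report.csv")
# Upload the same file to a nested logical path inside the dataset.
name = out.upload_file("report.csv", logical_path="exports/2024/report.csv")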

upload_directory(local_dir_path)

Upload a local directory to a Foundry dataset. All files found recursively inside the directory will be uploaded.

  • Parameters: local_dir_path (str) – The path to the local directory to upload.
  • Returns: A map of local file paths to the corresponding Foundry dataset file paths.
  • Return type: Dict[str, str]
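
A sketch using a placeholder local directory "./exports":

from foundry.transforms import Dataset

mapping = Dataset.get("my_output").upload_directory("./exports")
for local_path, dataset_path in mapping.items():
    print(local_path, "->", dataset_path)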

where(operand_filter)

Apply a row filter to the dataset. Returns the dataset so that calls can be chained. Filters are applied when read_table is called.

  • Parameters: operand_filter – A filter expression built using Column.get().
  • Returns: The filtered dataset.
  • Return type: Dataset

Supported operators on Column:

  • ==, !=, >, >=, <, <=
  • .isnull()
  • .isin(values)
  • .between(lower, upper)

Combine filters with & (and), | (or), and ~ (not).

from foundry.transforms import Dataset
from foundry.transforms import Column

ds = Dataset.get("my_dataset")
filtered = ds.where(Column.get("age") > 18)
result = filtered.read_table(format="pandas")
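
A further sketch combining filters with the operators above (the alias and column names are placeholders):

from foundry.transforms import Column, Dataset

adults_with_name = Dataset.get("my_dataset").where(
    (Column.get("age") >= 18) & ~Column.get("name").isnull()
)
result = adults_with_name.read_table(format="pandas")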

select(*column_names)

Select a subset of columns from the dataset. Returns the dataset so that calls can be chained.

  • Parameters: column_names (str) – The names of the columns to select.
  • Returns: The dataset with the column selection applied.
  • Return type: Dataset

limit(row_limit)

Set the maximum number of rows to read. Returns the dataset so that calls can be chained.

  • Parameters: row_limit (int) – The maximum number of rows.
  • Returns: The dataset with the row limit applied.
  • Return type: Dataset
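
Because where(), select(), and limit() each return the dataset, they chain naturally into a single read; a sketch with placeholder names:

from foundry.transforms import Column, Dataset

result = (
    Dataset.get("my_dataset")
    .where(Column.get("age").between(18, 65))
    .select("id", "age")
    .limit(100)
    .read_table(format="polars")
)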

abort()

Abort all work on this dataset. Any data written before or after calling this method will be ignored.

  • Returns: The aborted dataset.
  • Return type: Dataset
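
A sketch of abandoning an output after a failed step (the alias and the failure are placeholders):

import pandas as pd

from foundry.transforms import Dataset

out = Dataset.get("my_output")
try:
    out.write_table(pd.DataFrame({"id": [1]}))
    raise RuntimeError("downstream validation failed")  # placeholder failure
except RuntimeError:
    out.abort()  # everything written to this dataset is discarded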