Extract rows from an excel file

Supported in: Batch

Reads a dataset of excel files and parses each file into rows. Supported file formats: .xls, .xlt, .xltm, .xltx, .xlsx, .xlsm.

Transform categories: File

Declared arguments

  • Dataset - Dataset of files to process.
    Files
  • Rows to skip - Number of rows to skip at the start of each sheet. If you do not use the "Treat first row (after skipping) as header" option and a header is present, this value should include the header row.
    Literal<Integer>
  • Schema - An ordered list of column names corresponding to data in the sheets that match the specified pattern.
    List<Literal<String>>
  • Source sheet pattern - Data will be extracted from all sheets whose names contain a substring matching this regular expression. If you specify the empty string (the default value for this parameter), data will be extracted from all sheets. To do a full-string match instead of a substring match, you can add ^ at the start and $ at the end of the string.
    Literal<String>
  • optional Output column to put the file path - If present, an output column will be created with this column name, containing the path of the parsed file.
    Literal<String>
  • optional Output column to put the sheet name - If present, an output column will be created with this column name, containing the sheet name of the parsed row.
    Literal<String>
  • optional Treat first row (after skipping) as header - If true, the first row in each sheet after skipping will be treated as a header, and the order of the fields in that header will be used to determine the mapping between the columns in the sheet and the columns in the schema. The order of the fields can differ between sheets (and files), and it is neither necessary for all fields specified in the schema to be present in all sheets nor for all fields present in the Excel sheets to be included in the schema. The strings present in the cell values of the header row will be mapped to the schema column names case-insensitively after applying the following sanitization steps:
  1. If present, remove any string-initial sequence of any combination of the characters (),;{}\n\t= (the first character in this list is the ASCII space).
  2. Replace all remaining instances of (),;{}\n\t= with underscores.
  3. Replace all consecutive underscores with one underscore.
  4. If the string now ends with an underscore, remove that underscore.
    Literal<Boolean>