Class TransformsExcelParser

java.lang.Object
com.palantir.transforms.excel.TransformsExcelParser
All Implemented Interfaces:
Serializable

@Immutable public abstract class TransformsExcelParser extends Object implements Serializable
A class for extracting data from a dataset of Excel files and returning a ParseResult including error details, decryption success/failure details, and one or more dataframes.

A user of this library will generally construct exactly one instance of this class as part of their transform code.

  • Constructor Details

    • TransformsExcelParser

      public TransformsExcelParser()
  • Method Details

    • keyToParser

      protected abstract Map<String,Parser> keyToParser()
      A mapping between keys (arbitrary strings) and their associated Parser. When there is exactly one Parser, the key does not matter, because you can retrieve the result of parsing from the ParseResult.singleResult() method without considering the key.
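
      The relationship between keys and results can be pictured with a minimal sketch. KeyedResultsSketch and its methods are hypothetical stand-ins, not the library's ParseResult, and the choice to throw when there is more than one entry is an assumption of this sketch rather than documented behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a keyed collection of parse outputs.
// It is NOT the library's ParseResult; strings stand in for dataframes.
final class KeyedResultsSketch {
    private final Map<String, String> resultsByKey;

    KeyedResultsSketch(Map<String, String> resultsByKey) {
        this.resultsByKey = new LinkedHashMap<>(resultsByKey);
    }

    // Look up the output produced by the parser registered under the given key.
    String resultForKey(String key) {
        String result = resultsByKey.get(key);
        if (result == null) {
            throw new IllegalArgumentException("No result for key: " + key);
        }
        return result;
    }

    // With exactly one entry the key is irrelevant, so no key is needed.
    // Throwing for any other size is an assumption made by this sketch.
    String singleResult() {
        if (resultsByKey.size() != 1) {
            throw new IllegalStateException(
                    "Expected exactly one result but found " + resultsByKey.size());
        }
        return resultsByKey.values().iterator().next();
    }
}
```

      With a single entry, singleResult() returns the same value as resultForKey() called with whatever key was chosen, which is why the key can be any string in that case.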
    • passwordProvider

      protected abstract Optional<PasswordProvider> passwordProvider()
      A function to provide a set of passwords to try, given a workbook.
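
      The idea of trying a set of candidate passwords can be illustrated with a small sketch. The PasswordProvider signature is not shown on this page, so PasswordTrialSketch, firstWorkingPassword, and the opensWorkbook predicate below are all hypothetical and only model the general "try each candidate until one works" pattern.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical illustration of trying candidate passwords in order.
// It does not use or reproduce the library's PasswordProvider type.
final class PasswordTrialSketch {

    // Returns the first candidate for which opensWorkbook reports success,
    // or an empty Optional when none of the candidates works.
    static Optional<String> firstWorkingPassword(
            List<String> candidates, Predicate<String> opensWorkbook) {
        return candidates.stream().filter(opensWorkbook).findFirst();
    }
}
```

      Returning an empty Optional when no candidate works keeps the failure explicit, so a caller can record it instead of aborting.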
    • maxByteArraySize

      @Default protected Integer maxByteArraySize()
      The maximum byte array size allowed when opening a file. It must be set to a large value in order to open large Excel files.

      The default value used by this library is Integer.MAX_VALUE, and consumers should almost never change that, because failing to process large files is not desirable in most pipelines.

    • minInflateRatio

      @Default protected Double minInflateRatio()
      A setting that controls the lowest acceptable ratio of compressed size to uncompressed size when attempting to open an xlsx or xlsm file (these file types are actually zip archives).

      Apache POI uses this parameter to detect zip bombs (malicious files that expand to many times their compressed size). In practice, Excel files with a high compression ratio are rarely zip bombs, so this library sets the default to an arbitrarily low value (0.000000000000001) instead of the usual Apache POI default of 0.01 (a 100x compression ratio).
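
      The arithmetic behind the threshold can be sketched in plain Java. InflateRatioSketch and its methods are hypothetical names, and the check is a simplification: it assumes the ratio is compressed bytes divided by uncompressed bytes and that a file is accepted when the ratio is at least the configured minimum. It does not call Apache POI or reproduce its exact logic.

```java
// Simplified, hypothetical model of a minimum inflate ratio check.
final class InflateRatioSketch {
    // Usual Apache POI default quoted above (a 100x compression ratio).
    static final double POI_DEFAULT_MIN_RATIO = 0.01;
    // Default quoted above for this library.
    static final double LIBRARY_DEFAULT_MIN_RATIO = 0.000000000000001;

    // Ratio of compressed size to uncompressed size; smaller means more compression.
    static double inflateRatio(long compressedBytes, long uncompressedBytes) {
        if (uncompressedBytes <= 0) {
            throw new IllegalArgumentException("uncompressedBytes must be positive");
        }
        return (double) compressedBytes / (double) uncompressedBytes;
    }

    // A file passes when its ratio is not below the configured minimum.
    static boolean isAccepted(long compressedBytes, long uncompressedBytes, double minInflateRatio) {
        return inflateRatio(compressedBytes, uncompressedBytes) >= minInflateRatio;
    }
}
```

      Under these assumptions, a 1 MB archive that expands to 500 MB has a ratio of 0.002, so it is rejected at a minimum of 0.01 but accepted at the much lower default described above.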

    • includeFileModifiedTimestamp

      @Default protected Boolean includeFileModifiedTimestamp()
      A default-true setting that controls whether a _file_modified_timestamp column should be included in output dataframes.

      This column is useful for cases where input files can be changed and processing downstream should be incremental. The combination of _file_path and _file_modified_timestamp can then be used as a key to identify records from a unique instance of a file at a moment in time.
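
      The composite key described above can be sketched without Spark. FileVersionKeySketch and its methods are hypothetical, and representing the timestamp as epoch milliseconds is an assumption of this sketch, since the column's actual type is not stated here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of keying records by (file path, modified timestamp).
final class FileVersionKeySketch {

    // Builds the composite key; the timestamp is assumed to be epoch milliseconds.
    static Map.Entry<String, Long> versionKey(String filePath, long modifiedMillis) {
        return Map.entry(filePath, Long.valueOf(modifiedMillis));
    }

    // Keeps only the file versions whose key has not been processed before,
    // preserving the order of the incoming list.
    static List<Map.Entry<String, Long>> unprocessed(
            List<Map.Entry<String, Long>> incoming,
            Set<Map.Entry<String, Long>> alreadyProcessed) {
        List<Map.Entry<String, Long>> fresh = new ArrayList<>();
        for (Map.Entry<String, Long> key : incoming) {
            if (!alreadyProcessed.contains(key)) {
                fresh.add(key);
            }
        }
        return fresh;
    }
}
```

      Because the timestamp is part of the key, a file that is replaced at the same path counts as a new version and is picked up again, while an unchanged file is skipped.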

    • check

      @Check protected final void check()
    • of

      public static TransformsExcelParser of(Parser parser)
      Create a TransformsExcelParser with default configuration from a single Parser. This is a convenience method for when there is only one output and configuration options do not need to be customized. Otherwise, use builder().
    • of

      public static TransformsExcelParser of(Parser parser, PasswordProvider passwordProvider)
      Create a TransformsExcelParser with default configuration from a single Parser and PasswordProvider. This is a convenience method for when there is only one output and configuration options do not need to be customized. Otherwise, use builder().
    • builder

      public static TransformsExcelParser.Builder builder()
      Create a builder for constructing an instance of TransformsExcelParser with customized settings and/or multiple outputs.
    • parse

      public final ParseResult parse(org.apache.spark.sql.Dataset<com.palantir.spark.binarystream.data.PortableFile> files)
      Process the input dataset and return a ParseResult.

      Because this method takes a Dataset<PortableFile> and not a FoundryInput as input, it is the responsibility of the consumer to implement incremental processing as appropriate (this method is agnostic with respect to whether it is called within an incremental or a snapshot pipeline).