Class TransformsExcelParser

- All Implemented Interfaces: Serializable

Parses an input dataset of Excel files and produces a ParseResult, including error details, decryption success/failure details, and one or more dataframes.

A user of this library will generally construct exactly one instance of this class as part of their transform code.
Nested Class Summary

- static final class: A builder for constructing an instance of TransformsExcelParser with customized settings and/or multiple outputs.
Constructor Summary

- TransformsExcelParser()
Method Summary

- builder(): Create a builder for constructing an instance of TransformsExcelParser with customized settings and/or multiple outputs.
- protected final void check()
- protected Boolean includeFileModifiedTimestamp(): A default-true setting that controls whether a _file_modified_timestamp column should be included in output dataframes.
- keyToParser(): A mapping between keys (arbitrary strings) and their associated Parser.
- protected Integer maxByteArraySize(): A setting that needs to be set to a large value in order to open large Excel files.
- protected Double minInflateRatio(): A setting that controls the lowest acceptable ratio of compressed to uncompressed size when attempting to open an xlsx or xlsm file (these file types are actually zip archives).
- static TransformsExcelParser of(Parser parser): Create a TransformsExcelParser with default configuration from a single Parser.
- static TransformsExcelParser of(Parser parser, PasswordProvider passwordProvider): Create a TransformsExcelParser with default configuration from a single Parser and PasswordProvider.
- final ParseResult parse(org.apache.spark.sql.Dataset&lt;com.palantir.spark.binarystream.data.PortableFile&gt; files): Process the input dataset and return a ParseResult.
- protected abstract Optional&lt;PasswordProvider&gt; passwordProvider(): A function to provide a set of passwords to try, given a workbook.
Constructor Details

TransformsExcelParser
public TransformsExcelParser()
Method Details
keyToParser
A mapping between keys (arbitrary strings) and their associated Parser. When there is exactly one Parser, the key does not matter, because you can retrieve the result of parsing from the ParseResult.singleResult() method without considering the key.
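The single-result shortcut can be pictured with a plain map: when the mapping holds exactly one entry, the value can be retrieved without naming its key. The sketch below is illustrative only; singleResult here is a hypothetical stand-in for ParseResult.singleResult(), not this library's implementation.

```java
import java.util.Map;
import java.util.Optional;

public class SingleResultSketch {

    // Hypothetical analogue of ParseResult.singleResult(): defined only
    // when exactly one keyed result exists.
    static Optional<String> singleResult(Map<String, String> resultsByKey) {
        if (resultsByKey.size() == 1) {
            return Optional.of(resultsByKey.values().iterator().next());
        }
        return Optional.empty(); // ambiguous: look results up by key instead
    }

    public static void main(String[] args) {
        System.out.println(singleResult(Map.of("sheet1", "dataframe-A")).orElse("none")); // dataframe-A
        System.out.println(singleResult(Map.of("a", "x", "b", "y")).orElse("none"));      // none
    }
}
```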
passwordProvider
A function to provide a set of passwords to try, given a workbook.
maxByteArraySize
A setting that needs to be set to a large value in order to open large Excel files. The default value used by this library is Integer.MAX_VALUE, and consumers should almost never change it, because failing to process large files is not desirable in most pipelines.
minInflateRatio
A setting that controls the lowest acceptable ratio of compressed to uncompressed size when attempting to open an xlsx or xlsm file (these file types are actually zip archives). This parameter is used in Apache POI to detect zip bombs (malicious files that can be much larger uncompressed than they appear compressed). In practice, Excel files with a high compression ratio are rarely actual zip bombs, so this library defaults to an arbitrarily low value (0.000000000000001) instead of the usual Apache POI default of 0.01 (a 100x compression ratio).
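The ratio check is plain arithmetic, and a small sketch shows why a heavily compressed but legitimate workbook can fail the usual 0.01 threshold yet pass the relaxed default. passesCheck below is a hypothetical helper mirroring the rule Apache POI applies; it is not part of this API.

```java
public class InflateRatioSketch {

    // Sketch of the zip-bomb heuristic: an archive entry is acceptable when
    // compressedSize / uncompressedSize >= minInflateRatio.
    static boolean passesCheck(long compressedBytes, long uncompressedBytes, double minInflateRatio) {
        return (double) compressedBytes / (double) uncompressedBytes >= minInflateRatio;
    }

    public static void main(String[] args) {
        // A 1 MB entry that inflates to 500 MB has a ratio of 0.002.
        System.out.println(passesCheck(1_000_000, 500_000_000, 0.01));  // false: rejected under POI's usual default
        System.out.println(passesCheck(1_000_000, 500_000_000, 1e-15)); // true: accepted under this library's default
    }
}
```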
includeFileModifiedTimestamp
A default-true setting that controls whether a _file_modified_timestamp column should be included in output dataframes. This column is useful for cases where input files can change and downstream processing should be incremental. The combination of _file_path and _file_modified_timestamp can then be used as a key to identify records from a unique instance of a file at a moment in time.
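To make the composite-key idea concrete, the sketch below models a file version as a (path, modified timestamp) pair and deduplicates on it. FileVersion and distinctVersions are hypothetical illustrations; in a real pipeline the key would be the _file_path and _file_modified_timestamp columns of the output dataframe.

```java
import java.util.LinkedHashSet;
import java.util.List;

public class FileVersionKeySketch {

    // Hypothetical composite key mirroring the _file_path and
    // _file_modified_timestamp output columns.
    record FileVersion(String filePath, long fileModifiedTimestamp) {}

    // Records compare by value, so a set collapses rows that came from the
    // same version of the same file.
    static int distinctVersions(List<FileVersion> rows) {
        return new LinkedHashSet<>(rows).size();
    }

    public static void main(String[] args) {
        List<FileVersion> rows = List.of(
                new FileVersion("/data/a.xlsx", 1700000000L),  // original upload
                new FileVersion("/data/a.xlsx", 1700000000L),  // reprocessed, unchanged file
                new FileVersion("/data/a.xlsx", 1700009999L)); // same path, file was modified
        System.out.println(distinctVersions(rows)); // 2
    }
}
```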
check
@Check protected final void check()
of
Create a TransformsExcelParser with default configuration from a single Parser.
of
Create a TransformsExcelParser with default configuration from a single Parser and PasswordProvider. This is a convenience method for when there is only one output and configuration options do not need to be customized. Otherwise, use builder().
builder
Create a builder for constructing an instance of TransformsExcelParser with customized settings and/or multiple outputs.
parse
public final ParseResult parse(org.apache.spark.sql.Dataset&lt;com.palantir.spark.binarystream.data.PortableFile&gt; files)
Process the input dataset and return a ParseResult. Because this method takes a Dataset&lt;PortableFile&gt; and not a FoundryInput as input, it is the responsibility of the consumer to implement incremental processing as appropriate (this method is agnostic with respect to whether it is called within an incremental or a snapshot pipeline).
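Putting the pieces together, a minimal consumer might look like the sketch below. It uses only the methods documented above (of, parse, and ParseResult.singleResult()); myParser and files stand in for a consumer-constructed Parser and the input Dataset&lt;PortableFile&gt;, and the surrounding transform plumbing is omitted, so treat this as pseudocode rather than a compilable unit.

```java
// Sketch only: myParser and files are placeholders supplied by the consumer.
TransformsExcelParser excelParser = TransformsExcelParser.of(myParser);

// parse() is agnostic to incremental vs. snapshot semantics; handle that
// in the surrounding transform.
ParseResult result = excelParser.parse(files);

// With a single Parser configured, the result can be fetched without a key.
var dataframe = result.singleResult();
```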