Package com.palantir.transforms.excel
A Java transforms library for efficiently converting from a Dataset<PortableFile> containing Excel files to one or more instances of Dataset<Row> that can be written to multiple output datasets in a single transform.

It supports extraction from "table-like" sections of sheets and from "form-style" sheets, and it is fully pluggable, so you can add your own implementation of "get some data from Excel" and benefit from the multiple-output plumbing.
High-level flow of a transform that uses this library
- Create an appropriately configured instance of TransformsExcelParser, in one of two ways:
  - With a single Parser, using the TransformsExcelParser.of(com.palantir.transforms.excel.Parser) static constructor, if you only care about extracting one Dataset<Row> and writing it to a single output.
  - With multiple Parsers, each associated with a "key" (an arbitrary string), by creating a builder with TransformsExcelParser.builder() and using the builder's putKeyToParser method.
- Call the TransformsExcelParser.parse(org.apache.spark.sql.Dataset) method with the input Dataset<PortableFile> and receive a ParseResult.
- If you only passed a single parser, use the ParseResult.singleResult() method to obtain the output dataframe. Otherwise, retrieve the output dataframes by key using the ParseResult.dataframeForKey(java.lang.String) method.
- If you care about errors, such as files not being extractable due to being malformed or password-protected, you can get those errors from the ParseResult.errorDataframe() method. This dataframe has three columns: file_path, parser_key (null if the error isn't parser-specific), and error (a string representing the error). Note that this errorDataframe is the only way to identify such errors, so a common pattern for pipelines where fail-fast behavior is desired is to raise an exception if the errorDataframe is not empty, as in the sketch after this list.
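A minimal sketch of that fail-fast check, assuming result is the ParseResult obtained from the parse call in the previous step:

Dataset<Row> errors = result.errorDataframe();
// Fail fast if any input file could not be parsed: the error dataframe is the only
// place these failures are surfaced, so ignoring it would silently drop data.
if (errors.count() > 0) {
    throw new IllegalStateException("Excel parse errors: " + errors.collectAsList());
}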
Minimal example (non-incremental)
import com.palantir.transforms.excel.ParseResult;
import com.palantir.transforms.excel.Parser;
import com.palantir.transforms.excel.TransformsExcelParser;
import com.palantir.transforms.excel.table.TableParser;
import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import com.palantir.transforms.lang.java.api.TransformProfiles;
import java.util.Optional;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class SimpleTabularExcel {
    @Compute
    @TransformProfiles({"EXECUTOR_MEMORY_MEDIUM"})
    public void myComputeFunction(
            @Input("/path/to/input") FoundryInput myInput,
            @Output("/path/to/output") FoundryOutput myOutput) {
        Parser tableParser = TableParser.builder().build();
        TransformsExcelParser transformsParser = TransformsExcelParser.of(tableParser);
        ParseResult result = transformsParser.parse(myInput.asFiles().getFileSystem().filesAsDataset());
        Optional<Dataset<Row>> df = result.singleResult();
        myOutput.getDataFrameWriter(df.get()).write();
    }
}
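The example above unwraps the Optional with get(); a slightly more defensive variant (just a sketch, not required by the library) fails with a clearer message when no data was extracted:

// Avoid Optional.get()'s NoSuchElementException in favor of a descriptive error.
Dataset<Row> df = result.singleResult()
        .orElseThrow(() -> new IllegalStateException("No data extracted from input Excel files"));
myOutput.getDataFrameWriter(df).write();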
Incremental example
import com.palantir.transforms.excel.ParseResult;
import com.palantir.transforms.excel.Parser;
import com.palantir.transforms.excel.TransformsExcelParser;
import com.palantir.transforms.excel.functions.RegexSubstringMatchingSheetSelector;
import com.palantir.transforms.excel.table.CaseNormalizationOption;
import com.palantir.transforms.excel.table.SimpleHeaderExtractor;
import com.palantir.transforms.excel.table.TableParser;
import com.palantir.transforms.excel.utils.IncrementalUtils;
import com.palantir.transforms.lang.java.api.*;

public final class IncrementalTransform {
    @Compute
    public void myComputeFunction(
            @Input("/path/to/input") FoundryInput myInput,
            @Output("/path/to/sheet1_output") FoundryOutput sheet1Output,
            @Output("/path/to/sheet2_output") FoundryOutput sheet2Output) {
        // Specifying either CONVERT_TO_LOWERCASE or CONVERT_TO_UPPERCASE for a
        // CaseNormalizationOption is especially important with incremental processing
        // to avoid subtle issues due to inconsistent casing between input files across
        // incremental batches.
        Parser sheet1Parser = TableParser.builder()
                .headerExtractor(SimpleHeaderExtractor.builder()
                        .caseNormalizationOption(CaseNormalizationOption.CONVERT_TO_LOWERCASE)
                        .build())
                .sheetSelector(new RegexSubstringMatchingSheetSelector("Sheet1"))
                .build();
        Parser sheet2Parser = TableParser.builder()
                .headerExtractor(SimpleHeaderExtractor.builder()
                        .caseNormalizationOption(CaseNormalizationOption.CONVERT_TO_LOWERCASE)
                        .build())
                .sheetSelector(new RegexSubstringMatchingSheetSelector("Sheet2"))
                .build();
        TransformsExcelParser transformsParser = TransformsExcelParser.builder()
                .putKeyToParser("Sheet1", sheet1Parser)
                .putKeyToParser("Sheet2", sheet2Parser)
                .build();
        FoundryFiles foundryFiles = myInput.asFiles();
        ParseResult result =
                transformsParser.parse(foundryFiles.getFileSystem(ReadRange.UNPROCESSED).filesAsDataset());
        // With incremental processing in particular, you often want fail-fast
        // behavior as below, instead of writing the error dataframe to a separate output and
        // checking it asynchronously. If fail-fast is not an option and you
        // adopt the "write error dataframe to separate output" approach, note
        // that you will need to either ① re-upload files that had parse errors to the input dataset
        // or ② force a snapshot build of this transform via a manual dummy transaction
        // on one of the output datasets in order to trigger the reprocessing of the
        // files that had parse errors.
        if (result.errorDataframe().count() > 0) {
            throw new RuntimeException("Errors: " + result.errorDataframe().collectAsList().toString());
        }
        FilesModificationType filesModificationType = foundryFiles.modificationType();
        // The below logic assumes that it is normal and expected for a subset of the
        // parsers to find no data in a given incremental batch of files. If that is not
        // the case, you may want to raise an exception if a subset of the dataframes is
        // absent and there were a non-zero number of unprocessed files in the input.
        // An absent result does not necessarily mean that an error will be present
        // in the error dataframe (for example, a SheetSelector returning an empty
        // collection of sheets is not considered an error).
        result.dataframeForKey("Sheet1").ifPresent(
                dataframe -> IncrementalUtils.writeAppendingIfPossible(filesModificationType, dataframe, sheet1Output));
        result.dataframeForKey("Sheet2").ifPresent(
                dataframe -> IncrementalUtils.writeAppendingIfPossible(filesModificationType, dataframe, sheet2Output));
    }
}
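If fail-fast is not an option, a sketch of the alternative described in the comments above, assuming a hypothetical extra @Output("/path/to/error_output") FoundryOutput errorOutput parameter on the compute method:

// Persist parse errors to their own output for asynchronous inspection instead of
// failing the build. Reprocessing the affected files still requires re-uploading them
// or forcing a snapshot build, as noted in the comments above.
errorOutput.getDataFrameWriter(result.errorDataframe()).write();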
Miscellaneous notes
- xls, xlsx, and xlsm files are supported. xlsb is not currently supported.
- This library uses Apache POI under the hood (and deliberately exposes some Apache POI classes as part of its API to facilitate customization of behavior as required).
- Because Apache POI is a memory-hungry library, it is often necessary to use at least the EXECUTOR_MEMORY_MEDIUM Spark profile, even for relatively small Excel files on the order of 10 MB. For large files on the order of 120 MB, it can be necessary to use the combination of EXECUTOR_MEMORY_LARGE and EXECUTOR_CORES_EXTRA_SMALL to grant the absolute maximum possible amount of memory to each Spark task (see the sketch after this list).
- Insufficient memory when trying to process a large Excel file can result in a transform hanging indefinitely without succeeding or failing. If a transform has been running for an excessive amount of time (on the order of 10 minutes or longer) with no progress in completed tasks, you should usually cancel the build and try again with a higher-memory or lower-core profile. Taking on the order of 8 minutes to process an Excel file on the order of 120 MB is expected, even when memory is sufficient.
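A sketch of that high-memory configuration for very large files, using the profile names from the note above (the method body is elided and otherwise matches the minimal example):

// Combine a large executor-memory profile with an extra-small executor-core profile
// to grant as much memory as possible to each Spark task.
@Compute
@TransformProfiles({"EXECUTOR_MEMORY_LARGE", "EXECUTOR_CORES_EXTRA_SMALL"})
public void myComputeFunction(
        @Input("/path/to/input") FoundryInput myInput,
        @Output("/path/to/output") FoundryOutput myOutput) {
    // ... same parsing logic as in the minimal example above ...
}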
Class summary
- Parser: a high-level abstraction over the idea of taking a WorkbookWithMetadata and returning some data.
- ParseResult: the return value from TransformsExcelParser.parse(Dataset).
- TransformsExcelParser: a class for extracting data from a dataset of Excel files and returning a ParseResult including error details, decryption success/failure details, and one or more dataframes.
- The TransformsExcelParser builder (obtained via TransformsExcelParser.builder()): a builder for constructing an instance of TransformsExcelParser with customized settings and/or multiple outputs.
- WorkbookWithMetadata: a Workbook object (the result of loading an Excel file from a dataset into Apache POI) along with information about the file in the dataset, such as its path.
- A function to provide a set of passwords to use to try to decrypt a file, given the file path.
- Two classes that are implementation details and are only public because being public is a requirement for Kryo serialization.