Package com.palantir.transforms.excel

A Java transforms library for efficiently converting a Dataset<PortableFile> containing Excel files into one or more Dataset<Row> instances that can be written to multiple output datasets in a single transform. It supports extraction from "table-like" sections of sheets and from "form-style" sheets, and it is fully pluggable, so you can add your own implementation of "get some data from Excel" and still benefit from the multiple-output plumbing.

High-level flow of a transform that uses this library
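
At a high level, a transform built on this library reads its input dataset as files, passes the resulting Dataset<PortableFile> of Excel files to a TransformsExcelParser configured with one or more Parser instances, receives back a ParseResult containing one dataframe per parser key plus an error dataframe describing files that failed to parse, and writes each dataframe to its corresponding output dataset.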

Minimal example (non-incremental)

 import com.palantir.transforms.excel.ParseResult;
 import com.palantir.transforms.excel.Parser;
 import com.palantir.transforms.excel.TransformsExcelParser;
 import com.palantir.transforms.excel.table.TableParser;
 import com.palantir.transforms.lang.java.api.Compute;
 import com.palantir.transforms.lang.java.api.FoundryInput;
 import com.palantir.transforms.lang.java.api.FoundryOutput;
 import com.palantir.transforms.lang.java.api.Input;
 import com.palantir.transforms.lang.java.api.Output;
 import com.palantir.transforms.lang.java.api.TransformProfiles;
 import java.util.Optional;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;

 public final class SimpleTabularExcel {

     @Compute
     @TransformProfiles({"EXECUTOR_MEMORY_MEDIUM"})
     public void myComputeFunction(
             @Input("/path/to/input") FoundryInput myInput,
             @Output("/path/to/output") FoundryOutput myOutput) {
         Parser tableParser = TableParser.builder().build();
         TransformsExcelParser transformsParser = TransformsExcelParser.of(tableParser);
         ParseResult result =
                 transformsParser.parse(myInput.asFiles().getFileSystem().filesAsDataset());
         Optional<Dataset<Row>> df = result.singleResult();
         // df is empty if parsing produced no data, so fail loudly rather than calling
         // get() on a possibly empty Optional.
         Dataset<Row> dataframe = df.orElseThrow(
                 () -> new IllegalStateException("No data was parsed from the input files"));
         myOutput.getDataFrameWriter(dataframe).write();
     }
 }
 

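If failing fast on an empty result is not desirable, the error dataframe can instead be written to its own output and reviewed separately (the same trade-off is discussed in the incremental example below). The following is a minimal sketch of that approach; the class name and the error output path are illustrative, and the calls used are the same ones shown in the example above.

 import com.palantir.transforms.excel.ParseResult;
 import com.palantir.transforms.excel.Parser;
 import com.palantir.transforms.excel.TransformsExcelParser;
 import com.palantir.transforms.excel.table.TableParser;
 import com.palantir.transforms.lang.java.api.Compute;
 import com.palantir.transforms.lang.java.api.FoundryInput;
 import com.palantir.transforms.lang.java.api.FoundryOutput;
 import com.palantir.transforms.lang.java.api.Input;
 import com.palantir.transforms.lang.java.api.Output;
 import com.palantir.transforms.lang.java.api.TransformProfiles;

 public final class SimpleTabularExcelWithErrorOutput {

     @Compute
     @TransformProfiles({"EXECUTOR_MEMORY_MEDIUM"})
     public void myComputeFunction(
             @Input("/path/to/input") FoundryInput myInput,
             @Output("/path/to/output") FoundryOutput myOutput,
             @Output("/path/to/error_output") FoundryOutput errorOutput) {
         Parser tableParser = TableParser.builder().build();
         TransformsExcelParser transformsParser = TransformsExcelParser.of(tableParser);
         ParseResult result =
                 transformsParser.parse(myInput.asFiles().getFileSystem().filesAsDataset());
         // Persist parse errors to a dedicated dataset so they can be reviewed
         // asynchronously instead of failing the build.
         errorOutput.getDataFrameWriter(result.errorDataframe()).write();
         // Write the parsed data only if any was produced.
         result.singleResult().ifPresent(df -> myOutput.getDataFrameWriter(df).write());
     }
 }
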
Incremental example

 import com.palantir.transforms.excel.ParseResult;
 import com.palantir.transforms.excel.Parser;
 import com.palantir.transforms.excel.TransformsExcelParser;
 import com.palantir.transforms.excel.functions.RegexSubstringMatchingSheetSelector;
 import com.palantir.transforms.excel.table.CaseNormalizationOption;
 import com.palantir.transforms.excel.table.SimpleHeaderExtractor;
 import com.palantir.transforms.excel.table.TableParser;
 import com.palantir.transforms.excel.utils.IncrementalUtils;
 import com.palantir.transforms.lang.java.api.Compute;
 import com.palantir.transforms.lang.java.api.FilesModificationType;
 import com.palantir.transforms.lang.java.api.FoundryFiles;
 import com.palantir.transforms.lang.java.api.FoundryInput;
 import com.palantir.transforms.lang.java.api.FoundryOutput;
 import com.palantir.transforms.lang.java.api.Input;
 import com.palantir.transforms.lang.java.api.Output;
 import com.palantir.transforms.lang.java.api.ReadRange;

 public final class IncrementalTransform {

     @Compute
     public void myComputeFunction(
             @Input("/path/to/input") FoundryInput myInput,
             @Output("/path/to/sheet1_output") FoundryOutput sheet1Output,
             @Output("/path/to/sheet2_output") FoundryOutput sheet2Output) {
         // Specifying either CONVERT_TO_LOWERCASE or CONVERT_TO_UPPERCASE for a
         // CaseNormalizationOption is especially important with incremental processing
         // to avoid subtle issues due to inconsistent casing between input files across
         // incremental batches.
         Parser sheet1Parser = TableParser.builder()
                 .headerExtractor(SimpleHeaderExtractor.builder()
                         .caseNormalizationOption(CaseNormalizationOption.CONVERT_TO_LOWERCASE).build())
                 .sheetSelector(new RegexSubstringMatchingSheetSelector("Sheet1")).build();
         Parser sheet2Parser = TableParser.builder()
                 .headerExtractor(SimpleHeaderExtractor.builder()
                         .caseNormalizationOption(CaseNormalizationOption.CONVERT_TO_LOWERCASE).build())
                 .sheetSelector(new RegexSubstringMatchingSheetSelector("Sheet2")).build();
         TransformsExcelParser transformsParser = TransformsExcelParser.builder()
                 .putKeyToParser("Sheet1", sheet1Parser)
                 .putKeyToParser("Sheet2", sheet2Parser)
                 .build();
         FoundryFiles foundryFiles = myInput.asFiles();
         ParseResult result = transformsParser.parse(
                 foundryFiles.getFileSystem(ReadRange.UNPROCESSED).filesAsDataset());
         // With incremental processing in particular, you often want fail-fast
         // behavior as below, instead of writing the error dataframe to a separate output and
         // checking it asynchronously. If fail-fast is not an option and you
         // adopt the "write error dataframe to separate output" approach, note
         // that you will need to either ① re-upload files that had parse errors to the input dataset
         // or ② force a snapshot build of this transform via a manual dummy transaction
         // on one of the output datasets in order to trigger the reprocessing of the
         // files that had parse errors.
         if (result.errorDataframe().count() > 0) {
             throw new RuntimeException("Errors: " + result.errorDataframe().collectAsList().toString());
         }
         FilesModificationType filesModificationType = foundryFiles.modificationType();
         // The logic below assumes that it is normal and expected for a subset of the
         // parsers to find no data in a given incremental batch of files. If that is not
         // the case, you may want to raise an exception when a subset of the dataframes
         // is absent even though there were unprocessed files in the input.
         // An absent result does not necessarily mean that an error will be present
         // in the error dataframe (for example, a SheetSelector returning an empty
         // collection of sheets is not considered an error).
         result.dataframeForKey("Sheet1").ifPresent(
                 dataframe -> IncrementalUtils.writeAppendingIfPossible(filesModificationType, dataframe, sheet1Output));
         result.dataframeForKey("Sheet2").ifPresent(
                 dataframe -> IncrementalUtils.writeAppendingIfPossible(filesModificationType, dataframe, sheet2Output));
     }
 }
 
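A note on the final write step: judging from its name and its FilesModificationType argument, IncrementalUtils.writeAppendingIfPossible presumably appends the new rows to the output when the input has only had files added since the last build, and falls back to a snapshot write otherwise.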

Miscellaneous notes

  • xls, xlsx, and xlsm files are supported. xlsb is not currently supported.
  • This library uses Apache POI under the hood (and deliberately exposes some Apache POI classes as part of its API to facilitate customization of behavior as required).
  • Because Apache POI is a memory-hungry library, it is often necessary to use at least the EXECUTOR_MEMORY_MEDIUM Spark profile, even for relatively small Excel files on the order of 10 MB. For large files on the order of 120 MB, it can be necessary to combine EXECUTOR_MEMORY_LARGE with EXECUTOR_CORES_EXTRA_SMALL to grant the maximum possible amount of memory to each Spark task (see the sketch after this list).
  • Insufficient memory when processing a large Excel file can cause a transform to hang indefinitely without succeeding or failing. If a transform has been running for an excessive amount of time (roughly 10 minutes or longer) with no progress in completed tasks, you should usually cancel the build and retry with a higher-memory or lower-core profile. Note that taking around 8 minutes to process a 120 MB Excel file is expected even when memory is sufficient.
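
As a concrete illustration of the last two bullets, a transform processing very large Excel files might combine the profiles mentioned above. This is only a sketch: the annotation usage mirrors the minimal example, and the specific profile combination should be tuned to the actual file sizes involved.

 // Maximize executor memory and minimize cores per executor so that each Spark
 // task gets as much memory as possible when parsing very large Excel files.
 @Compute
 @TransformProfiles({"EXECUTOR_MEMORY_LARGE", "EXECUTOR_CORES_EXTRA_SMALL"})
 public void myComputeFunction(
         @Input("/path/to/input") FoundryInput myInput,
         @Output("/path/to/output") FoundryOutput myOutput) {
     // ... same parsing logic as in the examples above ...
 }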