There are many ways to extract tabular data from a schemaless dataset containing Microsoft Excel files in Foundry, including Pipeline Builder, Python file-based transforms using open-source libraries like openpyxl ↗, and Java file-based transforms using open-source libraries like Apache POI ↗.
In addition to these options, Palantir provides a library called transforms-excel-parser, which wraps Apache POI with sensible default behavior to make it easy to use in a transforms-java repository with minimal configuration.
Some examples of the useful features and behavior provided by this library include the errorDataframe() method on the ParseResult class, which can be checked at runtime to fail a job or, alternatively, written to a separate output and checked asynchronously.

To add the library to your repository:
1. Search for transforms-excel-parser in the repository's Maven library panel.
2. Select the latest version available. This will show a dialog for importing transforms-excel-parser-bundle as an additional backing repository.
3. Select "Add library".
You may see eddie-spark-module-bundle and/or ri.eddie.artifacts.repository.bundles in addition to transforms-excel-parser-bundle as dropdown options. The backing repositories with eddie in their name are intended for the exclusive use of the Pipeline Builder application; they are not the appropriate choice here, and using them may lead to issues in the future. If you do not see transforms-excel-parser-bundle as an option, contact your Palantir representative for installation.
Add the latest version available to your transforms-java/build.gradle file as shown below. Note that transforms-java/build.gradle is a hidden file, so you will need to toggle the Show hidden files setting in order to see it.
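The dependency declaration will look roughly like the following sketch; the group and version here are placeholders, so use the exact coordinates displayed in the Maven library panel after importing the bundle.

```groovy
dependencies {
    // Placeholder coordinates for illustration only; substitute the exact
    // group, artifact, and version shown in the Maven library panel.
    implementation '<group>:transforms-excel-parser:<version>'
}
```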
For detailed API documentation, download the javadoc archive, unzip it, and view the contained HTML files in a web browser. The best place to start when reading the javadoc is com/palantir/transforms/excel/package-summary.html.
Standard Excel formats such as xls and xlsx are supported; note that xlsb files are not currently supported.
When running Code Assist preview, you may observe an issue where the first run after workspace startup succeeds and the second run fails with an error similar to the following:
java.lang.ClassCastException: class com.palantir.transforms.excel.KeyedParsedRecord cannot be cast to class com.palantir.transforms.excel.KeyedParsedRecord (com.palantir.transforms.excel.KeyedParsedRecord is in unnamed module of loader java.net.URLClassLoader @5a5d825a; com.palantir.transforms.excel.KeyedParsedRecord is in unnamed module of loader java.net.URLClassLoader @53dafc50)
This issue is exclusive to the Code Assist preview functionality and does not lead to issues at build time. Refreshing your browser window should allow you to preview again without performing a full Code Assist workspace rebuild.
The Apache POI library is known for its high memory consumption, which means that even relatively small Excel files can result in a considerable memory footprint when opened. As a result, it is common for default transform Spark profile settings to provide insufficient memory per task to accommodate the in-memory objects. Insufficient memory can result in the following issues:
- Job failure with an error such as: Spark module '{module_rid}' died while job '{job_rid}' was using it.
- Job stalling. A rough guideline for identifying whether a job is stalling is that it takes more than 10 minutes to process a single file, since, given sufficient memory, even a very large Excel file can take about 8 minutes to process. Note that a single Spark task can process multiple input files, so it is not always straightforward to apply this rule.
Whether the symptom of insufficient memory is job failure or job stalling, it is advisable to resolve the issue by switching to a Spark profile combination that provides more memory per task, such as EXECUTOR_MEMORY_LARGE and EXECUTOR_CORES_EXTRA_SMALL.
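The following example parses simple tabular Excel files, using a SimpleHeaderExtractor configured for a header on the second row, and writes the parsed rows to one output and any parse errors to a separate error output.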
```java
package myproject.datasets;

import com.palantir.transforms.excel.ParseResult;
import com.palantir.transforms.excel.Parser;
import com.palantir.transforms.excel.TransformsExcelParser;
import com.palantir.transforms.excel.table.SimpleHeaderExtractor;
import com.palantir.transforms.excel.table.TableParser;
import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import java.util.Optional;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class SimpleTabularExcel {
    @Compute
    public void myComputeFunction(
            @Input("<input_dataset_rid>") FoundryInput myInput,
            @Output("<output_dataset_rid>") FoundryOutput myOutput,
            @Output("<error_output_dataset_rid>") FoundryOutput errorOutput) {
        // Create a TableParser with an appropriately configured SimpleHeaderExtractor.
        // In this example, the header of the file is on the second row.
        // If the header were on the first row, we would not need to
        // specify rowsToSkip, since the default value is 0, and in fact
        // we could just do TableParser.builder().build() in that case.
        Parser tableParser = TableParser.builder()
                .headerExtractor(
                        SimpleHeaderExtractor.builder().rowsToSkip(1).build())
                .build();

        // Create a TransformsExcelParser with the TableParser
        TransformsExcelParser transformsParser = TransformsExcelParser.of(tableParser);

        // Parse input
        ParseResult result =
                transformsParser.parse(myInput.asFiles().getFileSystem().filesAsDataset());

        // Get the parsed data, which may be empty if there were no rows in the input
        // or an error occurred
        Optional<Dataset<Row>> maybeDf = result.singleResult();

        // If parsed data is not empty, write it to the output dataset
        maybeDf.ifPresent(df -> myOutput.getDataFrameWriter(df).write());

        // Write error information to the error output
        errorOutput.getDataFrameWriter(result.errorDataframe()).write();
    }
}
```
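The next example handles a more complex layout in which the header spans multiple merged rows, using a MultilayerMergedHeaderExtractor configured with the header's top-left and bottom-right cells (A1 and D2 in this case).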
```java
package myproject.datasets;

import com.palantir.transforms.excel.ParseResult;
import com.palantir.transforms.excel.Parser;
import com.palantir.transforms.excel.TransformsExcelParser;
import com.palantir.transforms.excel.table.MultilayerMergedHeaderExtractor;
import com.palantir.transforms.excel.table.TableParser;
import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import java.util.Optional;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class ComplexHeaderExcel {
    @Compute
    public void myComputeFunction(
            @Input("<input_dataset_rid>") FoundryInput myInput,
            @Output("<output_dataset_rid>") FoundryOutput myOutput,
            @Output("<error_output_dataset_rid>") FoundryOutput errorOutput) {
        // Create a TableParser with a MultilayerMergedHeaderExtractor
        Parser tableParser = TableParser.builder()
                .headerExtractor(MultilayerMergedHeaderExtractor.builder()
                        .topLeftCellName("A1")
                        .bottomRightCellName("D2")
                        .build())
                .build();

        // Create a TransformsExcelParser with the TableParser
        TransformsExcelParser transformsParser = TransformsExcelParser.of(tableParser);

        // Parse input
        ParseResult result =
                transformsParser.parse(myInput.asFiles().getFileSystem().filesAsDataset());

        // Get the parsed data, which may be empty if there were no rows in the input
        // or an error occurred
        Optional<Dataset<Row>> maybeDf = result.singleResult();

        // If parsed data is not empty, write it to the output dataset
        maybeDf.ifPresent(df -> myOutput.getDataFrameWriter(df).write());

        // Write error information to the error output
        errorOutput.getDataFrameWriter(result.errorDataframe()).write();
    }
}
```
In the following example, we register multiple FormParser instances. It is also possible to register a mix of FormParser and TableParser instances, which is a common pattern with complex forms that include tabular elements (within the same sheet or across sheets); a sketch of that mixed registration follows the example below.
```java
package myproject.datasets;

import com.palantir.transforms.excel.ParseResult;
import com.palantir.transforms.excel.Parser;
import com.palantir.transforms.excel.TransformsExcelParser;
import com.palantir.transforms.excel.form.FieldSpec;
import com.palantir.transforms.excel.form.FormParser;
import com.palantir.transforms.excel.form.Location;
import com.palantir.transforms.excel.form.cellvalue.AdjacentCellAssertion;
import com.palantir.transforms.excel.form.cellvalue.CellValue;
import com.palantir.transforms.excel.functions.RegexSubstringMatchingSheetSelector;
import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;

public final class FormStyleExcel {
    private static final String FORM_A_KEY = "FORM_A";
    private static final String FORM_B_KEY = "FORM_B";

    @Compute
    public void myComputeFunction(
            @Input("<input_dataset_rid>") FoundryInput myInput,
            @Output("<form_a_output_dataset_rid>") FoundryOutput formAOutput,
            @Output("<form_b_output_dataset_rid>") FoundryOutput formBOutput,
            @Output("<error_output_dataset_rid>") FoundryOutput errorOutput) {
        // Form A parser configuration
        Parser formAParser = FormParser.builder()
                .sheetSelector(new RegexSubstringMatchingSheetSelector("Form_A"))
                .addFieldSpecs(createFieldSpec("form_a_field_1", "B1"))
                .addFieldSpecs(createFieldSpec("form_a_field_2", "B2"))
                .build();

        // Form B parser configuration
        Parser formBParser = FormParser.builder()
                .sheetSelector(new RegexSubstringMatchingSheetSelector("Form_B"))
                .addFieldSpecs(createFieldSpec("form_b_field_1", "B1"))
                .addFieldSpecs(createFieldSpec("form_b_field_2", "B2"))
                .build();

        // TransformsExcelParser with both Form A and Form B parsers
        TransformsExcelParser transformsParser = TransformsExcelParser.builder()
                .putKeyToParser(FORM_A_KEY, formAParser)
                .putKeyToParser(FORM_B_KEY, formBParser)
                .build();

        // Parse input
        ParseResult result =
                transformsParser.parse(myInput.asFiles().getFileSystem().filesAsDataset());

        // Write parsed data to the output datasets
        result.dataframeForKey(FORM_A_KEY)
                .ifPresent(df -> formAOutput.getDataFrameWriter(df).write());
        result.dataframeForKey(FORM_B_KEY)
                .ifPresent(df -> formBOutput.getDataFrameWriter(df).write());

        // Write error information to the error output
        errorOutput.getDataFrameWriter(result.errorDataframe()).write();
    }

    // Helper method to concisely create a FieldSpec with an appropriate assertion
    private static FieldSpec createFieldSpec(String fieldName, String cellLocation) {
        return FieldSpec.of(
                fieldName,
                CellValue.builder()
                        .addAssertions(AdjacentCellAssertion.left(1, fieldName))
                        .location(Location.of(cellLocation))
                        .build());
    }
}
```
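As a minimal sketch of the mixed-registration pattern mentioned above, a FormParser and a TableParser can be registered under different keys in the same TransformsExcelParser. The sheet names, keys, and cell locations below are hypothetical, and the fragment assumes the imports used in the examples above (including TableParser from com.palantir.transforms.excel.table) as well as the createFieldSpec helper from the form example.

```java
// Sketch only: sheet names, keys, and cell locations are illustrative.
Parser summaryFormParser = FormParser.builder()
        .sheetSelector(new RegexSubstringMatchingSheetSelector("Summary"))
        .addFieldSpecs(createFieldSpec("report_date", "B1"))
        .build();

Parser detailTableParser = TableParser.builder()
        .sheetSelector(new RegexSubstringMatchingSheetSelector("Details"))
        .build();

TransformsExcelParser mixedParser = TransformsExcelParser.builder()
        .putKeyToParser("SUMMARY", summaryFormParser)
        .putKeyToParser("DETAILS", detailTableParser)
        .build();
```

Each registered key then yields its own dataframe via dataframeForKey, exactly as in the form example above.

The following example demonstrates incremental processing: it parses two sheets from each unprocessed input file into separate outputs, failing fast on parse errors and appending to the outputs when possible.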
```java
package myproject.datasets;

import com.palantir.transforms.excel.ParseResult;
import com.palantir.transforms.excel.Parser;
import com.palantir.transforms.excel.TransformsExcelParser;
import com.palantir.transforms.excel.functions.RegexSubstringMatchingSheetSelector;
import com.palantir.transforms.excel.table.CaseNormalizationOption;
import com.palantir.transforms.excel.table.SimpleHeaderExtractor;
import com.palantir.transforms.excel.table.TableParser;
import com.palantir.transforms.excel.utils.IncrementalUtils;
import com.palantir.transforms.lang.java.api.*;

public final class IncrementalTransform {
    @Compute
    public void myComputeFunction(
            @Input("<input_dataset_rid>") FoundryInput myInput,
            @Output("<sheet_1_output_dataset_rid>") FoundryOutput sheet1Output,
            @Output("<sheet_2_output_dataset_rid>") FoundryOutput sheet2Output) {
        // Define the parsers.
        // Specifying either CONVERT_TO_LOWERCASE or CONVERT_TO_UPPERCASE for a
        // CaseNormalizationOption is especially important with incremental processing
        // to avoid subtle issues due to inconsistent casing between input files across
        // incremental batches.
        Parser sheet1Parser = TableParser.builder()
                .headerExtractor(SimpleHeaderExtractor.builder()
                        .caseNormalizationOption(CaseNormalizationOption.CONVERT_TO_LOWERCASE).build())
                .sheetSelector(new RegexSubstringMatchingSheetSelector("Sheet1")).build();
        Parser sheet2Parser = TableParser.builder()
                .headerExtractor(SimpleHeaderExtractor.builder()
                        .caseNormalizationOption(CaseNormalizationOption.CONVERT_TO_LOWERCASE).build())
                .sheetSelector(new RegexSubstringMatchingSheetSelector("Sheet2")).build();
        TransformsExcelParser transformsParser = TransformsExcelParser.builder()
                .putKeyToParser("Sheet1", sheet1Parser)
                .putKeyToParser("Sheet2", sheet2Parser).build();

        // Parse the data
        FoundryFiles foundryFiles = myInput.asFiles();
        ParseResult result =
                transformsParser.parse(foundryFiles.getFileSystem(ReadRange.UNPROCESSED).filesAsDataset());

        // Check for errors and fail fast.
        // With incremental processing in particular, it is often better to fail fast
        // instead of writing the error dataframe to a separate output and
        // checking it asynchronously. If failing fast is not an option and you
        // adopt the "write error dataframe to separate output" approach, note
        // that you will need to either ① re-upload files that had parse errors to the input dataset
        // or ② force a snapshot build of this transform via a manual dummy transaction
        // on one of the output datasets in order to trigger the reprocessing of the
        // files that had parse errors.
        if (result.errorDataframe().count() > 0) {
            throw new RuntimeException("Errors: " + result.errorDataframe().collectAsList().toString());
        }

        // Write parsed data incrementally to the outputs via an APPEND transaction if possible
        // and a merge-and-replace SNAPSHOT transaction if not.
        // The below implementation assumes that it is normal and expected for a subset of the
        // parsers to find no data in a given incremental batch of files. If that is not
        // the case, you may want to raise an exception if a subset of the dataframes is
        // absent and there were a non-zero number of unprocessed files in the input.
        // An absent result does not necessarily mean that an error will be present
        // in the error dataframe (for example, a SheetSelector returning an empty
        // collection of sheets is not considered an error).
        FilesModificationType filesModificationType = foundryFiles.modificationType();
        result.dataframeForKey("Sheet1").ifPresent(
                dataframe -> IncrementalUtils.writeAppendingIfPossible(filesModificationType, dataframe, sheet1Output));
        result.dataframeForKey("Sheet2").ifPresent(
                dataframe -> IncrementalUtils.writeAppendingIfPossible(filesModificationType, dataframe, sheet2Output));
    }
}
```