Configuration reference

Warning

This section describes advanced manual settings that can leave your SDDI pipeline in a broken state if applied incorrectly. Always verify changes on a branch before deploying to production.

SDDI's pipeline is generated by a fully automated code repository. Cockpit is the default place to interact with these configurations, but you may need to manually amend the configuration files to set advanced parameters or to configure non-standard source types.

To review the steps involved, read about pipeline generation.

Configuration is performed in two main files, both located in the transforms-bellhop/src/config/ folder:

SourceConfig.yaml

The following is a notional example of a fully-defined SourceConfig file.

```yaml
sourceName: MY_SOURCE
sourceRid: ri.magritte..source.abcdefgh-1234-5678-910a-zyxwvut
sapContext:
  type: direct
rawFolderStructure:
  raw: /HyperAuto/source/raw
  dataDictionary: /HyperAuto/source/metadata
cleaningLibraries:
  - convert_all_columns_to_clean_types
deploymentSemanticVersion: 2
metadataSparkProfiles:
  - DRIVER_MEMORY_MEDIUM
languageKey: 'E'
tables:
  - tableName: ABCD
    datasetTransformsConfig:
      datasetName: ABCD
      deduplicationComparisonColumns: []
      batchUnionComponents: []
      tableCleaningLibraries: []
  - tableName: WXYZ
    datasetTransformsConfig:
      datasetName: WXYZ
      deduplicationComparisonColumns:
        - /PALANTIR/TIMESTAMP
        - /PALANTIR/ROWNO
      batchUnionComponents:
        - WXYZ_historical
        - WXYZ_incremental
      tableCleaningLibraries:
        - parse_timestamp_column
      sparkProfiles:
        profiles:
          - EXECUTOR_MEMORY_MEDIUM
          - NUM_EXECUTORS_4
```

Parameters description

| Parameter | Description |
| --- | --- |
| sourceName | The name to identify a source system. Used to prefix primary and foreign keys. |
| sourceRid | The RID of the source attached to this SDDI instance. |
| sapContext | (Optional) Details of the SAP context. |
| rawFolderStructure | Defines the folders in which the raw data and metadata reside. |
| cleaningLibraries | List of cleaning libraries to apply to all tables. |
| deduplicationConfig | (Optional, default: None) Config used to specify which columns to use for the deduplication logic. |
| metadataSparkProfiles | (Optional, default: None) List of Spark profiles to apply to metadata generation. |
| languageKey | (Optional, default: 'E') Language to use in enrichments. |
| deploymentSemanticVersion | (Optional, default: 0) Semantic version of the pipeline; incrementing it will force a snapshot. |
| tables | List of tables from that source to be processed by SDDI. |

sapContext

(Optional) Details of the SAP context. SAP Explorer will use this to pre-select the context. Each context will need to have its own SourceConfig file.
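For instance, a SourceConfig targeting a direct SAP context would carry the same block shown in the example above:

```yaml
sapContext:
  type: direct   # one SourceConfig file per SAP context
```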

rawFolderStructure

Defines the folders in which the raw data and metadata reside.

Fields:

  • raw: Path of the folder where raw tables are ingested.
  • dataDictionary: (Optional, default: raw) Path of the folder where metadata tables are ingested.
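
A minimal sketch, reusing the paths from the example above; omitting dataDictionary makes metadata default to the raw path:

```yaml
rawFolderStructure:
  raw: /HyperAuto/source/raw                  # raw tables are ingested here
  dataDictionary: /HyperAuto/source/metadata  # optional; defaults to the raw path
```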

cleaningLibraries

List of cleaning libraries to apply to all tables. Cleaning functions are defined in transforms-bellhop/src/software_defined_integrations/transforms/cleaned/function_libraries.

Adding or removing a function requires incrementing the deploymentSemanticVersion.
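
For example, adding a second library (the first function name is taken from the example above; the second is hypothetical) would also require a version bump:

```yaml
cleaningLibraries:
  - convert_all_columns_to_clean_types
  - trim_whitespace_in_string_columns   # hypothetical function name
deploymentSemanticVersion: 3            # incremented to force a snapshot
```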

deduplicationConfig

(Optional, default: None) Config used to specify which columns to use for the deduplication logic. The configuration defined here applies to all tables.

Fields:

  • comparisonColumns: Columns whose maximum value determines which row is kept for each primary key.
  • changeModeColumn: (Optional) If specified, rows with the value D in this column will be deleted.
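
A sketch combining both fields; the column names below are hypothetical:

```yaml
deduplicationConfig:
  comparisonColumns:
    - /PALANTIR/TIMESTAMP   # row with the max timestamp wins per primary key
    - /PALANTIR/ROWNO       # tie-breaker when timestamps are equal
  changeModeColumn: /PALANTIR/CHANGEMODE  # rows with value D are deleted
```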

deploymentSemanticVersion

(Optional, default: 0) Semantic version of the pipeline; incrementing it will force a snapshot.

See Incremental Transforms for the effects of deploymentSemanticVersion on incremental and snapshot transforms.
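
Since the default is 0, the first forced snapshot of a pipeline would be configured as:

```yaml
deploymentSemanticVersion: 1   # was 0 (default); any increment forces a snapshot
```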

metadataSparkProfiles

(Optional, default: None) List of Spark profiles to apply to metadata dataset generation (objects, fields, links and diffs).

Be sure the profiles are added to the repository before referencing them here.
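
A sketch using the profile from the example above; the profile must already exist in the repository:

```yaml
metadataSparkProfiles:
  - DRIVER_MEMORY_MEDIUM   # applied to objects, fields, links, and diffs generation
```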

tables

List of tables from defined source to be processed by SDDI.

Fields:

  • tableName: Name of the table in metadata.
  • datasetTransformsConfig
    • datasetName: Foundry dataset name of the raw data.
    • deduplicationComparisonColumns: Table-specific config used to deduplicate data and specify which columns to use for the deduplication logic. Applied after the global deduplication fields.
    • changeModeColumn: (Optional) If specified, rows with the value D in this column will be deleted. Overrides the global change mode column.
    • batchUnionComponents: List of input dataset names that should be unioned before the cleaning step.
    • sparkProfiles: (Optional) Spark profiles to apply at different stages of the transforms.
      • profiles: Spark profiles; see details for adding them to the repository.
      • stages: (Optional, default: None) Transform stages the profiles should be applied to. Value should be in [CLEANED, DERIVED, ENRICHED, FINAL, RENAMED, RENAMED_CHANGELOG]. If None, profiles are applied at all stages.
    • tableCleaningLibraries: List of cleaning libraries to apply to this table. Cleaning functions are defined in transforms-bellhop/src/software_defined_integrations/transforms/cleaned/function_libraries. Adding or removing a function will require you to increment the deploymentSemanticVersion.
    • enforceUniquePrimaryKeys: (Optional, default: False) If True and deduplicationComparisonColumns are defined, guarantees that only one record per primary key is kept at the deduplication stage. This may result in non-deterministic behavior.
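
The example at the top of this section does not exercise every optional field. A sketch of a single table entry combining changeModeColumn, stages, and enforceUniquePrimaryKeys (column and profile names are hypothetical):

```yaml
tables:
  - tableName: WXYZ
    datasetTransformsConfig:
      datasetName: WXYZ
      deduplicationComparisonColumns:
        - /PALANTIR/TIMESTAMP
      changeModeColumn: /PALANTIR/CHANGEMODE  # overrides the global change mode column
      batchUnionComponents: []
      tableCleaningLibraries: []
      enforceUniquePrimaryKeys: True          # keep exactly one record per primary key
      sparkProfiles:
        profiles:
          - EXECUTOR_MEMORY_MEDIUM
        stages:                               # restrict profiles to these stages only
          - CLEANED
          - ENRICHED
```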

PipelineConfig.yaml

The following is a notional example of a fully-defined PipelineConfig file.

```yaml
sourceName: HyperAuto
sourceType: SAP_ERP
sourceConfigFileNames:
  - SourceConfig.yaml
outputFolder: /HyperAuto/source/output
workflows:
  my_workflow:
    variables:
      - name: my_variable_name
        value: my_variable_value
    enrichments:
      - my_enrichment_name
tables:
  ABCD:
    displayName: Header Table
    types:
      - OBJECT
  WXYZ:
    displayName: Item Table
    types:
      - OBJECT
      - METADATA
disableForeignKeyGeneration: False
disableEnrichedStage: False
disableRenamedStage: False
```

Parameters description

| Parameter | Description |
| --- | --- |
| projectName | Project name. Serves as a prefix to Ontology objects. |
| sourceType | Type of source supported by SDDI. Must be one of [SAP_ERP, SALESFORCE, ORACLE_NETSUITE]. |
| sourceConfigFileNames | List of SourceConfig filenames to include in the pipeline. |
| outputFolder | Defines the folder in which output datasets will be written. |
| workflows | List of workflows to deploy, with configurations. |
| tables | List of tables processed in this SDDI pipeline. |
| disableEnrichedStage | (Optional, default: False) If enabled, no enriched datasets will be produced. Use with caution, as enabling it may break workflows. |
| disableRenamedStage | (Optional, default: False) If enabled, no renamed_changelog datasets will be produced. Use with caution, as enabling it may break workflows. |
| disableForeignKeyGeneration | If enabled, no foreign key columns will be produced. Use with caution, as enabling it may break workflows. |

tables

List of tables processed in this SDDI pipeline:

  • displayName: Human-readable name of the table. The output dataset name is constructed in the form displayName (technicalName).
  • types: List of data types this table represents (can be many).
    • OBJECT: Master data table that constitutes an object in the ontology.
    • METADATA: Metadata table that contains information on objects and is used to construct primary keys.
    • CUSTOMIZATION: Enrichment table that is joined to master data tables at enriched step of SDDI pipeline.
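
Putting the three types together, a tables block with a hypothetical CUSTOMIZATION table might look like:

```yaml
tables:
  ABCD:
    displayName: Header Table
    types:
      - OBJECT
  T005T:                       # hypothetical enrichment table
    displayName: Country Names
    types:
      - CUSTOMIZATION          # joined to master data at the enriched step
```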