Spark profiles reference

This page is a reference for the Spark profiles available in Foundry. For an introduction to Spark and to how profiles work, see the Spark profiles overview documentation.

Driver cores

The profiles in this family configure the value of spark.driver.cores.

This controls how many CPU cores are assigned to the Spark driver. In practice, this should not need to be overridden, except in special cases where many Spark jobs run concurrently in the same Spark module.
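In Python Transforms, a profile can be applied to a single transform with the @configure decorator. A minimal sketch, assuming the Python Transforms API (transforms.api); the dataset paths are placeholders:

```python
from transforms.api import configure, transform_df, Input, Output

# DRIVER_CORES_MEDIUM assigns 2 CPU cores to the driver (see the profile table below).
@configure(profile=["DRIVER_CORES_MEDIUM"])
@transform_df(
    Output("/Project/datasets/deduplicated"),  # placeholder path
    source=Input("/Project/datasets/raw"),     # placeholder path
)
def compute(source):
    return source.dropDuplicates()
```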

Driver memory

The profiles in this family configure the value of spark.driver.memory.

This controls how much memory is assigned to the Spark driver JVM. It may need to be raised in some situations, for example when collecting large amounts of data back to the driver or when performing large broadcast joins.

Note that this only controls the JVM memory, not the memory available to Python processes. If you are pulling lots of data locally to transform it with pandas, you will need a different profile, such as one of the Driver Memory Overhead profiles listed in the table below.
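For example, a broadcast join materializes the small side of the join in full, which puts pressure on driver memory. A sketch following the same pattern as above; the paths and join key are placeholders:

```python
from pyspark.sql import functions as F
from transforms.api import configure, transform_df, Input, Output

# DRIVER_MEMORY_LARGE sets spark.driver.memory to 13g and
# spark.driver.maxResultSize to 8g (see the profile table below).
@configure(profile=["DRIVER_MEMORY_LARGE"])
@transform_df(
    Output("/Project/datasets/enriched"),
    facts=Input("/Project/datasets/facts"),
    lookup=Input("/Project/datasets/lookup"),
)
def compute(facts, lookup):
    # The broadcast hint ships the full lookup table to every executor.
    return facts.join(F.broadcast(lookup), "key")
```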

Executor cores

The profiles in this family configure the value of spark.executor.cores.

This controls how many CPU cores are assigned to each Spark executor, which in turn controls how many tasks run concurrently in each executor. In practice, this should rarely need to be overridden in normal Transforms jobs.

Executor memory

The profiles in this family configure the value of spark.executor.memory and associated settings.

This controls how much memory is assigned to each Spark executor JVM. It may need to be raised if the amount of data processed in each Spark task is very large.

This memory is shared between all tasks running on the executor; the number of concurrent tasks is controlled by the Executor Cores profiles.
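Cores and memory profiles therefore combine to determine the memory available per task. A sketch, with the figures taken from the profile table below:

```python
from transforms.api import configure, transform_df, Input, Output

# EXECUTOR_CORES_MEDIUM runs 4 concurrent tasks per executor, and
# EXECUTOR_MEMORY_MEDIUM gives each executor a 13g JVM heap, so each
# task can use roughly 13g / 4 ≈ 3.25g of heap.
@configure(profile=["EXECUTOR_CORES_MEDIUM", "EXECUTOR_MEMORY_MEDIUM"])
@transform_df(
    Output("/Project/datasets/daily_counts"),
    source=Input("/Project/datasets/events"),
)
def compute(source):
    return source.groupBy("day").count()  # placeholder aggregation
```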

Executor memory overhead

The profiles in this family configure the value of spark.executor.memoryOverhead.

This controls how much memory is assigned to each container in addition to the Spark executor JVM memory. It may need to be raised if your job requires a significant amount of memory outside the JVM.
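Python worker processes run outside the executor JVM, so jobs that lean heavily on Python UDFs may need more overhead rather than more executor memory. A sketch under that assumption:

```python
from transforms.api import configure, transform_df, Input, Output

# EXECUTOR_MEMORY_OVERHEAD_LARGE grants 4g of non-JVM memory per
# container, which is where Python UDF workers allocate.
@configure(profile=["EXECUTOR_MEMORY_OVERHEAD_LARGE"])
@transform_df(
    Output("/Project/datasets/scored"),
    source=Input("/Project/datasets/features"),
)
def compute(source):
    return source  # a memory-hungry Python UDF would run here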

Number of executors

The profiles in this family configure the value of spark.executor.instances and associated settings.

This controls how many executors are requested to run the job. Increasing this value increases the number of tasks that can run in parallel, which improves performance (provided the job is sufficiently parallel) at the cost of using more resources.

In practice this should only need to be overridden for large jobs with a particular organizational need to run very quickly.
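Total parallelism is the executor count multiplied by the cores per executor. A sketch combining two profiles from the table below:

```python
from transforms.api import configure, transform_df, Input, Output

# NUM_EXECUTORS_32 with EXECUTOR_CORES_SMALL (2 cores each) allows
# 32 * 2 = 64 tasks to run in parallel.
@configure(profile=["NUM_EXECUTORS_32", "EXECUTOR_CORES_SMALL"])
@transform_df(
    Output("/Project/datasets/output"),
    source=Input("/Project/datasets/input"),
)
def compute(source):
    return source.repartition(64)  # match partition count to parallelism
```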

Dynamic allocation

The profiles in this family configure the value of spark.dynamicAllocation.enabled, spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors.

This controls how many executors are requested to run the job by specifying a range of executors rather than a fixed count. Spark scales the number of executors up to maxExecutors and relinquishes them when they are no longer needed. This can help when the number of executors a job needs varies between runs, and in some cases it can speed up launch times. Note that the module is not guaranteed to receive the requested maxExecutors, and because the number of executors varies, performance may differ from one run to the next.

In practice, this should only be overridden for large jobs, and only with a clear understanding of the advantages and disadvantages of dynamic allocation.
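A sketch requesting a range of executors, using one of the combined profiles from the table below:

```python
from transforms.api import configure, transform_df, Input, Output

# DYNAMIC_ALLOCATION_ENABLED_4_8 enables dynamic allocation with
# minExecutors: 4 and maxExecutors: 8; Spark scales within this range
# and is not guaranteed to reach the maximum.
@configure(profile=["DYNAMIC_ALLOCATION_ENABLED_4_8"])
@transform_df(
    Output("/Project/datasets/output"),
    source=Input("/Project/datasets/input"),
)
def compute(source):
    return source
```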

Adaptive query execution

The profiles in this family enable and disable adaptive query execution (AQE).

With AQE enabled, Spark automatically sets the number of partitions at runtime, potentially speeding up your builds. This avoids both too few partitions with insufficient parallelism and many small partitions with excessive overhead.

AQE aims for a balanced output size of 64 MB per partition. For example, a total output size of 512 MB will produce around 8 partitions.

You can increase the target size using the advisory partition size profiles in this family. Partition sizes of 128 MB and larger are recommended if the data written is frequently read, for example in Contour analyses.

You might want to disable AQE if the total output is small but very time-intensive to compute, for example because of expensive UDFs. In that case AQE can reduce parallelism and slow down your computation.
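For example, to raise the AQE target for frequently read output, one of the profiles from the table below can be applied. A sketch:

```python
from transforms.api import configure, transform_df, Input, Output

# ADVISORY_PARTITION_SIZE_MEDIUM enables AQE with a 128 MB target, so a
# 512 MB shuffle output coalesces to roughly 512 / 128 = 4 partitions
# (versus about 8 at the default 64 MB target).
@configure(profile=["ADVISORY_PARTITION_SIZE_MEDIUM"])
@transform_df(
    Output("/Project/datasets/report"),
    source=Input("/Project/datasets/transactions"),
)
def compute(source):
    return source.groupBy("region").sum("amount")  # placeholder aggregation
```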

Number of cores per task

The profiles in this family configure the value of spark.task.cpus.

This controls how many cores are allocated to each task. In practice, this should only rarely be overridden. If you want to control the parallelism of your job, look at the Executor Cores profiles instead.
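For the rare case where each task itself runs multi-threaded code (for example a native library that spawns its own threads), a sketch:

```python
from transforms.api import configure, transform_df, Input, Output

# TASK_CPUS_2 reserves 2 cores per task, halving the number of tasks
# that run concurrently on each executor.
@configure(profile=["TASK_CPUS_2"])
@transform_df(
    Output("/Project/datasets/output"),
    source=Input("/Project/datasets/input"),
)
def compute(source):
    return source
```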

Arrow

Use these profiles to enable or disable Arrow for conversions between pandas and PySpark DataFrames. To use Arrow, ensure that your Transform depends on the pyarrow package.

When calling spark.createDataFrame() with a pandas DataFrame, or calling toPandas() on a Spark DataFrame, Spark has to serialize all rows to convert them from one format to the other. For large DataFrames this is a slow process and can be the bottleneck of your Transform. When using a pandas Transform, this serialization happens both when reading and when writing your data.

Arrow is a more efficient serialization format that significantly speeds up this conversion, as reported on the Arrow website.
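A sketch of a transform that round-trips through pandas with Arrow enabled; the ctx parameter provides the Spark session, and the column names are placeholders:

```python
from transforms.api import configure, transform_df, Input, Output

@configure(profile=["ARROW_ENABLED"])
@transform_df(
    Output("/Project/datasets/totals"),
    source=Input("/Project/datasets/orders"),
)
def compute(ctx, source):
    # Both conversions below go through Arrow serialization when enabled.
    pdf = source.toPandas()
    pdf["total"] = pdf["price"] * pdf["quantity"]  # placeholder columns
    return ctx.spark_session.createDataFrame(pdf)
```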

Kubernetes

The profiles in this family control low-level details of how your Spark job is executed.

When using libraries that are not agnostic to the CPU architecture of the underlying machines, you can use these profiles to force the Spark job to run on a specific architecture. Note that some environments only have access to machines with the AMD64 architecture; jobs that use the ARM64 architecture override will not succeed in those environments.
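A sketch forcing the job onto ARM machines, for example when a native dependency only ships ARM64 builds:

```python
from transforms.api import configure, transform_df, Input, Output

# This override fails in environments that only have AMD64 machines.
@configure(profile=["KUBERNETES_CPU_ARCHITECTURE_OVERRIDE_ARM64"])
@transform_df(
    Output("/Project/datasets/output"),
    source=Input("/Project/datasets/input"),
)
def compute(source):
    return source
```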

Profile table

| Profile Family | Profile Name | Spark Settings |
| --- | --- | --- |
| Driver Cores | DRIVER_CORES_SMALL | spark.driver.cores: 1 |
| Driver Cores | DRIVER_CORES_MEDIUM | spark.driver.cores: 2 |
| Driver Cores | DRIVER_CORES_LARGE | spark.driver.cores: 4 |
| Driver Cores | DRIVER_CORES_EXTRA_LARGE | spark.driver.cores: 8 |
| Driver Cores | DRIVER_CORES_EXTRA_EXTRA_LARGE | spark.driver.cores: 16 |
| Driver Memory | DRIVER_MEMORY_SMALL | spark.driver.memory: 3g |
| Driver Memory | DRIVER_MEMORY_MEDIUM | spark.driver.memory: 6g; spark.driver.maxResultSize: 4g |
| Driver Memory | DRIVER_MEMORY_LARGE | spark.driver.memory: 13g; spark.driver.maxResultSize: 8g |
| Driver Memory | DRIVER_MEMORY_EXTRA_LARGE | spark.driver.memory: 27g; spark.driver.maxResultSize: 16g |
| Driver Memory | DRIVER_MEMORY_EXTRA_EXTRA_LARGE | spark.driver.memory: 54g; spark.driver.maxResultSize: 32g |
| Driver Memory Overhead | DRIVER_MEMORY_OVERHEAD_SMALL | spark.driver.memoryOverhead: 1g |
| Driver Memory Overhead | DRIVER_MEMORY_OVERHEAD_MEDIUM | spark.driver.memoryOverhead: 2g |
| Driver Memory Overhead | DRIVER_MEMORY_OVERHEAD_LARGE | spark.driver.memoryOverhead: 4g |
| Driver Memory Overhead | DRIVER_MEMORY_OVERHEAD_EXTRA_LARGE | spark.driver.memoryOverhead: 8g |
| Driver Memory Overhead | DRIVER_MEMORY_OVERHEAD_EXTRA_EXTRA_LARGE | spark.driver.memoryOverhead: 16g |
| Executor Cores | EXECUTOR_CORES_EXTRA_SMALL | spark.executor.cores: 1 |
| Executor Cores | EXECUTOR_CORES_SMALL | spark.executor.cores: 2 |
| Executor Cores | EXECUTOR_CORES_MEDIUM | spark.executor.cores: 4 |
| Executor Cores | EXECUTOR_CORES_LARGE | spark.executor.cores: 6 |
| Executor Cores | EXECUTOR_CORES_EXTRA_LARGE | spark.executor.cores: 8 |
| Executor Memory | EXECUTOR_MEMORY_EXTRA_SMALL | spark.executor.memory: 3g; spark.executor.memoryOverhead: 768m |
| Executor Memory | EXECUTOR_MEMORY_SMALL | spark.executor.memory: 6g; spark.executor.memoryOverhead: 1536m |
| Executor Memory | EXECUTOR_MEMORY_MEDIUM | spark.executor.memory: 13g; spark.executor.memoryOverhead: 2g |
| Executor Memory | EXECUTOR_MEMORY_LARGE | spark.executor.memory: 27g; spark.executor.memoryOverhead: 3g |
| Executor Memory Off-heap | EXECUTOR_MEMORY_OFFHEAP_FRACTION_MINIMUM | Share of memory to use for off-heap (an "Executor Memory" profile must be set): 30% |
| Executor Memory Off-heap | EXECUTOR_MEMORY_OFFHEAP_FRACTION_LOW | Share of memory to use for off-heap (an "Executor Memory" profile must be set): 50% |
| Executor Memory Off-heap | EXECUTOR_MEMORY_OFFHEAP_FRACTION_MODERATE | Share of memory to use for off-heap (an "Executor Memory" profile must be set): 70% |
| Executor Memory Off-heap | EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH | Share of memory to use for off-heap (an "Executor Memory" profile must be set): 80% |
| Executor Memory Off-heap | EXECUTOR_MEMORY_OFFHEAP_FRACTION_MAXIMUM | Share of memory to use for off-heap (an "Executor Memory" profile must be set): 90% |
| Executor Memory Overhead | EXECUTOR_MEMORY_OVERHEAD_SMALL | spark.executor.memoryOverhead: 1g |
| Executor Memory Overhead | EXECUTOR_MEMORY_OVERHEAD_MEDIUM | spark.executor.memoryOverhead: 2g |
| Executor Memory Overhead | EXECUTOR_MEMORY_OVERHEAD_LARGE | spark.executor.memoryOverhead: 4g |
| Executor Memory Overhead | EXECUTOR_MEMORY_OVERHEAD_EXTRA_LARGE | spark.executor.memoryOverhead: 8g |
| Executor Count | KUBERNETES_NO_EXECUTORS | spark.kubernetes.local.submission: true; spark.sql.shuffle.partitions: 1 |
| Executor Count | NUM_EXECUTORS_1 | spark.executor.instances: 1; spark.dynamicAllocation.maxExecutors: 1 |
| Executor Count | NUM_EXECUTORS_2 | spark.executor.instances: 2; spark.dynamicAllocation.maxExecutors: 2 |
| Executor Count | NUM_EXECUTORS_4 | spark.executor.instances: 4; spark.dynamicAllocation.maxExecutors: 4 |
| Executor Count | NUM_EXECUTORS_8 | spark.executor.instances: 8; spark.dynamicAllocation.maxExecutors: 8 |
| Executor Count | NUM_EXECUTORS_16 | spark.executor.instances: 16; spark.dynamicAllocation.maxExecutors: 16 |
| Executor Count | NUM_EXECUTORS_32 | spark.executor.instances: 32; spark.dynamicAllocation.maxExecutors: 32 |
| Executor Count | NUM_EXECUTORS_64 | spark.executor.instances: 64; spark.dynamicAllocation.maxExecutors: 64 |
| Executor Count | NUM_EXECUTORS_128 | spark.executor.instances: 128; spark.dynamicAllocation.maxExecutors: 128 |
| Executor Count | NUM_EXECUTORS_256 | spark.executor.instances: 256; spark.dynamicAllocation.maxExecutors: 256 |
| Executor Count | NUM_EXECUTORS_512 | spark.executor.instances: 512; spark.dynamicAllocation.maxExecutors: 512 |
| Task CPU Count | TASK_CPUS_2 | spark.task.cpus: 2 |
| Task CPU Count | TASK_CPUS_4 | spark.task.cpus: 4 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_DISABLED | spark.dynamicAllocation.enabled: false |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED | spark.dynamicAllocation.enabled: true |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MIN_2 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 2 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MIN_4 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 4 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MIN_8 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 8 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MIN_16 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 16 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MAX_8 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.maxExecutors: 8 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MAX_16 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.maxExecutors: 16 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MAX_32 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.maxExecutors: 32 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MAX_64 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.maxExecutors: 64 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MAX_128 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.maxExecutors: 128 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_1_2 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 1; spark.dynamicAllocation.maxExecutors: 2 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_2_4 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 2; spark.dynamicAllocation.maxExecutors: 4 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_4_8 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 4; spark.dynamicAllocation.maxExecutors: 8 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_8_16 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 8; spark.dynamicAllocation.maxExecutors: 16 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_16_32 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 16; spark.dynamicAllocation.maxExecutors: 32 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_32_64 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 32; spark.dynamicAllocation.maxExecutors: 64 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_64_128 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 64; spark.dynamicAllocation.maxExecutors: 128 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_FAST_SCALE_DOWN | spark.dynamicAllocation.executorIdleTimeout: 10s |
| Dynamic Allocation | DYNAMIC_ALLOCATION_SLOW_SCALE_UP_2M | spark.dynamicAllocation.schedulerBacklogTimeout: 2m |
| Shuffle Partitions | SHUFFLE_PARTITIONS_SMALL | spark.sql.shuffle.partitions: 20 |
| Shuffle Partitions | SHUFFLE_PARTITIONS_MEDIUM | spark.sql.shuffle.partitions: 200 |
| Shuffle Partitions | SHUFFLE_PARTITIONS_LARGE | spark.sql.shuffle.partitions: 2000 |
| Shuffle Partitions | SHUFFLE_PARTITIONS_EXTRA_LARGE | spark.sql.shuffle.partitions: 20000 |
| Adaptive Query Execution | ADAPTIVE_ENABLED | spark.sql.adaptive.enabled: true |
| Adaptive Query Execution | ADAPTIVE_DISABLED | spark.sql.adaptive.enabled: false |
| Adaptive Query Execution | ADVISORY_PARTITION_SIZE_MEDIUM | spark.sql.adaptive.enabled: true; spark.sql.adaptive.shuffle.targetPostShuffleInputSize: 128MB |
| Adaptive Query Execution | ADVISORY_PARTITION_SIZE_LARGE | spark.sql.adaptive.enabled: true; spark.sql.adaptive.shuffle.targetPostShuffleInputSize: 256MB |
| RPC Message Size | RPC_MESSAGE_MAX_SIZE_512M | spark.rpc.message.maxSize: 512 |
| RPC Message Size | RPC_MESSAGE_MAX_SIZE_1G | spark.rpc.message.maxSize: 1024 |
| RPC Message Size | RPC_MESSAGE_MAX_SIZE_MAX | spark.rpc.message.maxSize: 2047 |
| Legacy | LEGACY_ALLOW_UNTYPED_SCALA_UDF | spark.sql.legacy.allowUntypedScalaUDF: true |
| Legacy | LEGACY_ALLOW_NEGATIVE_DECIMAL_SCALE | spark.sql.legacy.allowNegativeScaleOfDecimal: true |
| Legacy | LEGACY_ALLOW_HASH_ON_MAPTYPE | spark.sql.legacy.allowHashOnMapType: true |
| Legacy | LEGACY_NAME_NON_STRUCT_GROUPING_KEY_AS_VALUE | spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue: true |
| Legacy | LEGACY_ARRAY_EXISTS_NULL_HANDLING | spark.sql.legacy.followThreeValuedLogicInArrayExists: false |
| Legacy | LEGACY_ALLOW_AMBIGUOUS_SELF_JOIN | spark.sql.analyzer.failAmbiguousSelfJoin: false |
| Legacy | LEGACY_TIME_PARSER_POLICY | spark.sql.legacy.timeParserPolicy: LEGACY |
| Legacy | LEGACY_DATETIME_REBASE_MODE | spark.sql.legacy.avro.datetimeRebaseModeInRead: LEGACY; spark.sql.legacy.parquet.datetimeRebaseModeInRead: LEGACY; spark.sql.legacy.avro.datetimeRebaseModeInWrite: LEGACY; spark.sql.legacy.parquet.datetimeRebaseModeInWrite: LEGACY |
| Legacy | LEGACY_FROM_DAYTIME_STRING | spark.sql.legacy.fromDayTimeString.enabled: true |
| Legacy | LEGACY_DATETIME_STRING_COMPARISON | spark.sql.legacy.typeCoercion.datetimeToString.enabled: true |
| Dates & Times | TIME_PARSER_POLICY_CORRECTED | spark.sql.legacy.timeParserPolicy: CORRECTED |
| Dates & Times | SPARK_ALLOW_INT96_AS_TIMESTAMP | spark.sql.parquet.int96AsTimestamp: true |
| Miscellaneous | BUCKET_SORTED_SCAN_ENABLED | spark.sql.sources.bucketing.sortedScan.enabled: true |
| Miscellaneous | LAST_MAP_KEY_WINS | spark.sql.mapKeyDedupPolicy: LAST_WIN |
| Miscellaneous | CROSS_JOIN_ENABLED | spark.sql.crossJoin.enabled: true |
| Miscellaneous | SPECULATIVE_EXECUTION | spark.speculation: true |
| Miscellaneous | AUTO_BROADCAST_JOIN_DISABLED | spark.sql.autoBroadcastJoinThreshold: -1 |
| Miscellaneous | ALLOW_ADD_MONTHS | spark.foundry.sql.allowAddMonths: true |
| Miscellaneous | PYSPARK_ROW_FIELD_SORTING_ENABLED | spark.executorEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: true; spark.yarn.appMasterEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: true; spark.kubernetes.driverEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: true |
| Miscellaneous | PYSPARK_ROW_FIELD_SORTING_DISABLED | spark.executorEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: false; spark.yarn.appMasterEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: false; spark.kubernetes.driverEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: false |
| Miscellaneous | PYSPARK_ROW_SCHEMA_CORRUPTION_CHECK_DISABLED | spark.kubernetes.driverEnv.PYSPARK_CHECK_ROW_SCHEMA_CORRUPTION: false; spark.yarn.appMasterEnv.PYSPARK_CHECK_ROW_SCHEMA_CORRUPTION: false; spark.executorEnv.PYSPARK_CHECK_ROW_SCHEMA_CORRUPTION: false |
| Miscellaneous | SPARK_KYRO_REFERENCE_TRACKING_DISABLED | spark.kryo.referenceTracking: false |
| Miscellaneous | GEOSPARK | spark.foundry.build.stats.enabled: false |
| Miscellaneous | SPARK_REFERENCE_TRACKING_DISABLED | spark.cleaner.referenceTracking: false |
| Arrow | ARROW_ENABLED | spark.sql.execution.arrow.enabled: true; spark.sql.execution.arrow.pyspark.enabled: true; spark.sql.execution.arrow.sparkr.enabled: true; spark.sql.execution.arrow.fallback.enabled: true; spark.sql.execution.arrow.pyspark.fallback.enabled: true |
| Arrow | ARROW_DISABLED | spark.sql.execution.arrow.enabled: false; spark.sql.execution.arrow.pyspark.enabled: false; spark.sql.execution.arrow.sparkr.enabled: false |
| Kubernetes | KUBERNETES_CPU_ARCHITECTURE_OVERRIDE_AMD64 | N/A |
| Kubernetes | KUBERNETES_CPU_ARCHITECTURE_OVERRIDE_ARM64 | N/A |