Spark acceleration is a technique that leverages low-level hardware optimizations to improve the performance of Spark jobs. By using platform-specific features, native acceleration aims to significantly reduce the time needed to process large-scale data workloads, resulting in faster job execution and improved resource utilization.
Velox ↗ is a reusable, high-performance, low-level data processing library that provides a set of primitives for building higher-level data processing systems. Foundry uses Velox to accelerate Spark jobs.
Read more about native acceleration in Foundry.
Spark acceleration can be used on any existing Spark pipeline. You do not need to make any changes to your logic.
To use native acceleration in your Python transform pipeline, you must configure your transform to use the VELOX backend, as shown in the following code snippet:
from transforms.api import configure, ComputeBackend, Input, Output, transform_df


@configure(
    ["EXECUTOR_MEMORY_MEDIUM", "EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH"],
    backend=ComputeBackend.VELOX,
)
@transform_df(
    Output('/Project/folder/output'),
    source_df=Input('/Project/folder/input'),
)
def compute(source_df):
    ...
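Because the backend is set on the @configure decorator, the body of the transform does not change. As an illustration, the same configuration applied to a transform written with the lower-level @transform decorator might look like the following sketch; the dataset paths and the pass-through logic are placeholders:

from transforms.api import configure, ComputeBackend, Input, Output, transform


@configure(
    ["EXECUTOR_MEMORY_MEDIUM", "EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH"],
    backend=ComputeBackend.VELOX,
)
@transform(
    out=Output('/Project/folder/output'),
    source=Input('/Project/folder/input'),
)
def compute(out, source):
    # Existing transform logic is unchanged; only the @configure decorator above was added.
    out.write_dataframe(source.dataframe())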
To optimize your natively accelerated Spark project, start by using the EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH setting for off-heap memory. This memory is used by Velox, which handles some tasks outside the JVM. Observe the performance, and adjust the off-heap memory up or down as needed.
To use a fractional off-heap profile, you must also set an EXECUTOR_MEMORY_X profile. Your job likely already has this.
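For example, if monitoring shows that Velox needs more off-heap memory overall, one option is to keep the fraction profile and raise the base executor memory profile it applies to. A sketch, assuming an EXECUTOR_MEMORY_LARGE profile is enabled for your project (profile availability varies):

from transforms.api import configure, ComputeBackend, Input, Output, transform_df


@configure(
    # EXECUTOR_MEMORY_LARGE is an assumed profile name; substitute whichever
    # executor memory profile is enabled for your project. The fractional
    # off-heap profile must always be paired with an EXECUTOR_MEMORY_X profile.
    ["EXECUTOR_MEMORY_LARGE", "EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH"],
    backend=ComputeBackend.VELOX,
)
@transform_df(
    Output('/Project/folder/output'),
    source_df=Input('/Project/folder/input'),
)
def compute(source_df):
    ...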