What is Spark?
Spark is a distributed computing system that Foundry uses to run data transformations at scale. It was originally created in 2009 by a team of researchers at UC Berkeley's AMPLab and was donated to the Apache Software Foundation in 2013. Foundry lets you run SQL, Python, Java, and Mesa transformations (Mesa is a proprietary Java-based DSL) on large amounts of data, with Spark as the foundational computation layer.
How does Spark work?
Spark processes data by distributing a job across many machines at once, which allows many users' jobs to run quickly and in parallel across projects. Its execution model generalizes MapReduce: a job is broken into tasks that run simultaneously on different machines. Those machines take one of two roles: a driver, which plans the job and coordinates the work, and executors, which run the tasks and report their results back to the driver, as sketched below.
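To make the division of labor concrete, here is a minimal PySpark job. Everything before an action runs on the driver, which only builds a plan; the call to collect() is what sends tasks to the executors. The app name and numbers here are arbitrary illustration.

```python
from pyspark.sql import SparkSession, functions as F

# The driver process runs this script and builds an execution plan.
spark = SparkSession.builder.appName("driver-executor-demo").getOrCreate()

df = spark.range(1_000_000)                      # dataset of ids 0..999,999
squared = df.withColumn("sq", F.col("id") ** 2)  # transformation: lazily planned, nothing runs yet

# An action triggers real work: executors compute their partitions in
# parallel, and the driver assembles the final result.
result = squared.agg(F.sum("sq").alias("total")).collect()
print(result[0]["total"])
```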
How do I allocate resources to a job?
In Foundry, the resources a job receives are controlled through Spark profiles. If a job is running out of memory, we recommend changing one profile at a time: bump EXECUTOR_MEMORY_SMALL to EXECUTOR_MEMORY_MEDIUM, then run the job again before adjusting anything else. This ensures you don't incur unnecessary costs by over-allocating resources to your job.
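As a sketch of how such a profile is applied, here is what that memory bump might look like in a Foundry Python transform. This assumes the transforms.api decorators; the dataset paths and transformation logic are hypothetical placeholders.

```python
from transforms.api import configure, transform_df, Input, Output

# Request one step more executor memory for this transform only.
# Paths and logic below are hypothetical.
@configure(profile=["EXECUTOR_MEMORY_MEDIUM"])
@transform_df(
    Output("/Project/datasets/clean_events"),
    events=Input("/Project/datasets/raw_events"),
)
def clean_events(events):
    # The transformation itself is unchanged; only the profile differs.
    return events.dropDuplicates()
```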
When deciding which profiles to expose to users, we recommend the following scheme (a sketch of the preferred scale-out pattern follows this list):
- Make the small profiles (EXECUTOR_CORES_SMALL, EXECUTOR_MEMORY_SMALL, DRIVER_CORES_SMALL, DRIVER_MEMORY_SMALL, NUM_EXECUTORS_2) available by default.
- NUM_EXECUTORS_32 and EXECUTOR_MEMORY_LARGE (and above) should be available only upon request and approval of that request.
- EXECUTOR_CORES_SMALL should be heavily controlled, because adding cores is a stealth way to increase computing power, and we prefer to funnel users to the NUM_EXECUTORS profiles in almost all cases.
- EXECUTOR_CORES_SMALL and EXECUTOR_MEMORY_MEDIUM should be approved by an administrator.
- Block off EXECUTOR_CORES_EXTRA_SMALL and EXECUTOR_MEMORY_LARGE. A user asking for these is usually a sign of either a poorly optimized job or a genuinely critical workflow.
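To illustrate the scale-out preference above: rather than granting an EXECUTOR_CORES increase, steer users toward a NUM_EXECUTORS profile. This is a hedged sketch reusing NUM_EXECUTORS_32 from the list; the dataset paths and aggregation are hypothetical.

```python
from transforms.api import configure, transform_df, Input, Output

# Scale out (more executors) rather than up (more cores per executor).
# Paths and the aggregation are placeholders.
@configure(profile=["NUM_EXECUTORS_32"])
@transform_df(
    Output("/Project/datasets/event_counts"),
    events=Input("/Project/datasets/raw_events"),
)
def event_counts(events):
    # Partition-parallel work like this benefits from additional executors.
    return events.groupBy("event_type").count()
```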