Apache Spark

A cluster computing framework for large-scale data processing

Apache Spark requires an environment module

In order to use Apache Spark, you must first load the appropriate environment module:

module load spark

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark has become a popular tool for large-scale data analytics.

Using Apache Spark on RCC Resources

Spark supports the Python (PySpark), R (SparkR), Java, and Scala programming languages. Official examples are available showing how to use Spark in Python, Java, and Scala.

Below is an example Slurm script that submits a Spark job; the script must be saved with a .sh extension. You can also download the example file here.

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 5

# Load the spark module
module load spark

# Start the spark cluster; the module sets $MASTER to the master's URL
echo $MASTER

# (2 nodes * 3 tasks-per-node * 5 cpus-per-task) = 30 total cores
spark-submit --master $MASTER --total-executor-cores 30 --executor-cores 5 \
    your_spark_app.py   # replace with the path to your Spark application

The spark module will set up the necessary environment variables and the script will set up the spark cluster within the job allocation.
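Assuming the module behaves as described, you can sanity-check the environment before submitting. $MASTER is the variable echoed in the script above; checking spark-submit is a generic verification step, not an RCC-specific instruction.

```shell
# Verify the Spark environment after loading the module.
module load spark
echo "Master URL: $MASTER"   # set by the module within a job allocation
spark-submit --version       # confirms spark-submit is on your $PATH
```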

Submit your script using the following command, replacing YOURSCRIPT with the name of your script file:

$ sbatch YOURSCRIPT.sh
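After submission, standard Slurm commands can be used to track the job. The job ID below is a placeholder reported by sbatch; slurm-&lt;jobid&gt;.out is Slurm's default output file name.

```shell
# After sbatch reports "Submitted batch job <jobid>", monitor the job:
squeue -u $USER        # list your pending and running jobs
scancel <jobid>        # cancel the job if needed
cat slurm-<jobid>.out  # driver output is written here by default
```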