Apache Spark

A cluster computing framework for large-scale data processing

Apache Spark requires an environment module

In order to use Apache Spark, you must first load the appropriate environment module:

module load spark

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark has become a popular tool for large-scale data analytics.

Using Apache Spark on RCC Resources

Spark supports the Python (PySpark), R (SparkR), Java, and Scala programming languages. Official examples are available showing how to use Spark in Python, Java, and Scala.

Below is an example Slurm script that submits a Spark job; the script must be saved with a .sh extension. You can also download the example file here.

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 5

# Load the spark module
module load spark

# Start the spark cluster; the module sets $MASTER to the master's URL
echo $MASTER

# (2 nodes * 3 tasks-per-node * 5 cpus-per-task) = 30 total cores
spark-submit --master $MASTER --total-executor-cores 30 --executor-cores 5 \
    your_spark_app.py   # replace with the path to your Spark application

The spark module will set up the necessary environment variables and the script will set up the spark cluster within the job allocation.
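Assuming the module behaves as described, you can sanity-check the environment before submitting. $MASTER is the variable echoed in the script above; checking spark-submit is a generic verification step, not an RCC-specific instruction.

```shell
# Verify the Spark environment after loading the module.
module load spark
echo "Master URL: $MASTER"   # set by the module within a job allocation
spark-submit --version       # confirms spark-submit is on your $PATH
```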

Submit your script using the following command, replacing YOURSCRIPT with the name of your script file:

$ sbatch YOURSCRIPT.sh
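After submission, standard Slurm commands can be used to track the job. The job ID below is a placeholder reported by sbatch; slurm-&lt;jobid&gt;.out is Slurm's default output file name.

```shell
# After sbatch reports "Submitted batch job <jobid>", monitor the job:
squeue -u $USER        # list your pending and running jobs
scancel <jobid>        # cancel the job if needed
cat slurm-<jobid>.out  # driver output is written here by default
```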