A cluster computing framework for large-scale data processing
Apache Spark requires an environment module. Before using Spark, you must first load the appropriate module:
```sh
module load spark
```
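Once the module is loaded, you can sanity-check that Spark's command-line tools are available (assuming the module places Spark's `bin` directory on your `PATH`; the output will depend on the installed version):

```sh
spark-submit --version
```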
Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark has become a popular tool for data analytics.
## Using Apache Spark on RCC Resources
Spark supports the Python (PySpark), R (SparkR), Java, and Scala programming languages. A number of official examples are available showing how to use Spark in Python, Java, and Scala.
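As a quick smoke test, Spark distributions also ship a `run-example` wrapper around `spark-submit` for the bundled example programs. The snippet below is a minimal sketch, assuming the spark module puts Spark's `bin` directory on your `PATH`; for anything beyond a short test, submit your work through Slurm as shown below.

```sh
# Run the bundled SparkPi example (estimates pi) with 100 partitions.
run-example SparkPi 100
```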
Below is an example Slurm script to submit a Spark job. The script must be saved with the .sh extension. First, download the example file. The spark module will set up the necessary environment variables, and the spark-start.sh script will set up the Spark cluster within the job allocation.
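If you do not have the example file at hand, the following is a minimal sketch of what such a script might look like. The resource requests and the application name (`wordcount.py`) are placeholders, and the `$SPARK_MASTER` variable is an assumption about what `spark-start.sh` exports; consult the downloaded example for the exact names used on RCC systems.

```sh
#!/bin/bash
#SBATCH --job-name=spark_test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=4G
#SBATCH --time=01:00:00

# Load Spark; this sets up the necessary environment variables.
module load spark

# Start a standalone Spark cluster inside this job allocation.
spark-start.sh

# Submit the application to the cluster started above. The master URL
# variable name is an assumption; check the downloaded example script.
spark-submit --master "$SPARK_MASTER" wordcount.py
```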
Submit your script using the following command, replacing YOURSCRIPT with the name of your script file:
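```sh
# Submit the job to the Slurm scheduler.
sbatch YOURSCRIPT
```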