# Apache Spark

A cluster-computing framework for large-scale data processing.
**Apache Spark requires an environment module.**

To use Apache Spark, you must first load the appropriate environment module:

```sh
module load spark
```
Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark has become a popular tool for data analytics.
## Using Apache Spark on RCC Resources
Spark supports the Python (PySpark), R (SparkR), Java, and Scala programming languages. A number of official examples are available showing how to use Spark in Python, Java, and Scala.
Below is an example Slurm script for submitting a Spark job. The script must be saved with the `.sh` extension. First, download the example file `pi.py` here.
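A minimal sketch of such a submission script is shown below. The job name, task count, walltime, and partition placeholder are illustrative values you should adjust for your allocation, and the `$SPARK_MASTER` variable name is an assumption about what `spark-start.sh` exports on this system:

```sh
#!/bin/bash
#SBATCH --job-name="spark_pi"
#SBATCH -n 8                    # number of tasks; Spark workers run inside this allocation
#SBATCH -t 00:10:00             # walltime
#SBATCH -p YOUR_PARTITION       # placeholder -- replace with your partition name

# Load the Spark environment module.
module load spark

# Start a standalone Spark cluster within the job allocation.
# Assumption: spark-start.sh exports the master URL as $SPARK_MASTER.
. spark-start.sh

echo "Spark master: $SPARK_MASTER"

# Run the downloaded example against the in-job cluster.
spark-submit --master "$SPARK_MASTER" pi.py
```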
The `spark` module sets up the necessary environment variables, and the `spark-start.sh` script sets up the Spark cluster within the job allocation.
Submit your script using the following command, replacing `YOURSCRIPT` with the name of your script file:
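For example, if your submission script is named `YOURSCRIPT.sh`, the job is submitted with `sbatch`:

```sh
# Queue the Spark job with the Slurm scheduler.
sbatch YOURSCRIPT.sh
```

`sbatch` prints the job ID on success; you can then monitor the job with `squeue -u $USER`.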