Apache Spark setup

Halvade requires Apache Spark. We assume that $installpath is the folder where you already installed Hadoop YARN, which will serve as the master for the Spark cluster. Run these commands as the user that will run the Spark jobs. As with Hadoop, we advise creating a user specifically for this purpose; we created the hadoop user for Hadoop and will use it for Spark jobs as well. The commands below install the Spark version set in SPARK_V (3.1.1 here), using the build for Hadoop 2.7 and later.

# run this as the user that will run spark which is `hadoop` in this tutorial
installpath=/hadoop
cd $installpath
SPARK_V=3.1.1
wget https://archive.apache.org/dist/spark/spark-${SPARK_V}/spark-${SPARK_V}-bin-hadoop2.7.tgz
tar -xf spark-${SPARK_V}-bin-hadoop2.7.tgz
ln -s spark-${SPARK_V}-bin-hadoop2.7 spark
rm -f spark-${SPARK_V}-bin-hadoop2.7.tgz

# add environment to the bashrc
echo "export HADOOP_CONF_DIR=$installpath/hadoop/etc/hadoop" >> ~/.bashrc
echo "export SPARK_HOME=$installpath/spark" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=$installpath/hadoop/lib/native:\$LD_LIBRARY_PATH" >> ~/.bashrc
echo "export PATH=\$SPARK_HOME/bin:\$PATH" >> ~/.bashrc

# log in again or source ~/.bashrc to load the environment variables,
# then create the configuration file from the template
source ~/.bashrc
mv "$SPARK_HOME/conf/spark-defaults.conf.template" "$SPARK_HOME/conf/spark-defaults.conf"
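
Before moving on to the configuration, it can help to confirm that the installation steps above took effect. The following is a minimal sketch, not part of Halvade or Spark: check_spark_install is a hypothetical helper name, and it only checks that SPARK_HOME is set, that spark-submit is executable, and that spark-defaults.conf exists.

```shell
# Hypothetical helper: report whether the Spark install from the steps above looks usable.
check_spark_install() {
    # SPARK_HOME is only set after ~/.bashrc has been re-read
    if [ -z "${SPARK_HOME:-}" ]; then
        echo "SPARK_HOME is not set; run 'source ~/.bashrc' or log in again" >&2
        return 1
    fi
    # the unpacked distribution ships spark-submit under bin/
    if [ ! -x "$SPARK_HOME/bin/spark-submit" ]; then
        echo "no executable spark-submit under $SPARK_HOME/bin" >&2
        return 1
    fi
    # the configuration file is created from the template in the last install step
    if [ ! -f "$SPARK_HOME/conf/spark-defaults.conf" ]; then
        echo "missing $SPARK_HOME/conf/spark-defaults.conf" >&2
        return 1
    fi
    echo "Spark install looks complete in $SPARK_HOME"
}
```

If the check passes, running `check_spark_install && spark-submit --version` should additionally print the installed Spark version.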

Configuration

To configure the default values in Spark, edit the $SPARK_HOME/conf/spark-defaults.conf file. All of these values can be overridden at runtime in the spark-submit command, but it is a good idea to set sensible defaults for when other Spark jobs are started. Replace hdfs://masternode:9000/ with the URI of your own HDFS namenode.

spark.master  yarn

# add if you want logs
spark.eventLog.enabled  true
# set directory, can be on hdfs (hdfs://masternode:9000/path/to/logs) or any other distributed fs (/path/to/logs)
spark.eventLog.dir hdfs://masternode:9000/spark-logs

# add these if you want a history server
spark.history.provider            org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.update.interval  20s
spark.history.ui.port             18080
# point to where logs are stored, can be on hdfs (hdfs://masternode:9000/path/to/logs) or any other distributed fs (/path/to/logs)
spark.history.fs.logDirectory     hdfs://masternode:9000/spark-logs
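
To see which default a job will pick up before overriding it at runtime, a property can be read back from the file. This is a rough sketch under stated assumptions: spark_default is a hypothetical helper name, and it only handles lines where the property and its value are separated by whitespace, as in the listing above.

```shell
# Hypothetical helper: print the value of one property from spark-defaults.conf.
# Usage: spark_default <property> [path-to-conf]
spark_default() {
    local key=$1
    local conf=${2:-$SPARK_HOME/conf/spark-defaults.conf}
    # Match the property name in the first column, drop it, and print the rest.
    # Comment lines start with '#', so they never match a property name.
    awk -v k="$key" '$1 == k { $1 = ""; sub(/^[ \t]+/, ""); print; exit }' "$conf"
}
```

For example, `spark_default spark.eventLog.dir` prints the configured log directory. A single job can then override a default with the `--conf` option of spark-submit, for instance `--conf spark.eventLog.enabled=false`, without touching the file.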

If you enable Spark event logging and the history server, create the HDFS log directory if it does not exist yet and start the history server like this:

# create the directory on hdfs if needed (-p makes this a no-op when it already exists)
hdfs dfs -mkdir -p /spark-logs

# start history server like this
$SPARK_HOME/sbin/start-history-server.sh
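
Once started, the history server can be probed through its REST endpoint. The sketch below assumes the server runs on the local machine and listens on port 18080 as configured above; history_server_up is a hypothetical helper name and requires curl.

```shell
# Hypothetical helper: succeed if the history server answers on its REST endpoint.
# Usage: history_server_up [host] [port]
history_server_up() {
    local host=${1:-localhost}
    local port=${2:-18080}
    # --fail turns HTTP errors into a non-zero exit status; the response body is discarded
    curl --silent --fail --max-time 5 --output /dev/null \
        "http://${host}:${port}/api/v1/applications"
}
```

A typical use is `history_server_up && echo "history server is running"`. The server can be stopped again with $SPARK_HOME/sbin/stop-history-server.sh.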

Run Spark

Now that Spark has been installed, you can run it in two ways: interactively with spark-shell, or as a batch job with spark-submit. Both use the YARN cluster by default, because spark.master is set to yarn in the configuration file.
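
As a quick smoke test of the batch route, the SparkPi example bundled with the Spark distribution can be submitted. This is a minimal sketch: run_spark_pi is a hypothetical helper name, and it assumes that SPARK_HOME is set, that spark-submit is on the PATH, and that the examples jar sits under $SPARK_HOME/examples/jars as in the binary distribution.

```shell
# Hypothetical helper: submit the bundled SparkPi example to the default master.
# Usage: run_spark_pi [slices]
run_spark_pi() {
    local slices=${1:-10}
    local jar="" candidate
    # the jar name embeds the Scala and Spark versions, so match it with a glob
    for candidate in "$SPARK_HOME"/examples/jars/spark-examples_*.jar; do
        if [ -f "$candidate" ]; then
            jar=$candidate
            break
        fi
    done
    if [ -z "$jar" ]; then
        echo "examples jar not found under $SPARK_HOME/examples/jars" >&2
        return 1
    fi
    # no --master option: the value from spark-defaults.conf (yarn) is used
    spark-submit --class org.apache.spark.examples.SparkPi "$jar" "$slices"
}
```

Calling `run_spark_pi 10` submits the job in the default client deploy mode, so the estimate of pi is printed to the console. For interactive work, running `spark-shell` starts a Scala shell against the same YARN cluster.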