Apache Spark setup
To use Halvade, Apache Spark needs to be installed. We assume that $installpath is the folder where you already installed Hadoop YARN, which will be used as the master for the Spark cluster. Run these commands as the user that will run the Spark jobs. As with Hadoop, we advise creating a user specifically for this purpose; for Hadoop we created the hadoop user, and we will use this user for Spark jobs as well. The commands install Spark 3.1.1 (the value of SPARK_V) prebuilt for Hadoop 2.7 and later.
    # run this as the user that will run spark, which is `hadoop` in this tutorial
    installpath=/hadoop
    cd "$installpath"
    SPARK_V=3.1.1
    wget https://archive.apache.org/dist/spark/spark-${SPARK_V}/spark-${SPARK_V}-bin-hadoop2.7.tgz
    tar -xf spark-${SPARK_V}-bin-hadoop2.7.tgz
    ln -s spark-${SPARK_V}-bin-hadoop2.7 spark
    rm -f spark-${SPARK_V}-bin-hadoop2.7.tgz

    # add environment to the bashrc
    echo "export HADOOP_CONF_DIR=$installpath/hadoop/etc/hadoop" >> ~/.bashrc
    echo "export SPARK_HOME=$installpath/spark" >> ~/.bashrc
    echo "export LD_LIBRARY_PATH=$installpath/hadoop/lib/native:\$LD_LIBRARY_PATH" >> ~/.bashrc
    echo "export PATH=\$SPARK_HOME/bin:\$PATH" >> ~/.bashrc

    # relog to load the env variables, or source ~/.bashrc
    source ~/.bashrc

    # create the configuration file from the template
    mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
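Before continuing, it can help to confirm that the environment variables written to ~/.bashrc are visible in the current shell. The following is a minimal sketch, not part of the Halvade tooling: the function name `check_spark_env` is made up for this illustration, and it only checks that SPARK_HOME and HADOOP_CONF_DIR are set and point to existing directories.

```shell
# Hypothetical helper: verify that the variables exported in ~/.bashrc
# are set in this shell and point to existing directories.
check_spark_env() {
    status=0
    for var in SPARK_HOME HADOOP_CONF_DIR; do
        # Look up the value of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ] || [ ! -d "$value" ]; then
            echo "$var is unset or not a directory: '$value'" >&2
            status=1
        fi
    done
    return $status
}
```

If the check passes, running `check_spark_env && spark-submit --version` should additionally print the version of the Spark installation found on the PATH.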
Configuration
To configure the default values in Spark, edit the $SPARK_HOME/conf/spark-defaults.conf file. All of these values can be overridden at runtime in the spark-submit command, but it is a good idea to set defaults for when other Spark jobs are started. Replace hdfs://masternode:9000/ with the appropriate HDFS URI for your cluster.
    spark.master yarn

    # add if you want logs
    spark.eventLog.enabled true
    # set directory, can be on hdfs (hdfs://masternode:9000/path/to/logs) or any other distributed fs (/path/to/logs)
    spark.eventLog.dir hdfs://masternode:9000/spark-logs

    # add these if you want a history server
    spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
    spark.history.fs.update.interval 20s
    spark.history.ui.port 18080
    # point to where logs are stored, can be on hdfs (hdfs://masternode:9000/path/to/logs) or any other distributed fs (/path/to/logs)
    spark.history.fs.logDirectory hdfs://masternode:9000/spark-logs
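As an illustration of overriding these defaults at runtime, the sketch below wraps spark-submit in a small function that passes `--conf` options for a single job. The function name `submit_with_overrides`, the `SPARK_SUBMIT` variable (which lets you substitute `echo` for a dry run), and the chosen values are assumptions made for this example; `spark.executor.memory` is a standard Spark property that does not appear in the file above.

```shell
# Hypothetical wrapper: submit one job with per-job overrides of spark-defaults.conf.
# Set SPARK_SUBMIT=echo to print the arguments instead of submitting.
submit_with_overrides() {
    # $1: application jar, $2: main class, remaining arguments go to the application
    app_jar=$1
    main_class=$2
    shift 2
    ${SPARK_SUBMIT:-spark-submit} \
        --master yarn \
        --deploy-mode cluster \
        --conf spark.eventLog.enabled=false \
        --conf spark.executor.memory=4g \
        --class "$main_class" \
        "$app_jar" "$@"
}
```

Options given with `--conf` on the command line take precedence over the values in spark-defaults.conf, so the file only needs to hold the settings you want for most jobs.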
If you decide to enable Spark event logging and the history server, create the HDFS log directory if needed and start the history server like this:
    # create the directory on hdfs if needed (-p does not fail if it already exists)
    hdfs dfs -mkdir -p /spark-logs

    # start the history server
    $SPARK_HOME/sbin/start-history-server.sh
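To keep the created directory in sync with the configuration, the log path can be read from spark-defaults.conf rather than typed by hand. This is a rough sketch under the assumption that each property is written as `key value` separated by whitespace, as in the file above; the helper names `spark_conf_value` and `create_event_log_dir` and the `HDFS` override variable are made up for this example.

```shell
# Hypothetical helper: print the value of a property from a
# whitespace-separated spark-defaults.conf style file.
spark_conf_value() {
    # $1: property name, $2: configuration file
    awk -v key="$1" '$1 == key { print $2; exit }' "$2"
}

# Hypothetical helper: create the directory named by spark.eventLog.dir.
# Set HDFS=echo to print the command instead of running it.
create_event_log_dir() {
    conf_file=${1:-$SPARK_HOME/conf/spark-defaults.conf}
    log_dir=$(spark_conf_value spark.eventLog.dir "$conf_file")
    if [ -z "$log_dir" ]; then
        echo "spark.eventLog.dir is not set in $conf_file" >&2
        return 1
    fi
    ${HDFS:-hdfs} dfs -mkdir -p "$log_dir"
}
```

Calling `create_event_log_dir` without arguments uses the configuration file under $SPARK_HOME; comment lines are skipped because their first field is `#` and never matches a property name.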
Run Spark
Now that Spark has been installed, you can run Spark in two ways: interactively with spark-shell, or as a batch job with spark-submit. Both use the YARN cluster by default, because spark.master is set to yarn in the configuration file.
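A quick way to try the batch route is to submit the SparkPi example that ships with the Spark distribution. The sketch below assumes the examples jar sits under $SPARK_HOME/examples/jars, which is where the prebuilt packages place it; the function name `run_spark_pi` and the `SPARK_SUBMIT` dry-run variable are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: submit the bundled SparkPi example to YARN.
# Set SPARK_SUBMIT=echo to print the arguments instead of submitting.
run_spark_pi() {
    # $1: number of partitions passed to SparkPi (defaults to 10)
    jar=""
    # The jar name contains the Scala and Spark versions, so match it with a glob.
    for candidate in "$SPARK_HOME"/examples/jars/spark-examples_*.jar; do
        if [ -f "$candidate" ]; then
            jar=$candidate
            break
        fi
    done
    if [ -z "$jar" ]; then
        echo "no spark-examples jar found under $SPARK_HOME/examples/jars" >&2
        return 1
    fi
    ${SPARK_SUBMIT:-spark-submit} \
        --class org.apache.spark.examples.SparkPi \
        --deploy-mode cluster \
        "$jar" "${1:-10}"
}
```

In cluster deploy mode the driver runs inside YARN, so the result ends up in the application logs rather than in the terminal; switching to `--deploy-mode client` should print it locally. For the interactive route, starting `spark-shell` with no extra options picks up the same spark.master setting.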