Hadoop Yarn setup

Halvade runs on the Hadoop Yarn framework with Apache Spark, if Hadoop MapReduce version 2.7 or newer is already installed on your cluster, you can continue to the Hadoop configuration section to make sure the advised configuration is enabled. Halvade uses GATK 4 or later, which requires a specific version of Java, currently version 1.8. To make sure GATK works as expected the correct version of Java needs to be installed on every node in the cluster and set as the default Java instance, in Ubuntu:

1
2
3
sudo apt-get install openjdk-8-jre-headless
# to configure this as the default use
sudo update-alternatives --config java

Automated install with CDH5

For the Hadoop installation on a multi node cluster, we refer to the manual given by Cloudera to install CDH 5 or later and configure the Hadoop cluster. You can find a detailed description online here.

Manual installation

To run Hadoop on one or more nodes, a single node must be set as the master node. This is the node you want to connect to when you start hadoop or spark jobs and in this tutorial the hostname of this node will be masternode. The following instructions are based on this tutorial and can be used for additional information. Hadoop requires ssh and rsync to run, to install these on your system, run these commands (on Ubuntu):

1
sudo apt-get install ssh rsync

It is advised a specific user is created for all YARN/Spark jobs. Here we create the hadoop user, this user should be added on every node in the cluster.

1
2
sudo useradd -s /bin/bash -d /hadoop -U -m  hadoop # where /hadoop is the home directory
sudo passwd hadoop # sets the password for the hadoop user

The next sections will assume that you are logged into a terminal as the hadoop user, you can do this in your current terminal using su hadoop. We need to setup passwordless ssh for this user, as starting the yarn services requires this to start the services on the slave nodes.

1
2
3
4
ssh-keygen -b 4096
cat ~/.ssh/id_rsa.pub  >> ~/.ssh/authorized_keys
# on all other slave nodes:
ssh masternode 'cat ~/.ssh/id_rsa.pub' >> ~/.ssh/authorized_keys

Next we download and unzip the Hadoop distribution (here 2.10.1):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
cd # change working dir to hadoops home directory (/hadoop in the tutorial)
HADOOP_V=2.10.1
wget https://downloads.apache.org/hadoop/common/hadoop-${HADOOP_V}/hadoop-${HADOOP_V}.tar.gz
tar -xzf hadoop-${HADOOP_V}.tar.gz
ln -s hadoop-${HADOOP_V} hadoop # create a link hadoop to the actual folder, not necessary but makes it easier to switch versions later on.
rm -f hadoop-${HADOOP_V}.tar.gz

# add environment variables to the hadoop users ~/.bashrc
echo "export HADOOP_HOME=/hadoop/hadoop" >> ~/.bashrc
echo "export HADOOP_PREFIX=/hadoop/hadoop" >> ~/.bashrc
echo "export HADOOP_COMMON_HOME=/hadoop/hadoop" >> ~/.bashrc
echo "export HADOOP_HDFS_HOME=/hadoop/hadoop" >> ~/.bashrc
echo "export HADOOP_MAPRED_HOME=/hadoop/hadoop" >> ~/.bashrc
echo "export HADOOP_YARN_HOME=/hadoop/hadoop" >> ~/.bashrc
echo "export YARN_HOME=/hadoop/hadoop" >> ~/.bashrc
echo "export HADOOP_STREAMING=\$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.8.5.jar" >> ~/.bashrc
echo "export PATH=\${PATH}:\${HADOOP_HOME}/bin:\${HADOOP_HOME}/sbin" >> ~/.bashrc

Hadoop configuration

For hadoop Yarn to run we need to set some basic configurations. These files are located in the $HADOOP_HOME/etc/hadoop folder. The next few files needs to be adjusted or updated as required. If the hostnames in configuration files are used (i.e. masternode) but there is no active DNS system on your cluster, you should replace the host names with the ip address in the network.

hadoop-env.sh:

1
2
# your current java homedir can be found with this command (remove the /bin/java): update-alternatives --display java | grep current
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

core-site.xml:

1
2
3
4
5
6
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://masternode:9000</value>
    </property>
</configuration>

The fs.defaultFS can also be set to file:///path/to/dfs/ if you have a shared filesystem on the nodes (i.e. GPFF or NFS) and do not want to use HDFS.

hdfs-site.xml:

This sets the locations the data on HDFS will be stored. If your / is not your main data disk, you should change these directories. This should only be set if you use HDFS.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
<configuration>
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>/hadoop/data/nameNode</value>
        </property>

        <property>
          <name>dfs.datanode.data.dir</name>
          <value>/hadoop/data/dataNode</value>
        </property>

        <property>
          <name>dfs.replication</name>
          <value>3</value>
        </property>
</configuration>

yarn-site.xml:

The spark_shuffle service allows to dynamically start all executors that fit in the resources of the nodes. It is important to set the correct yarn_shuffle.jar file in the option yarn.nodemanager.aux-services.spark_shuffle.classpath, this can be found with ls $SPARK_HOME/yarn/spark*-yarn-shuffle.jar.

These are configurations for a 16 CPU machine with 64 GBytes of RAM, adjust accordingly.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
<configuration>
  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>masternode</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>

        <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name>
    <value>/hadoop/spark/yarn/spark-3.0.1-yarn-shuffle.jar</value>  <!-- set to correct yarn-shuffle.jar file: ls $SPARK_HOME/yarn/spark*-yarn-shuffle.jar -->
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>65536</value>
  </property>

  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>65536</value>
  </property>

  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>
  </property>

  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
  </property>

  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>

  <property>
    <name>yarn.app.mapreduce.shuffle.log.separate</name>
    <value>true</value>
  </property>

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/hadoop/data/cache/${user.name}/nm-local-dir</value>
  </property>

  <property>
    <name>yarn.log.server.url</name>
    <value>http://masternode:19888/jobhistory/logs</value>
  </property>

    <property>
     <name>yarn.application.classpath</name>
     <value>$SPARK_HOME/jars/*,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
     </value>
  </property>
</configuration>

slaves:

1
2
3
masternode
slavenode1
salvenode2

This contains a list of slaves that will run the node manager service, these are the worker nodes. In hadoop 3.0.3 and newer this is replaced by a workers file.

Copy the configuration

This hadoop folder with the edited configuration has to be copied to the other slavenodes. This can be done as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
# on the masternode
HADOOP_V=2.10.1
tar cvzf hadoop-${HADOOP_V}.tar.gz hadoop-${HADOOP_V}
scp hadoop-${HADOOP_V}.tar.gz slavenode1:/hadoop
scp hadoop-${HADOOP_V}.tar.gz slavenode2:/hadoop # and any other nodes
rm hadoop-${HADOOP_V}.tar.gz # removes the tar if you copied it to all slaves

# on every slavenode run this as the hadoop user
cd /hadoop
HADOOP_V=2.10.1
tar -xzf hadoop-${HADOOP_V}.tar.gz
ln -s hadoop-${HADOOP_V} hadoop
rm hadoop-${HADOOP_V}.tar.gz

Start the services

On the masternode you can now start the hadoop services as follows, this assumes that the $HADOOP_HOME/sbin folder is added to your $PATH else use the full path to the script:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# before you start the dfs service for the first time the hdfs needs to be formatted like this:
hdfs namenode -format

# start dfs / only if you use hdfs
start-dfs.sh

# start yarn
start-yarn.sh

# start the history server:
mr-jobhistory-daemon.sh start historyserver

To stop the services use these commands:

1
2
3
stop-dfs.sh # only if you use hdfs
stop-yarn.sh
mr-jobhistory-daemon.sh stop historyserver