Hadoop2 Cluster Setup
This document outlines the steps to set up a Hadoop2 cluster.
Table of Contents
Pre-requisites:
- Create user and group (hadoop) on all hosts
- Setup SSH for hadoop user
- Download and install Java
- Download hadoop
- Update bashrc file for hadoop user to set required path variables
- Update Hadoop Configuration files
- Create folder structure and copy the hadoop folder to all the hosts
- Format name node
- Start HDFS/YARN
Cluster Setup Details:
Host Name | Configuration | Services
nn1.hadoop.com | VM, CentOS 6.2, RAM: 3.5 GB | Name Node, YARN Resource Manager, Kerberos KDC server
dn1.hadoop.com | VM, CentOS 6.2, RAM: 2 GB | Secondary Name Node, History Server, Data Node, Node Manager
dn2.hadoop.com | VM, CentOS 6.2, RAM: 2 GB | Data Node, Node Manager
For VM setup, please refer to the VM Setup for Hadoop document.
Pre-requisites:
1. Create user and group (hadoop) on all hosts:
groupadd hadoop (create the group)
adduser hadoop (create the user; alternatively run "sudo useradd hadoop -g hadoop", which creates the user hadoop and adds it to the group hadoop as well)
passwd hadoop (set the password for user hadoop)
sudo useradd user1 -g hadoop (create an additional user, e.g. user1, in the hadoop group if needed)
(or gpasswd -a hadoop root to add the hadoop user to the root group)
To change the primary group of an existing user: usermod -g hadoop user1
Add the hadoop user to the sudoers file to grant sudo privileges (this helps with the many activities that need to be performed as root).
visudo (in this file, under the root entry, add the user name, e.g. "user1 ALL=(ALL) ALL", to put the user in the sudoers list)
[root@nn1 ~]# visudo
Add entries like the ones below:
## Allow root to run any commands anywhere
root ALL=(ALL) ALL
user1 ALL=(ALL) NOPASSWD: ALL
hadoop ALL=(ALL) NOPASSWD: ALL
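Since the group, user and sudoers entries are needed on every host, you can either repeat the above commands on each node or drive them from nn1 with a small loop. A minimal sketch, assuming root SSH access to dn1 and dn2 (you will be prompted for each password at this stage):
for i in dn1 dn2
do ssh root@$i "hostname -f; groupadd hadoop; useradd hadoop -g hadoop"
done
Then set the hadoop password and the sudoers entry on each host as shown above.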
2. Setup SSH for hadoop user:
As the hadoop user, generate a key pair:
# ssh-keygen -t rsa (hit Enter at all prompts to keep the default path and an empty passphrase)
# cd .ssh
# cat id_rsa.pub >> authorized_keys (set up passwordless SSH to the same host)
Now copy the public key to all hosts:
# ssh-copy-id -i ~/.ssh/id_rsa.pub dn1
# ssh-copy-id -i ~/.ssh/id_rsa.pub dn2
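Once the keys are copied, confirm passwordless SSH works from nn1 before going further; a quick check using the short host names from the table above:
for i in nn1 dn1 dn2
do ssh $i "hostname -f"
done
Each command should print the host name without prompting for a password.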
3. Download and install Java
To check if it is already present, use the "java -version" command.
If not present, you can download it from the following path (this is for JDK 1.7; for any other version, please check the Oracle site http://java.sun.com/javase/downloads/index.jsp):
# wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/7u79-b15/jdk-7u79-linux-x64.rpm"
Or download the RPM on your host machine and copy it to the VM using WinSCP.
Then run:
# rpm -Uvh jdk-7u79-linux-x64.rpm
# alternatives --install /usr/bin/java java /usr/java/latest/bin/java 2
In case you get the error "/lib/ld-linux.so.2: bad ELF interpreter: No such file or directory",
run the following:
# yum -y install glibc.i686
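After installing the RPM, a quick sanity check that the expected JDK is being picked up (the Oracle JDK RPM normally creates the /usr/java/latest symlink used below for JAVA_HOME; verify it on your system):
# java -version
# ls -l /usr/java/latest/bin/java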
4. Download hadoop:
# curl -O "http://www.trieuvan.com/apache/hadoop/common/stable/hadoop-2.7.1.tar.gz"
# tar -xzf hadoop-2.7.1.tar.gz
Copy the content to /opt/hadoop (which we will use as our base directory):
cp -r /home/user1/hadoop-2.7.1/* /opt/hadoop/
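Note that /opt/hadoop must exist and be owned by the hadoop user before the copy above; a minimal sketch for nn1 (run by a sudo-capable user):
sudo mkdir -p /opt/hadoop
sudo chown -R hadoop:hadoop /opt/hadoop
Step 7 below creates the same directory on the data nodes.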
5. Update bashrc file for hadoop user to set required path variables.
export JAVA_HOME=/usr/java/latest/
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_COMMON_HOME=/opt/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME
export HADOOP_HDFS_HOME=$HADOOP_COMMON_HOME
export YARN_HOME=$HADOOP_COMMON_HOME
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
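After editing ~/.bashrc, reload it and confirm the Hadoop binaries are on the PATH:
source ~/.bashrc
hadoop version
echo $HADOOP_COMMON_HOME
hadoop version should print 2.7.1 if the paths are set correctly.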
6. Update Hadoop Configuration files:
For Hadoop2, the configuration files are present under <hadoop_home>/etc/hadoop/.
Update all the configuration files mentioned below, adding the required properties within the <configuration> tags.
e.g. every XML-based configuration file should look like the skeleton below.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
.
.
List of properties….
.
.
</configuration>
Note: Ensure every start tag has a matching closing tag, else an error will be thrown during the service start operation.
Add the following properties in core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://NN1.hadoop.com/</value>
<description>NameNode URI</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/tmp</value>
</property>
———————————
Add the following properties in yarn-site.xml:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>NN1.hadoop.com</value>
<description>The hostname of the ResourceManager</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>shuffle service for MapReduce</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>768</value>
</property>
———————————————
Add the following properties in hdfs-site.xml:
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/hdfs/dn/</value>
<description>DataNode directory for storing data chunks.</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/hdfs/nn</value>
<description>NameNode directory for namespace and transaction logs storage.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Number of replication for each chunk.</description>
</property>
<property>
<name>dfs.secondary.http.address</name>
<value>dn1.hadoop.com:50090</value>
<description>SecondaryNameNodeHostname</description>
</property>
—————-
Add the following property in mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
—————-
Update the slaves file:
dn1.hadoop.com
dn2.hadoop.com
Update the masters file and add the entry for the secondary name node:
dn1.hadoop.com
———
Add the below entry in the secondary name node's hdfs-site.xml (perform this step once all Hadoop binaries and configuration files have been copied to all the hosts):
<property>
<name>dfs.http.address</name>
<value>NN1.hadoop.com:50070</value>
<description> The address and the base port where the dfs namenode web ui will listen on.
If the port is 0 then the server will start on a free port.
</description>
</property>
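Before copying the configuration around, it can help to confirm the edited XML files are well-formed (see the note on closing tags above). A quick check, assuming xmllint (from libxml2) is installed and the files live under /opt/hadoop/etc/hadoop (adjust the path to your layout):
for f in /opt/hadoop/etc/hadoop/*-site.xml
do xmllint --noout $f && echo "$f is well-formed"
done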
————–
7. Create folder structure and copy the hadoop folder to all the hosts.
On nn1:
mkdir -p /data/hdfs/nn
chown -R hadoop:hadoop /data
We need to create a similar folder structure on the remaining hosts as well; we can do that individually on each host, or with a small script as below.
for i in dn1 dn2
do ssh $i “hostname -f; mkdir /opt/hadoop; chown -R hadoop:hadoop /opt/hadoop;chmod -R 770 /opt/hadoop/; mkdir -p /data/hdfs/dn; chown -R hadoop:hadoop /data; chmod -R 770 /data/hdfs/dn; ls -lrt /opt”
done
Once the folder structure has been created, we need to copy the files. We can do that using scp or rsync. I prefer rsync as it compresses the data during transfer, copies all folders and files the first time, and from then onwards only copies the files that have been modified.
e.g.
for i in `cat /opt/hadoop/hadoop-2.7.1/etc/hadoop/slaves`;
do
echo $i; rsync -avxP --exclude=logs /opt/hadoop/hadoop-2.7.1/* $i:/opt/hadoop/hadoop-2.7.1/;
echo ” sync completed for host $i”
done
in secure mode:
chmod 770 /data/
chown -R hdfs:hadoop /data/hdfs/
chown -R hadoop:hadoop /opt/hadoop/
===================================
8. Format name node
Run on nn1:
# hdfs namenode -format
That’s it, we are done with the setup activity, now we can start the services.
9. Start HDFS/YARN
All the scripts to start services are present under <hadoop_home>/sbin/ (here, /opt/hadoop/sbin/).
script to start all services: start-all.sh
script to start all HDFS services: start-dfs.sh
script to start all Yarn services: start-yarn.sh
Similarly, to stop:
stop-all.sh
stop-yarn.sh
stop-dfs.sh
Or start the services manually, one at a time on each host (which I recommend doing for the first time, as it also helps to debug and troubleshoot the startup process; once all the services start successfully, use the scripts above):
Start the HDFS with the following command, run on the designated NameNode:
(run this on nn1)
# hadoop-daemon.sh start namenode
Run the following on the secondary name node (i.e. dn1):
# hadoop-daemon.sh start secondarynamenode
Run the script to start DataNodes on each slave (i.e. dn1, dn2):
# hadoop-daemon.sh start datanode
Start the YARN with the following command, run on the designated ResourceManager (i.e. nn1):
# yarn-daemon.sh start resourcemanager
To start NodeManagers on all slaves (i.e. dn1, dn2):
# yarn-daemon.sh start nodemanager
To start Job history server:
# mr-jobhistory-daemon.sh start historyserver
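After starting the daemons, you can confirm what is actually running on each host with jps (shipped with the JDK; this assumes the JDK bin directory is on each host's PATH); a quick sweep from nn1:
for i in nn1 dn1 dn2
do ssh $i "hostname -f; jps"
done
nn1 should list NameNode and ResourceManager; dn1 should list DataNode, NodeManager, SecondaryNameNode and JobHistoryServer; dn2 should list DataNode and NodeManager.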
—–
To change the log level, update log4j.properties and add the below property.
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
——-
To verify, open the HDFS NameNode web UI:
http://nn1.hadoop.com:50070/
YARN ResourceManager UI:
http://nn1.hadoop.com:8088/
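If the VMs have no browser, the same UIs can be checked from the command line; a minimal check with curl (an HTTP 200 means the UI is up):
# curl -s -o /dev/null -w "%{http_code}\n" http://nn1.hadoop.com:50070/
# curl -s -o /dev/null -w "%{http_code}\n" http://nn1.hadoop.com:8088/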
Some checks:
Verify the Hadoop cluster is functioning properly.
Test 1: create a directory and load files.
# hdfs dfs -mkdir /test
# hdfs dfs -ls /
Create a file "test_file.txt" in the current directory and add a few lines.
Now load it into HDFS:
# hdfs dfs -put test_file.txt /test
To read the file content from HDFS:
# hdfs dfs -cat /test/test_file.txt
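As an additional (optional) check, you can run a small MapReduce job to exercise YARN end to end; a sketch using the examples jar shipped with the 2.7.1 tarball (adjust the jar path to your Hadoop home layout):
# hadoop jar $HADOOP_COMMON_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 10
The job should show up in the ResourceManager UI and print an estimated value of Pi.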
Some advanced HDFS commands:
Health check of HDFS files:
hadoop fsck / -files -blocks -locations
To set/override the replication of an individual file (let's say to 4):
# hadoop dfs -setrep -w 4 /path/to/file
You can also do this recursively. To change replication of entire HDFS to 1:
# hadoop dfs -setrep -R -w 1 /
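To confirm the new replication factor took effect, list the file; the second column of the hdfs dfs -ls output is the replication factor:
# hdfs dfs -ls /path/to/file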
—————-
To clean up the setup (make sure you have taken a backup of the configuration files if needed):
for i in nn1 dn1 dn2
do ssh $i “hostname -f; rm -rf /opt/hadoop; rm -rf /data/hdfs/; rm -rf /data/tmp”
done