Set up 3 VM instances on SoftLayer: master, slave1, slave2.
Please add your public key while provisioning the VMs (slcli vs create ... --key KEYID) so that you can log in from your PC/Mac/laptop without a password. You could use UBUNTU_LATEST_64 or REDHAT_LATEST_64 while provisioning (although Ubuntu is a little cheaper).
Get 2 CPUs, 4G of RAM, 1G private / public NICs and two local disks: 25G and 100G. The idea is to use the 100G disk for HDFS data and the 25G disk for the OS and housekeeping.
- Log in to the VMs (all 3 of them) and update
for instance (add your own private IP addresses): master slave1 slave2
- You need to find out the name of your disk, e.g.
fdisk -l |grep Disk |grep GB
cat /proc/partitions
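As a rough aid for spotting the 100G data disk, the sketch below filters /proc/partitions-style input by size. The function name list_large_devices and the 50 GiB threshold in the usage line are illustrative assumptions, not part of the original steps. Partitions are listed too if they are large enough.

```shell
# List devices of at least MIN_GIB GiB from /proc/partitions-style input.
# /proc/partitions columns: major minor #blocks name (blocks are 1 KiB each).
list_large_devices() {
    local min_gib="$1"
    awk -v min="$min_gib" '
        # Data rows start with a numeric major number; this skips the header.
        $1 ~ /^[0-9]+$/ && NF == 4 {
            gib = $3 / (1024 * 1024)
            if (gib >= min) printf "/dev/%s %d\n", $4, gib
        }'
}
```

Usage: list_large_devices 50 < /proc/partitions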
Assuming your disk is called /dev/xvdc, as it is for me:
mkdir -m 777 /data
mkfs.ext4 /dev/xvdc
- Add this line to
(with the appropriate disk path):
/dev/xvdc /data ext4 defaults,noatime 0 0
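If you script this step, it helps to avoid adding the same entry twice. The following is a minimal sketch. The function name add_mount_entry is hypothetical, and the table file is passed as an argument on the assumption that you point it at your system's filesystem table.

```shell
# Append a mount entry only if the mount point is not already listed.
# Arguments: table file, device, mount point.
add_mount_entry() {
    local table="$1" device="$2" mountpoint="$3"
    # The second whitespace-separated field of each entry is the mount point.
    if awk -v mp="$mountpoint" \
        '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$table"; then
        echo "entry for $mountpoint already present in $table"
    else
        printf '%s %s ext4 defaults,noatime 0 0\n' "$device" "$mountpoint" >> "$table"
    fi
}
```

Running it a second time with the same arguments leaves the file unchanged.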
- Mount your disk
mount /data
- Install JDK
apt-get install -y default-jre default-jdk
yum install -y java-1.6.0-openjdk*
- Install nmon
apt-get install -y nmon
yum install -y nmon
- Download the Hadoop tarball (hadoop-2.6.0.tar.gz) into /usr/local and extract it
cd /usr/local
tar xzf hadoop-2.6.0.tar.gz
mv hadoop-2.6.0 hadoop
- Create a user hadoop (all 3 nodes)
adduser hadoop
- Make sure your key directories have correct permissions
chown -R hadoop:hadoop /data
chown -R hadoop:hadoop /usr/local/hadoop
- Add the public key (in
) to ~hadoop/.ssh/authorized_keys
on master
cat ~/.ssh/ >> ~/.ssh/authorized_keys
- Copy the key pair and authorized hosts from the root user to the Hadoop user
mv ~hadoop/.ssh{,-old}
cp -a ~/.ssh ~hadoop/.ssh
chown -R hadoop ~hadoop/.ssh
- Set up passwordless ssh from hadoop@master to hadoop@master, hadoop@slave1 and hadoop@slave2 by copying the files in
between them. From your workstation, substituting MASTER-IP, etc. with your VM IPs
First accept all keys
for IP in MASTER-IP SLAVE1-IP SLAVE2-IP; do ssh-keyscan -H $IP >> ~/.ssh/known_hosts; done
Then copy the files around, preserving permissions
ssh root@MASTER-IP 'tar -czvpf - /home/hadoop/.ssh' | ssh root@SLAVE1-IP 'cd /; tar -xzvpf -'
ssh root@SLAVE1-IP 'tar -czvpf - /home/hadoop/.ssh' | ssh root@SLAVE2-IP 'cd /; tar -xzvpf -'
- Test your work by trying to ssh as user hadoop from master to master (itself), slave1 and slave2. The administrative scripts use ssh to start the namenode(s), datanodes and nodemanagers on each node, including master. You should issue commands and see output like this:
root@master:~# su - hadoop
hadoop@master:~$ ssh master
The authenticity of host 'master (' can't be established.
ECDSA key fingerprint is 98:da:1e:45:cf:d4:6f:51:b4:c3:ee:fe:d8:1c:ee:ad.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'master,' (ECDSA) to the list of known hosts.
hadoop@master:~$ logout
Connection to master closed.
root@master:~# su - hadoop
hadoop@master:~$ ssh slave1
The authenticity of host 'slave1 (' can't be established.
ECDSA key fingerprint is 98:da:1e:45:cf:d4:6f:51:b4:c3:ee:fe:d8:1c:ee:ad.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave1,' (ECDSA) to the list of known hosts.
hadoop@slave1:~$ logout
Connection to slave1 closed.
hadoop@master:~$ ssh slave2
The authenticity of host 'slave2 (' can't be established.
ECDSA key fingerprint is 17:b3:54:cc:b8:ef:82:9c:e3:cd:43:91:1c:15:c7:e6.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave2,' (ECDSA) to the list of known hosts.
hadoop@slave2:~$ logout
Connection to slave2 closed.
Do this step to avoid problems when starting the cluster; it also adds the slave nodes to the known hosts.
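Once the host keys have been accepted, the same check can be repeated without prompts. This is a minimal sketch, with check_passwordless_ssh as a hypothetical helper name. It relies on the standard OpenSSH options BatchMode and ConnectTimeout, so ssh fails instead of asking for a password.

```shell
# Report whether each host accepts a key-based login for the current user.
# Returns non-zero if any host fails.
check_passwordless_ssh() {
    local host failed=0
    for host in "$@"; do
        # BatchMode=yes disables password and host-key prompts.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

Usage, as user hadoop on master: check_passwordless_ssh master slave1 slave2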
- From now on, you're working as user hadoop.
su - hadoop
- Add this line to the end of
in hadoop's home directory:
export PATH=$PATH:/usr/local/hadoop/bin
- Source these changes into the open shell
source .profile
- Go to the Hadoop configuration directory
cd /usr/local/hadoop/etc/hadoop
- Change
files (on the master node only)
In the ./masters
file, list your master by name
- In the
file, list your slaves by name, one per line.
We need to edit the following configuration files in /usr/local/hadoop/etc/hadoop
(Add the Java home for your freshly installed Java, e.g. on Ubuntu):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
(Add the java home here too)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
Add this configuration, making sure to use the name of the master node in the value.
The dfs.replication value tells Hadoop how many copies of the data it should make. With 3 nodes, we can set it to 3.
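HDFS stores at most one replica of a block per DataNode, so the replication factor only takes full effect if at least that many DataNodes are running. The sketch below compares a replication value against a file listing one DataNode host per line. The function name check_replication and the file argument are illustrative assumptions.

```shell
# Warn if the replication factor exceeds the number of listed DataNode hosts.
# Arguments: replication factor, file with one host name per line.
check_replication() {
    local replication="$1" hosts_file="$2" datanodes
    # Count non-empty lines that are not comments.
    datanodes="$(awk 'NF && $1 !~ /^#/ { n++ } END { print n + 0 }' "$hosts_file")"
    if [ "$replication" -gt "$datanodes" ]; then
        echo "dfs.replication=$replication exceeds $datanodes DataNode(s); blocks will be under-replicated"
        return 1
    fi
    echo "dfs.replication=$replication fits $datanodes DataNode(s)"
}
```

If only slave1 and slave2 run a DataNode, a value of 3 triggers the warning; if master runs one as well, list it in the file too.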
- Copy all your files to the other machines since you need this configuration on all the nodes:
scp -r /usr/local/hadoop/etc/hadoop/* hadoop@slave1:/usr/local/hadoop/etc/hadoop/
scp -r /usr/local/hadoop/etc/hadoop/* hadoop@slave2:/usr/local/hadoop/etc/hadoop/
- Format your namenode before the first time you set up your cluster. If you format a running Hadoop filesystem, you will lose all the data stored in HDFS.
hdfs namenode -format
- For master node, start everything.
/usr/local/hadoop/sbin/ --config /usr/local/hadoop/etc/hadoop --script hdfs start namenode
/usr/local/hadoop/sbin/ --config /usr/local/hadoop/etc/hadoop/ start resourcemanager
/usr/local/hadoop/sbin/ start proxyserver --config /usr/local/hadoop/etc/hadoop/
/usr/local/hadoop/sbin/ start historyserver --config /usr/local/hadoop/etc/hadoop
- For slave nodes, only start DataNode and NodeManager.
/usr/local/hadoop/sbin/ --config /usr/local/hadoop/etc/hadoop/ start nodemanager
/usr/local/hadoop/sbin/ --config /usr/local/hadoop/etc/hadoop --script hdfs start datanode
- To check your cluster, go to:
- http://master-ip:50070/dfshealth.jsp
- http://master-ip:8088/cluster
- http://master-ip:19888/jobhistory (for Job History Server)
Log files are located under /usr/local/hadoop/logs.
Once your cluster is running, you can check the running Java services with jps.
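To turn the jps check into a quick pass/fail test, something like the sketch below could be used. check_daemons is a hypothetical helper, and the daemon names are the JVM main-class names that jps normally prints for these services.

```shell
# Check that each named daemon appears in the jps output for the current user.
# Returns non-zero if any daemon is missing.
check_daemons() {
    local running name missing=0
    # jps prints one "<pid> <main class>" line per JVM.
    running="$(jps | awk '{ print $2 }')"
    for name in "$@"; do
        if printf '%s\n' "$running" | grep -qx "$name"; then
            echo "$name: running"
        else
            echo "$name: missing"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage on master: check_daemons NameNode ResourceManager. On the slaves: check_daemons DataNode NodeManager.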
- The example below will generate a 10GB data set. Note that the input to teragen is the number of 100-byte rows, so 100000000 rows is 10GB:
cd /usr/local/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.6.0.jar teragen 100000000 /terasort/in
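For other data set sizes, the row count follows from the 100-byte row size. A small sketch, with teragen_rows as a hypothetical helper that treats 1GB as 10^9 bytes:

```shell
# Convert a target size in GB (10^9 bytes) to a teragen row count.
# Each teragen row is 100 bytes.
teragen_rows() {
    local gigabytes="$1"
    echo $(( gigabytes * 1000 * 1000 * 1000 / 100 ))
}
```

For example, teragen_rows 10 prints 100000000, the value used in the command above.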
- Now, let's do the sort
hadoop jar hadoop-mapreduce-examples-2.6.0.jar terasort /terasort/in /terasort/out
- Validate that everything completed successfully:
hadoop jar hadoop-mapreduce-examples-2.6.0.jar teravalidate /terasort/out /terasort/val
- Clean up, e.g.
hdfs dfs -rm -r /terasort/\*