2-node Hadoop Cluster with a PC and VirtualBox

How do you set up a 2-node Hadoop cluster on Linux (Ubuntu 14.04 LTS), using your PC as the master and a VirtualBox VM as the slave? Based on Sumit Chawla's guide, I wanted to present my own procedure, since it took me some time and more than one trip to askubuntu!

// Download an Ubuntu ISO (I got the desktop image from http://releases.ubuntu.com/14.04/)
sudo apt-get install virtualbox 
// http://askubuntu.com/questions/142549/how-to-install-ubuntu-on-virtualbox

// On master + slave-1
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
sudo update-java-alternatives -s java-7-oracle
sudo addgroup hadoopgroup
sudo adduser --ingroup hadoopgroup hadoopuser
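// Optionally, sanity-check the Java install and the new user:
java -version
// (should report something like: java version "1.7.0_xx")
id hadoopuser
// (should list hadoopgroup among the user's groups)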

// On the VirtualBox VM, go to (in the menu at the top of the window)
// Devices -> Network -> Network Settings and select
// Attached to: Bridged Adapter and Name: wlan0,
// and hit OK.
// Now we have to find the IP of our PC (the master) and
// of our VirtualBox VM (the slave-1).
// http://stackoverflow.com/a/13322549/2411320 gives:
ifconfig | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1'
// which should be executed on both master + slave-1.
// I got 192.168.1.2 for master and 192.168.1.10 for slave-1.
// On master + slave-1, do
sudo pico /etc/hosts
// and insert
// 192.168.1.2     master
// 192.168.1.10    slave-1
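// Optionally, check that the two machines can reach each other by name
// (using the example IPs above; adjust to yours):
ping -c 3 slave-1
// and, from slave-1:
ping -c 3 master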

// On master + slave-1
sudo apt-get install openssh-server
sudo ufw allow 22
// src: http://askubuntu.com/a/51926/412960
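// Optionally, confirm that sshd is running and that the firewall allows port 22:
sudo service ssh status
sudo ufw status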

// On master
su - hadoopuser
ssh-keygen -t rsa -P ""
cat /home/hadoopuser/.ssh/id_rsa.pub >> /home/hadoopuser/.ssh/authorized_keys
chmod 600 /home/hadoopuser/.ssh/authorized_keys
ssh-copy-id -i ~/.ssh/id_rsa.pub slave-1
// you should not be asked for a password!!!
ssh slave-1
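// (type 'exit' to come back to master)
// Alternatively, a quick non-interactive check, which should print the slave's hostname
// without any password prompt:
ssh slave-1 hostname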

// On master + slave-1
cd /home/hadoopuser
sudo chown -R hadoopuser /home/hadoopuser/
sudo wget http://apache.tsl.gr/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
sudo tar xvf hadoop-2.6.0.tar.gz
sudo mv hadoop-2.6.0 hadoop
export HADOOP_HOME=/home/hadoopuser/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
// now open
sudo pico /home/hadoopuser/hadoop/etc/hadoop/hadoop-env.sh
// and change the JAVA_HOME line to: export JAVA_HOME=/usr/lib/jvm/java-7-oracle
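// Note that the export lines above only last for the current shell session; to make them
// permanent, you can append them (as hadoopuser) to ~/.bashrc:
echo 'export HADOOP_HOME=/home/hadoopuser/hadoop' >> ~/.bashrc
echo 'export JAVA_HOME=/usr/lib/jvm/java-7-oracle' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc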

// On master + slave-1, edit /home/hadoopuser/hadoop/etc/hadoop/core-site.xml
// and between the <configuration> tags, insert:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoopuser/tmp</value>
  <description>Temporary Directory.</description>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:54310</value>
  <description>Use HDFS as file storage engine</description>
</property>
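
// hadoop.tmp.dir points to /home/hadoopuser/tmp. Hadoop should create it on first use (as whichever
// user starts the daemons, which is why a chown of tmp is needed further below), but you can also
// create it up front and give it to hadoopuser:
sudo mkdir -p /home/hadoopuser/tmp
sudo chown -R hadoopuser:hadoopgroup /home/hadoopuser/tmp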


// On master, edit /home/hadoopuser/hadoop/etc/hadoop/mapred-site.xml.template
// and between the <configuration> tags, insert:
<property>
 <name>mapreduce.jobtracker.address</name>
 <value>master:54311</value>
 <description>The host and port that the MapReduce job tracker runs
  at. If “local”, then jobs are run in-process as a single map
  and reduce task.
</description>
</property>
<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 <description>The framework for running mapreduce jobs</description>
</property>
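
// Note: Hadoop reads mapred-site.xml, not the .template file. If the settings above do not seem
// to take effect, copy the template to mapred-site.xml and edit that copy instead:
cp /home/hadoopuser/hadoop/etc/hadoop/mapred-site.xml.template /home/hadoopuser/hadoop/etc/hadoop/mapred-site.xml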

// On master + slave-1, edit /home/hadoopuser/hadoop/etc/hadoop/hdfs-site.xml
// and between the <configuration> tags, insert:

<property>
 <name>dfs.replication</name>
 <value>2</value>
 <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
 </description>
</property>
<property>
 <name>dfs.namenode.name.dir</name>
 <value>/hadoop-data/hadoopuser/hdfs/namenode</value>
 <description>Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
 </description>
</property>
<property>
 <name>dfs.datanode.data.dir</name>
 <value>/hadoop-data/hadoopuser/hdfs/datanode</value>
 <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
 </description>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

// On master + slave-1, edit /home/hadoopuser/hadoop/etc/hadoop/yarn-site.xml
// and between the <configuration> tags, insert:
<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
</property>
<property>
 <name>yarn.resourcemanager.scheduler.address</name>
 <value>master:8030</value>
</property> 
<property>
 <name>yarn.resourcemanager.address</name>
 <value>master:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>master:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>master:8033</value>
</property>

// On master
// edit /home/hadoopuser/hadoop/etc/hadoop/slaves so that it contains the following two lines (delete localhost if it is there):
master
slave-1

// On master + slave-1
// give a password to root
sudo passwd
// and in /etc/ssh/sshd_config replace
// PermitRootLogin without-password
// with
// PermitRootLogin yes
sudo service ssh restart
// Now you should be able to do 'ssh root@master'
sudo mkdir -p /hadoop-data/hadoopuser/hdfs/namenode
sudo mkdir -p /hadoop-data/hadoopuser/hdfs/datanode
sudo chown -R hadoopuser:hadoopgroup /hadoop-data
sudo chown -R hadoopuser:hadoopgroup /home/hadoopuser/hadoop/
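// A quick check that the directories exist and belong to hadoopuser:
ls -ld /hadoop-data/hadoopuser/hdfs/namenode /hadoop-data/hadoopuser/hdfs/datanode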

// On master
cd /home/hadoopuser/hadoop/bin
// Format the NameNode (this is needed the first time you bring up HDFS):
sudo ./hdfs namenode -format
cd ../sbin
sudo chmod +x start-dfs.sh
sudo ./start-dfs.sh
sudo chmod +x start-yarn.sh
sudo ./start-yarn.sh
// now open http://master:8088/cluster/nodes in a browser
// and you should see BOTH master + slave-1 listed!
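// You can also list the running Java daemons with jps (use sudo here, since the daemons above
// were started with sudo). On master I would expect something like NameNode, SecondaryNameNode,
// DataNode, ResourceManager and NodeManager; on slave-1, a DataNode and a NodeManager.
sudo jps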


// On master
// go to /home/hadoopuser/
sudo chown -R hadoopuser:hadoopgroup tmp
// go to /home/hadoopuser/hadoop/
bin/./hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 30 100

// The next time I wanted to run the example, after powering off the cluster, I had to:
cd /home/hadoopuser/
sudo chown -R hadoopuser:hadoopgroup tmp
su - hadoopuser
cd hadoop
bin/./hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 30 100

Now, after reading Python and Hadoop, here is what I had to do to copy the input files to HDFS:

su - hadoopuser
hadoopuser@gsamaras:~$ hadoop/bin/./hadoop fs -mkdir gutenberg/
hadoopuser@gsamaras:~$ hadoop/bin/./hdfs dfs -copyFromLocal /home/gsamaras/Downloads/gutenberg/ gutenberg/
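// Optionally, verify that the files landed in HDFS (they should end up under gutenberg/gutenberg):
hadoopuser@gsamaras:~$ hadoop/bin/./hdfs dfs -ls gutenberg/gutenberg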

and for running:

// -file ships mapper.py and reducer.py to the cluster along with the job, so every node gets a local copy of the scripts
hadoop/bin/./hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -file mapper.py -mapper mapper.py \
    -file reducer.py -reducer reducer.py \
    -input gutenberg/gutenberg/* -output gutenberg/gutenberg-output
cd hadoop
bin/./hdfs dfs -cat ../hadoopuser/gutenberg/gutenberg-output/part-00000

// To delete directory 'gutenberg/out'
hadoopuser@gsamaras:~$ hadoop/bin/./hdfs dfs -rm -f -r gutenberg/out


// https://hadoop.apache.org/docs/r1.2.1/streaming.html
// To specify the number of reducers, for example two, use: 
 $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=2 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc 
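
// With two reducers you should end up with two output files (part-00000 and part-00001),
// which you can list with:
$HADOOP_HOME/bin/hdfs dfs -ls myOutputDir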

Have questions? Comments? Did you find a bug? Let me know!😀
Page created by G. (George) Samaras (DIT)