Tuesday, March 13, 2012

Cloudera's Hadoop Demo VM with VirtualBox: Running WordCount.java on the VM, a Step-by-Step Tutorial for Beginners

ABOUT THIS TUTORIAL

NOTE: Work in progress, please report any errors while following this tutorial in the comments! (v2.0)

I found setting up and running Hadoop on the free Cloudera VM very frustrating. For a while, I was stuck on a problem where map and reduce both sat at 0%, and the VM would eventually crash. I found out that it was due to a heap space error, but that didn't help much, as I was running MapReduce on two text files, each consisting of one line.

After a lot of trial and error, and reading of misleading, meandering, and incomplete "tutorials" online, I've written my own which I hope will help anyone trying to run Hadoop using the Cloudera VM without any problems. I did learn a lot and get familiar with the terminal again, so it wasn't a waste of time...but I spent an embarrassing amount of time searching for the cause of my failure to run WordCount - when it was all due to incorrect setup.

All commands are to be typed AS IS, unless there is text in this format: <your-value-here>, which you should replace with your own value.

I guess it's still a work in progress in terms of clarity and beginner-friendliness, and feedback is welcomed.

__________________________



Versioning:
Tutorial 1.0
VirtualBox 4.1.8
Cloudera VM image download date: 17/02/2012, size: 3.64GB

Prerequisite knowledge
Background knowledge of MapReduce and Hadoop is needed. Basic knowledge of Java is also needed to understand the WordCount example.

It is assumed that you have basic familiarity with using Linux OS and the terminal, especially commands such as: cd, ls, man, rm, mv, cp, etc. Googling what you're trying to do or the commands mentioned will generally yield the command you're looking for. Using the Tab key for autocompletion will save a lot of time and effort.

If you get "stuck" inside a command or want to terminate any running command, use Ctrl + C.

DOWNLOADS
Install either VirtualBox, KVM, or VMware (VMware is the Mac-compatible option), and download the appropriate image from the Cloudera Hadoop Demo VM page.

--

ADVANCED: You may use WinMD5Sum to check the downloaded image with the hashes provided.
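On a Linux host, you can do the same from a terminal; this is my addition, and the filename below is just a placeholder for whichever image you downloaded:

$ md5sum <downloaded-image-file>

Compare the printed hash with the one listed on the download page.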

--

◇ ◇

STARTING UP VIRTUALBOX, VM SETUP
Start up VirtualBox: create new virtual machine using the New button.
Type the name for the VM, and choose Linux and Ubuntu.
Set the amount of memory dedicated to the VM. Do NOT drag the slider below 512MB. Choose Use existing hard disk, select the image on your local disk, finish.
DO NOT START IT YET. Go to Settings > System, tick Enable IO APIC.

ALTERNATIVE: STARTING UP VMware, VM SETUP
VMware does not appear to need any additional setup; just run the image using VMware Player.


◇ ◇

RUNNING THE CLOUDERA IMAGE

Start the VM.
Once the VM has booted up, open up the terminal. The VM runs the CentOS distribution of Linux with the Xfce desktop environment.

--

ADVANCED:

$ sudo -s


This gives root privileges for the rest of the terminal session. If you do not do this, you must prepend most commands with sudo.

--

Input the following commands (yum is CentOS's equivalent of apt-get). Prepend commands with sudo if they fail with a permissions error (unless you've entered sudo -s previously).

$ yum update
$ yum install gcc
$ yum install kernel-devel

◇ ◇

SETTING UP HADOOP

Open a web browser and download the latest stable release of Hadoop from the Apache Hadoop releases page. For this tutorial, I downloaded hadoop-1.0.0/ (15-Dec-2011 16:51).

Save the tar.gz archive to your Desktop.

Move it to /usr/local and untar it using the commands below (run the mv from the Desktop, where you saved the archive):


$ mv hadoop-1.0.0.tar.gz /usr/local
$ cd /usr/local
$ tar xzf hadoop-1.0.0.tar.gz

This command should now list the available Hadoop commands:

$ hadoop

Check where Java is installed, and which version, with:
$ which java
$ java -version
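If which java returns a symlink (e.g. /usr/bin/java), you can resolve it to the real JDK directory; this readlink one-liner is my addition, not part of the original setup:

$ readlink -f $(which java)

The JAVA_HOME value you want is this path without the trailing /bin/java.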

--

VI CRASH COURSE:

vi is a text editor within the terminal. You can use emacs, vim, etc., but vi is used throughout this tutorial.

After the command

$ vi

is entered, you are in the vi editor.

Hit the I key to enter INSERT mode and start typing.
Hit Esc to leave any mode (e.g. INSERT mode) and return to normal mode.
From normal mode, type :wq to write (save) and quit. To quit without saving your changes, type :q!.

See any online vi reference for a more comprehensive guide to vi commands.

--

Go to /usr/local/hadoop-1.0.0/conf (note: it is NOT the directory /bin/hadoop-1.0.0) and change the JAVA_HOME variable in hadoop-env.sh to the Java path you just displayed. Remember to uncomment the line by removing the leading "#". For example, at the time of writing:

$ cd /usr/local/hadoop-1.0.0/conf

Make sure to use sudo, since the file is read-only.

$ sudo vi hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.6.0_21
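To double-check the edit before relaunching, you can print the line back (optional):

$ grep JAVA_HOME /usr/local/hadoop-1.0.0/conf/hadoop-env.sh

It should show your export line without a leading "#".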

RELAUNCH THE TERMINAL.

Try this command; it should display usage information for hadoop:

$ hadoop

--

Troubleshooting note: If there is ANY problem with the bin/hadoop command (e.g. it returns "No such directory"), check the path of the Java JDK, and check that you've uncommented the JAVA_HOME line in hadoop-env.sh. The edit MUST be made in /usr/local/hadoop-1.0.0, because the directory /bin/hadoop-1.0.0 is NOT the one that is read.

--

Open bashrc:

$ vi ~/.bashrc

Paste these lines into bashrc (they point to the CDH Hadoop that comes preinstalled on the VM; the compile step below uses them to build the classpath):

export HADOOP_HOME=/usr/lib/hadoop-0.20/
export HADOOP_VERSION=0.20.2-cdh3u3

RELAUNCH THE TERMINAL.
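After relaunching, confirm the variables are set; they are used to build the javac classpath in the next section:

$ echo $HADOOP_HOME
$ echo $HADOOP_VERSION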

◇ ◇

SETTING UP THE CANONICAL MAPREDUCE EXAMPLE, WORDCOUNT
Make these directories on the Desktop:

/Desktop/wordcount/wordcount_classes/org/myorg
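A single command can create the whole nested path (assuming your Desktop is at /home/cloudera/Desktop, as on this VM):

$ mkdir -p /home/cloudera/Desktop/wordcount/wordcount_classes/org/myorg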

Inside /myorg, create WordCount.java, pasting in the source from the WordCount listing in the Hadoop tutorial (linked at the end of this post).

Compile WordCount.java (run javac from inside the myorg directory, so the generated .class files land next to the source under wordcount_classes/org/myorg, matching the package declaration):
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar WordCount.java

ssh into localhost:
$ ssh localhost

cd to /usr/local/hadoop-1.0.0.
$ cd /usr/local/hadoop-1.0.0

Make the HDFS directory used in the tutorial:
$ bin/hadoop dfs -mkdir /usr/joe/wordcount/input/

Use this to check that the input folder is there:

$ bin/hadoop dfs -ls /usr/joe/wordcount/

Make the two input files used in the tutorial somewhere LOCAL (e.g. on the Desktop, which is where the -put commands below expect them):

$ vi file01

Containing one line: "Hello World Bye World"

$ vi file02

Containing one line: "Hello Hadoop Goodbye Hadoop"
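If you'd rather skip vi, the same two files can be created with echo:

$ cd /home/cloudera/Desktop
$ echo "Hello World Bye World" > file01
$ echo "Hello Hadoop Goodbye Hadoop" > file02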

cd back to /usr/local/hadoop-1.0.0.

Put them on the HDFS:

$ bin/hadoop dfs -put /home/cloudera/Desktop/file01 /usr/joe/wordcount/input
$ bin/hadoop dfs -put /home/cloudera/Desktop/file02 /usr/joe/wordcount/input
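Optionally, verify that the files landed on HDFS before running the job:

$ bin/hadoop dfs -ls /usr/joe/wordcount/input
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01

The -cat should print "Hello World Bye World".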

◇ ◇

RUNNING WORDCOUNT

Go to /Desktop/wordcount/ and build a jar from the compiled classes (this is a local operation, not an HDFS one):
$ jar -cvf wordcount.jar -C wordcount_classes/ .

View contents of jar:
$ jar tf wordcount.jar
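With the tutorial's WordCount.java, the listing should include entries like these (the $Map and $Reduce entries come from the inner classes; exact names depend on your source file):

META-INF/MANIFEST.MF
org/myorg/WordCount.class
org/myorg/WordCount$Map.class
org/myorg/WordCount$Reduce.class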

Run the jar:
$ sudo bin/hadoop jar /home/cloudera/Desktop/wordcount/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
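If the job finishes successfully, you can print the result straight from HDFS (part-00000 is the default name Hadoop gives the first reducer's output file):

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000

With the two one-line input files above, the expected output is:

Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2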

--

Note: If you get a FileAlreadyExistsException relating to /usr/joe/wordcount/output, the HDFS output directory /usr/joe/wordcount/output must be deleted with the following command before running the program again:

$ sudo bin/hadoop dfs -rmr /usr/joe/wordcount/output

If any other error comes up while using bin/hadoop, try prepending sudo to the command, or get root privileges for the session with sudo -s.

Also, check paths with:
$ pwd

as you may have chosen different locations for files; do not blindly follow this tutorial! Keep in mind that versions may also change with updates and new releases, e.g. Java, Hadoop, etc.

--

Check the logs using the web interface:
http://localhost:50030/logs/

The JobTracker status page at http://localhost:50030/ also shows running and completed jobs.

--

Tutorials followed:

http://hadoop.apache.org/common/docs/current/single_node_setup.html#Prepare+to+Start+the+Hadoop+Cluster
http://icrushservers.blogspot.com/2011/12/running-first-hadoop-job-with-clouderas.html
http://www.cloudera.com/blog/2009/07/cloudera-training-vm-virtualbox/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial

--

Other commands that may be useful:

$ bin/hadoop job -list-active-trackers
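Two more standard subcommands that may help when things look stuck (not specific to this VM):

$ bin/hadoop job -list
$ bin/hadoop dfsadmin -report

job -list shows currently running jobs; dfsadmin -report prints HDFS capacity and datanode status.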

Scripts to restart everything:
$ /usr/lib/hadoop-0.20/bin/stop-all.sh
$ /usr/lib/hadoop-0.20/bin/start-all.sh