IT462 lab - Hadoop on cluster

IT462 Lab 9: Hadoop on the USNA cluster

The primary purpose of this assignment is to familiarize yourself with using Hadoop on the USNA cluster. While each machine in MI316 has a stand-alone Hadoop installation, the USNA cluster has about 40 machines as part of a distributed Hadoop installation.

For technical reasons, we will develop the programs on the stand-alone Hadoop installation in MI316, then copy the jar file to the cluster and run the program there (we will not compile on the cluster).

For this assignment, we will again use the WordCount example.

1. On your Unix drive, in your IT462 folder, create a folder called "Lab09" (mkdir Lab09). All your work for this lab should be inside the Lab09 directory.

2. Create an "input1" directory inside your Lab09 folder (mkdir input1). In Unix, from the command prompt, cd to your "input1" directory and then copy two files into it:

cp /home/adina/public_html/it462/file* .

Note the dot at the end, denoting the current directory. That should copy the files file01.txt and file02.txt into your "input1" directory. The files are the same as the ones used in Part 1 of Lab07. View their contents to refresh your memory.

3. Create an "input2" directory inside your Lab09 folder. In Unix, from the command prompt, cd to your "input2" directory and then copy one file into it:

cp /home/adina/public_html/it462/suggestedTopics.txt .

That should copy the suggestedTopics.txt file into your "input2" directory. The file contains the topics I suggested for your presentations.

Similarly, create an "input3" directory inside your Lab09 folder. In Unix, from the command prompt, cd to your "input3" directory and then copy one file into it:

cp /home/adina/public_html/it462/bible+shakes.nopunc .

That should copy the bible+shakes.nopunc file into your "input3" directory. The file contains the text of the Bible and the works of Shakespeare. This is the same file you used for Lab08.

4. Now open a terminal window and ssh to connect to a machine on the USNA cluster. We will use thresher.

ssh thresher.ewlab.usna.edu

Your username is mXXXXXX where XXXXXX is your alpha and your password will be provided by the instructor.

5. Copy all your current content from Lab09 on your CS UNIX account to thresher:

We will copy files often during this lab. The command used to copy files between a source and a destination is scp.

In general, to copy files to a "remote" host (while logged into a "local" host):

scp SourceFile user@host:directory/TargetFile

To copy a file from a remote host:

scp user@host:directory/SourceFile TargetFile

To copy an entire directory recursively from a remote host, add the -r option:

scp -r user@host:directory/SourceFolder TargetFolder

In particular, to copy your entire Lab09 directory from your CS Unix account to thresher, execute the following command while ssh-ed to thresher:

scp -r mXXXXXX@mara.cs.usna.edu:/home/mXXXXXX/IT462/Lab09 .

(note the last . on the previous line, to denote the current directory)

6. Back on your CS machine, copy my version of WordCount.java into your account.

From the terminal, cd to the Lab09 directory and execute the following command:

cp /home/adina/public_html/it462/WordCount.java .

Now view the source of WordCount.java, which counts the number of occurrences of each word in a given input set. It is almost the same as the first example in Lab07, with one addition in main: the program now accepts four parameters, the last two of which are optional. In order, they are the HDFS input directory for the data, the output directory, the number of mappers, and the number of reducers to be used for the job.
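To make the four-parameter convention concrete, here is a minimal sketch of how such arguments could be parsed. It is not the code in WordCount.java: the class name WordCountArgs, its field names, and the use of -1 for "not specified" are hypothetical, and the sketch deliberately avoids the Hadoop libraries so that it runs on its own.

```java
// Hypothetical helper, not part of the provided WordCount.java.
// Parses: <input dir> <output dir> [number of mappers] [number of reducers]
public class WordCountArgs {
    final String inputDir;
    final String outputDir;
    final int numMappers;   // -1 means "not given on the command line" (assumed convention)
    final int numReducers;  // -1 means "not given on the command line" (assumed convention)

    WordCountArgs(String inputDir, String outputDir, int numMappers, int numReducers) {
        this.inputDir = inputDir;
        this.outputDir = outputDir;
        this.numMappers = numMappers;
        this.numReducers = numReducers;
    }

    static WordCountArgs parse(String[] args) {
        if (args.length < 2 || args.length > 4) {
            throw new IllegalArgumentException(
                "Usage: WordCount <input dir> <output dir> [mappers] [reducers]");
        }
        // The last two parameters are optional, so read them only if present.
        int mappers = args.length >= 3 ? Integer.parseInt(args[2]) : -1;
        int reducers = args.length >= 4 ? Integer.parseInt(args[3]) : -1;
        return new WordCountArgs(args[0], args[1], mappers, reducers);
    }

    public static void main(String[] args) {
        WordCountArgs parsed =
            parse(new String[] {"hdfsinput2", "output2_15_8", "15", "8"});
        // In a real job these values would presumably be handed to the job
        // configuration (for example JobConf.setNumMapTasks and
        // JobConf.setNumReduceTasks in the old mapred API); that is an
        // assumption about WordCount.java, so check the actual source.
        System.out.println(parsed.inputDir + " -> " + parsed.outputDir
            + ", mappers=" + parsed.numMappers
            + ", reducers=" + parsed.numReducers);
    }
}
```

Compare this with the main method in the provided WordCount.java to see how the instructor's version actually handles the optional parameters.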

Part 1: WordCount on small example

To compile the code and generate the jar file, use the following commands (from the command prompt - while logged into one of the Unix machines in MI316, in your Lab09 directory):

A: To compile (on your CS MI316 machine):

javac -classpath /usr/lib/hadoop-0.20/hadoop-core.jar WordCount.java

If you now execute ls from the command prompt, you should see the following files in your Lab09 folder: WordCount.class WordCount.java* WordCount$Map.class WordCount$Reduce.class

B: To create the jar file (on your CS MI316 machine):

jar -cvf count.jar *.class

C: Copy the jar file to USNA cluster (from your CS MI316 machine)

scp count.jar mXXXXXX@thresher.ewlab.usna.edu:/home2/mXXXXXX/Lab09/count.jar

D: Copy files to HDFS on the USNA cluster

Hadoop expects that the inputs are stored in the Hadoop Distributed File System (HDFS).

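While ssh-ed to thresher.ewlab.usna.edu, cd to Lab09.

- Create an HDFS input directory called "hdfsinput1". Note that relative HDFS paths are resolved against your HDFS home directory, so this directory lives in HDFS, not inside your local Lab09 folder: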

hadoop dfs -mkdir hdfsinput1

Note that you are using the HDFS (Hadoop Distributed File System) command to create a directory, not the regular Unix command (mkdir hdfsinput1).

To check that the directory was created, use

hadoop dfs -ls

- Copy the files from the "input1" directory into HDFS:

hadoop dfs -copyFromLocal input1/* hdfsinput1

E: Execute WordCount on USNA cluster

The following command will count the words in all files in the hdfsinput1 directory and store the results in a newly created "output1" directory in HDFS:

hadoop jar count.jar WordCount hdfsinput1 output1

F: Check output in HDFS:

Check the content of the output1/part-00000 file using:

hadoop dfs -cat output1/part-00000 | head
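As a rough guide to what the listing should look like, the following plain-Java sketch counts words locally and prints them in the shape Hadoop's default text output uses: one word per line, a tab, then its count, with the words of each part file in sorted order. The class name LocalWordCount and the two sample lines are made up for illustration; they are not the contents of file01.txt or file02.txt, and the sketch does not use Hadoop at all.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the WordCount job, for illustration only.
public class LocalWordCount {

    // Counts whitespace-separated words; TreeMap keeps the keys sorted,
    // similar to the sorted keys a single reducer writes out.
    static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input, not the lab's input files.
        String[] sample = {"hello hadoop", "goodbye hadoop"};
        for (Map.Entry<String, Integer> e : count(sample).entrySet()) {
            // Same "key<TAB>value" layout as the lines in part-00000.
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

If the output of the hadoop dfs -cat command does not have this word-tab-count shape, recheck the input and output directory names used in step E.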

Take a print screen of the command window showing the content of output1/part-00000 file and paste it into yourlastname_Lab09.doc

Part 2: Access the JobTracker's web interface

You can access some of the information generated by the JobTracker, responsible for all Hadoop jobs, by using the web interface available at http://thresher.ewlab.usna.edu:50030/jobtracker.jsp

Use this interface to answer the following questions:

Question 1. What was your job id for one of your runs? If you ran the code more than once, any job id of a successful run will do.

Take a print screen of the browser showing that job id and paste it into yourlastname_Lab09.doc

Question 2. How many map tasks does your job contain? How many reduce tasks does your job contain?

Take a print screen of the browser showing the number of map tasks and reduce tasks for your job and paste it into yourlastname_Lab09.doc

Question 3. How many records (lines) are in the input data? (Look for the "Map input records" counter)

Take a print screen of the browser showing the number of map input records for your job and paste it into yourlastname_Lab09.doc

Part 3: WordCount for bigger files

Repeat steps D, E, and F of Part 1 to execute WordCount for the file in input2 in your Lab09 directory on thresher.ewlab.usna.edu (the suggestedTopics.txt file). Use hdfsinput2 as the HDFS input directory and output2 as the output directory.

Take a print screen of the command window showing the content of output2/part-00000 file and paste it into yourlastname_Lab09.doc

Now run the word count demo on this dataset with a specified number of mappers (15) and reducers (8), as follows:

hadoop jar count.jar WordCount hdfsinput2 output2_15_8 15 8

Question 4. What is the 6th word in output2_15_8/part-00005, and how many times does it appear? (Use the same command as in Part 1-F to see the content of the file.)

Take a print screen of the command window showing the content of output2_15_8/part-00005 file and paste it into yourlastname_Lab09.doc

Question 5. How many map tasks does your last job contain? How many reduce tasks does your job contain?

Take a print screen of the browser showing the number of map tasks and reduce tasks for your job and paste it into yourlastname_Lab09.doc

Part 4 (Extra credit)

Repeat steps D, E, and F of Part 1 to execute WordCount for the file in input3 in your Lab09 directory on thresher.ewlab.usna.edu (the Bible and Shakespeare file). Use hdfsinput3 as the HDFS input directory and output3 as the output directory.

Take a print screen of the command window showing the content of output3/part-00000 file and paste it into yourlastname_Lab09.doc

You'll notice that there is a lot of "junk" in the output. Let's try to clean this up by throwing away terms that don't appear often. Modify the word count demo so that it retains only words that occur more than 10 times (i.e., count > 10).
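The filtering idea can be sketched in plain Java as follows. This is only an illustration of the logic, not the required change to WordCount.java: the class name FrequentWordFilter, the constant name, and the method signature are hypothetical, and no Hadoop classes are used. The method mimics a single reduce call: it sums the partial counts for one word and reports a result only when the total is strictly greater than 10.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.OptionalInt;

// Hypothetical illustration of the "count > 10" filter, without Hadoop.
public class FrequentWordFilter {
    static final int THRESHOLD = 10; // keep a word only if its total count > THRESHOLD

    // Sums the partial counts for one word; an empty result means
    // "do not write this word to the output".
    static OptionalInt reduce(String word, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum > THRESHOLD ? OptionalInt.of(sum) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Made-up partial counts for two made-up words.
        OptionalInt frequent = reduce("the", Arrays.asList(6, 5).iterator());
        OptionalInt rare = reduce("zounds", Arrays.asList(4, 6).iterator());
        System.out.println("the kept: " + frequent.isPresent());
        System.out.println("zounds kept: " + rare.isPresent());
    }
}
```

One caution when moving this idea into the real program: if your WordCount.java also registers the reduce class as a combiner, the filter must not run in the combiner, because partial counts of 10 or fewer would be dropped before they are added together and the final totals would be wrong.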

Once you've modified the program on your MI316 machine, compile it and create the jar file, then copy the jar file to thresher.ewlab.usna.edu and run it again:

hadoop jar count.jar WordCount hdfsinput3 output3_20_5 20 5

This time, use only 5 reducers. Note the slightly different path in which to put your results.

Question 6. How many terms appear more than 10 times in the collection?

Write your answer in yourlastname_Lab09.doc

Turn in (Both electronic and paper submissions are required):

Electronic:

Upload the file yourlastname_Lab09.doc with answers to Part 1, Part 2, and Part 3 (and Part 4 for extra credit) to Blackboard.

Hard-copies:

The completed assignment coversheet. Your comments will help us improve the course.
Hard copy of yourlastname_Lab09.doc