it462 - hadoop lab

IT462 Lab 8: Building Inverted Indexes With MapReduce

Preliminaries:

1) On your Unix drive, in your "IT462" directory, create a folder called "Lab08" (without quotes). All your work for this lab should be inside Lab08 directory.

2) Create an "input" directory inside your Lab08 folder. In Unix, from the command prompt, cd to get to your "input" directory and then copy a file in there:

cp /home/adina/public_html/bible+shakes.nopunc .

That should copy the bible_shakes.nopunc file in your input directory. The file contains the text for the bible and the works of Shakespeare.

Note: This lab is inspired by an assignment given in the Cloud Computing course at UMD.

For this lab, you will create an inverted index. An inverted index is a data structure common to nearly all information retrieval systems, and similar with the index found at the end of most books. Let us consider the following famous lines from Shakespeare's Merchants of Venice, with line numbers included for your convenience:

1: if you prick us do we not bleed

2: if you tickle us do we not laugh

3: if you poison us do we not die and

4: if you wrong us shall we not revenge

An inverted index consists of a list of postings, one for each unique word in the collection. Each posting for a word contains the word itself and a list of (documentID, count) pairs with the documentID and the frequency (count) of the word in that document.

Let's treat each line in the above sample data as if it were a "document", with the documentID being the line number. The complete inverted index would look something like this:

and : 1 : (3, 1)

bleed : 1 : (1, 1)

die : 1 : (3, 1)

do : 3 : (1, 1), (2, 1), (3, 1)

if : 4 : (1, 1), (2, 1), (3, 1), (4, 1)

laugh : 1 : (2, 1)

not : 4 : (1, 1), (2, 1), (3, 1), (4, 1)

poison : 1 : (3, 1)

prick : 1 : (1, 1)

revenge : 1 : (4, 1)

shall : 1 : (4, 1)

tickle : 1 : (2, 1)

us : 4 : (1, 1), (2, 1), (3, 1), (4, 1)

we : 4 : (1, 1), (2, 1), (3, 1), (4, 1)

wrong : 1 : (4, 1)

you : 4 : (1, 1), (2, 1), (3, 1), (4, 1)

As you can see, we have a postings list for each word that appears in the collection. Let us look at the postings list corresponding to the term if in a bit more detail:

if : 4 : (1, 1), (2, 1), (3, 1), (4, 1)

The number directly after the term is its document frequency or df for short. The df specifies the number of documents that contain this term. Since if appears in all four documents, its df is 4. Although the df can be easily reconstructed by counting the number of postings, it is often explicitly stored in the inverted index (this is not required for the lab). The postings list contains a number of postings, each of which is a (docno, tf) tuple. The docno is simply a unique identifier for the document (one through four, in this case). The tf, which stands for term frequency, is the number of times the term appears in the document. The term if appears once in every document. Typically, postings are sorted by ascending docno (as shown above, but again, not required for this lab).

Your Task

Write a MapReduce program that builds an inverted index as described above. You are NOT required to explicitly store the df as above (but see extra credit), only all the individual postings. Note that the description above only specifies the logical structure of the inverted index—you are free in your choice of data structures for the actual implementation (e.g., each posting does not literally need to be a tuple denoted by parentheses).

Use the commands learned in the previous lab to compile and run your program on the Hadoop installation in MI316. Run the inverted indexer on the sample data copied to your input directory, the Bible and the complete works of Shakespeare. As with the above case, treat each line as if it were an individual "document". When you map over a plain text file using TextInputFormat in Hadoop, the key passed to the mapper contains the byte offset of the line from the beginning of the file, while the value contains the text of the line. Use this offset value as the unique docno.

Write the answer to the following questions in yourlastname_Lab08.docx. Note that answering these questions involves processing the output of your MapReduce program. You can use any method to do so (manually look through the output, copy-paste to Excel and process it, write another Java program or Perl script to process it, ...)

Copy-paste the first 10 lines of your inverted index for the provided input file into yourlastname_Lab08.doc file.
Look up the postings corresponding to the term "starcross'd". There should only be one line in the entire collection that contains the term. What is the docno (i.e., byte offset) of that line?
Look up the postings corresponding to the term "red". In how many lines does "red" appear once, twice, three times, etc.?
Do the same for the terms "blue" and "magenta".

Extra credit 1: Write a utility (outside MapReduce) that takes a given docno (i.e., the byte offset) and returns the associated line.

Use your utility to look up the line corresponding to the docno for term "starcross'd" (your answer to question 2 above). What is that line? Write your answer in yourlastname_Lab08.doc file.

Extra credit 2: Modify your MapReduce program to explicitly store the df as in the sample inverted index shown above.

Run your modified MapReduce program and look at the output. Include a print-screen on the output showing postings with the included document frequency into yourlastname_Lab08.doc file.

Practical Tips

In this exercise, you'll have to create and manipulate postings lists, which are complex objects that have their own internal structure. Let's consider a word and its associated postings list as an example:

if : (1, 1), (2, 1), (3, 1), (4, 1)

You have several choices to represent postings lists. First, you can encode them as Java strings wrapped inside Hadoop Text objects. The string format might look something like this:

1,1:2,1:3,1:4,1

This is easy, and you can start with this approach. The downside of this approach is that when manipulating postings to answer some of the questions above, you might have to do a lot of string-based operations (e.g., splits).

The second approach is to write your own custom Writable class(es) to store the postings.

If you decide to adopt the second option, this exercise is a good opportunity to learn about different output formats. An OutputFormat (see Hadoop API) describes how output key-value pairs are written to HDFS. By default, Hadoop uses TextOutputFormat, which writes out the key-value pairs in human-readable text format. This is good for you, but can be annoying if you want to further manipulate the output programmatically—since you'll have the read in the text file and parse the key-value pairs back into Java objects (even if you have your own custom Writables). As an alternative, you might want to consider SequenceFileOutputFormat. You can specify that format with a method in JobConf:

conf.setOutputFormat(SequenceFileOutputFormat.class);

If you do this, the output of your MapReduce job will be stored in one or more SequenceFiles. The advantage of SequenceFiles is that they store key-value pairs in a machine-readable format, i.e., as serialized byte representations of the Java objects (not human readable, but can be programmatically manipulated quite easily). The SequenceFile API provides methods for reading key-value pairs—saving you the effort of having to manually parse plain text files. Of course, SequenceFiles aren't very useful if you are using Text objects as output values.

Turn in (Both electronic and paper submissions are required):

Electronic:

Upload the file yourlastname_Lab08.doc with answers to the given questions (including extra credit if done) to Blackboard.
Upload your MapReduce java file to create the inverted index to Blackboard

Hard-copies:

The completed assignment coversheet. Your comments will help us improve the course.
Hard copy of yourlastname_Lab08.doc
Hard copy of the MapReduce Java code to create and inverted index.