IT462 Lab 8: Building Inverted Indexes With MapReduce
Preliminaries:
1) On your Unix
drive, in your "IT462" directory, create a folder called
"Lab08" (without quotes). All your work for this lab should be inside
Lab08 directory.
2) Create an "input" directory
inside your Lab08 folder. In Unix, from the command
prompt, cd to get to your "input" directory
and then copy a file in there:
cp /home/adina/public_html/bible+shakes.nopunc .
That should copy the bible_shakes.nopunc file in your input directory. The file
contains the text for the bible and the works of Shakespeare.
Note: This lab is inspired by an assignment given in the
Cloud Computing course at UMD.
For this lab, you will create an inverted index. An
inverted index is a data structure common to nearly all information retrieval systems, and similar with the index found at the end of most
books. Let us consider the following famous lines from Shakespeare's Merchants
of Venice, with line numbers
included for your convenience:
1:
if you prick us do we not bleed
2:
if you tickle us do we not laugh
3:
if you poison us do we not die and
4:
if you wrong us shall we not revenge
An inverted index consists of a list of postings, one for
each unique word in the collection. Each posting for a word contains the word
itself and a list of (documentID, count) pairs with
the documentID and the frequency (count) of the word
in that document.
Let's treat each line in the above sample data as if it were
a "document", with the documentID being the
line number. The complete inverted index would look something like this:
and : 1 : (3, 1)
bleed : 1 : (1, 1)
die : 1 : (3, 1)
do : 3 : (1, 1), (2, 1), (3, 1)
if : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
laugh : 1 : (2, 1)
not : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
poison :
1 : (3, 1)
prick : 1 : (1, 1)
revenge : 1 : (4,
1)
shall : 1 : (4, 1)
tickle :
1 : (2, 1)
us : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
we : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
wrong : 1 : (4, 1)
you : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
As you can see, we have a postings list for each word that
appears in the collection. Let us look at the postings list corresponding to
the term if in a bit more detail:
if : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
The number directly after the term is its document
frequency or df
for short. The df
specifies the number of documents that contain this term. Since if
appears in all four documents, its df
is 4. Although the df
can be easily reconstructed by counting the number of postings, it is often
explicitly stored in the inverted index (this is not required for the lab). The
postings list contains a number of postings, each of which is a (docno, tf) tuple. The docno is simply
a unique identifier for the document (one through four, in this case). The tf, which stands for term frequency, is the
number of times the term appears in the document. The term if appears
once in every document. Typically, postings are sorted by ascending docno (as shown above, but again, not required for this
lab).
Your Task
Write a MapReduce program that builds
an inverted index as described above. You are NOT required to explicitly store
the df as above (but see extra credit),
only all the individual postings. Note that the description above only
specifies the logical structure of the inverted index—you are free in your
choice of data structures for the actual implementation (e.g., each posting
does not literally need to be a tuple denoted by
parentheses).
Use the commands learned in the previous lab to compile and
run your program on the Hadoop installation in MI316.
Run the inverted indexer on the sample data copied to your input directory, the
Bible and the complete works of Shakespeare. As with the above case, treat each
line as if it were an individual "document". When you map over a
plain text file using TextInputFormat in Hadoop, the key passed to the mapper
contains the byte offset of the line from the beginning of the file, while the
value contains the text of the line. Use this offset value as the unique docno.
Write the
answer to the following questions in yourlastname_Lab08.docx. Note that
answering these questions involves processing the output of your MapReduce program. You can use any method to do so
(manually look through the output, copy-paste to Excel and process it, write
another Java program or Perl script to process it, ...)
Extra
credit 1: Write a utility (outside MapReduce) that takes a given docno
(i.e., the byte offset) and returns the associated line.
Use your
utility to look up the line corresponding to the docno
for term "starcross'd" (your answer to
question 2 above). What is that line? Write your answer in
yourlastname_Lab08.doc file.
Extra credit 2:
Modify your MapReduce program to explicitly store the
df
as in the sample inverted index shown above.
Run your modified MapReduce program
and look at the output. Include a print-screen on the output showing postings
with the included document frequency into yourlastname_Lab08.doc file.
Practical Tips
In this exercise, you'll have to create and manipulate postings
lists, which are complex objects that have their own internal structure. Let's
consider a word and its associated postings list as an example:
if : (1, 1), (2, 1), (3, 1), (4, 1)
You have several choices to represent postings lists. First,
you can encode them as Java strings wrapped inside Hadoop
Text objects. The string format might look something like this:
1,1:2,1:3,1:4,1
This is easy, and you can start with this approach. The
downside of this approach is that when manipulating postings to answer some of
the questions above, you might have to do a lot of string-based operations
(e.g., splits).
The second approach is to write your own custom Writable class(es) to store the postings.
If you decide to adopt the second option, this exercise is a
good opportunity to learn about different output formats. An OutputFormat (see Hadoop API)
describes how output key-value pairs are written to HDFS. By default, Hadoop uses TextOutputFormat,
which writes out the key-value pairs in human-readable text format. This is
good for you, but can be annoying if you want to further manipulate the output
programmatically—since you'll have the read in the text file and parse the
key-value pairs back into Java objects (even if you have your own custom Writables). As an alternative, you might want to consider SequenceFileOutputFormat. You can specify that format with
a method in JobConf:
conf.setOutputFormat(SequenceFileOutputFormat.class);
If you do this, the output of your MapReduce
job will be stored in one or more SequenceFiles. The
advantage of SequenceFiles is that they store
key-value pairs in a machine-readable format, i.e., as serialized byte
representations of the Java objects (not human readable, but can be
programmatically manipulated quite easily). The SequenceFile
API provides methods for reading key-value pairs—saving you the effort of
having to manually parse plain text files. Of course, SequenceFiles
aren't very useful if you are using Text objects as output values.
Turn in (Both electronic and paper submissions are
required):
Electronic:
Hard-copies: