SI425, Fall 2017

Lab 1: Twitter Hashtag Segmentation

Due date: the start of lab time, Aug 31

Motivation

This lab will take strings of continuous characters and segment them into words. Segmenting text is an important task for many foreign languages, such as Chinese and Arabic. Even in English where whitespace typically separates its words, genres like social media often mash words together into incomprehensible (to computers, at least) strings. Hashtags are one particular example: #segmentingisfun.

Objective

You are given a set of hashtags. Your goal is to segment them into tokens using an English dictionary and the MaxMatch algorithm in your textbook.

#bigbangtheory -> big bang theory
#chickensoup -> chicken soup
#running -> running
#30times -> 30 times
#neverstop -> never stop

I am providing you two incomplete source files: Segment.java and Evaluate.java. Your task is to fill in the missing code.

Code Setup and Data

We will use the Eclipse IDE for most labs. This course will create larger projects with multiple Java files, and an IDE greatly assists you. Please follow the Eclipse setup and installation instructions. You won't have to do this again for future labs.

Starter Java code is provided, as well as hashtag data. Make sure you can access the following directories:
/courses/nchamber/nlp/lab1/java/ : the Java code provided for this lab
/courses/nchamber/nlp/lab1/data/ : the data sets used in this lab

Create a lab1 directory in your local space, and copy /courses/nchamber/nlp/lab1/ to it (cp -R /courses/nchamber/nlp/lab1 lab1). We use ant in this class to compile our java code. Ant requires a build script, build.xml, which you can already find in the java directory. Just type ant from within the java/ directory, and it will compile. Make sure it compiles without error. The java/src/usna/segment/ directory contains the main files. You will edit Segment.java and Evaluate.java for this lab. You will use Levenshtein.java as helper code.

Setting up Eclipse: I suggest using Eclipse as your IDE. You may use whatever you want to write your Java code. Open Eclipse. Click File->New->Project->"Java Project from Existing Ant Buildfile". Browse to find the build.xml file in your new lab1 directory. Click Finish. You are ready to go! Open up Segment.java to see where you will place your code.

Run the code: I've provided two run scripts, run-segment and run-evaluate for your convenience. These are just bash scripts (take a peek in them to see that they just make basic java calls). From your local directory lab1/java/ enter:

run-segment max ../data/bigwordlist.txt ../data/hashtags-train.txt

You're just calling the Segment.java main function and giving it two paths to text files. If everything is working, there will be no errors, and no output...because you haven't filled in the functions yet!

Data files: finally, look in lab1/data and you will see a dictionary file, bigwordlist.txt. You will also see two dataset files hashtags-train.txt and hashtags-dev.txt. These are discussed below.

Part One: MaxMatch

Implement the MaxMatch algorithm as described in your textbook, Jurafsky/Martin. Briefly, the algorithm starts from the first character and finds the longest matching known English word. It then continues from the end of that word, repeating the lookup. You obviously will need an English dictionary. Use the one on the shared drive at /courses/nchamber/nlp/lab1/bigwordlist.txt. It contains a list of words and how often they occurred in a large corpus. Note the list is very noisy (as much of NLP is).

Implement MaxMatch in Segment.java. You should implement it inside the maxMatch(String filename) method. This method should simply output your segmentation guesses, one per line, and nothing else. Do not output the hashtag character, but just your segmented words (as in the Objective section example above). Run it on the provided list of hash tags at /courses/nchamber/nlp/lab1/hashtags-train.txt. The code is easy to run: java Segment max <dictionary-path> <hashtags-path>. I also provided a script run-segment.

You then need to subjectively study your output. Answer the following questions. You'll want to scan through your output and see what kinds of errors have appeared:

  1. When MaxMatch is incorrect, what types of failures do you see and what causes each? List at least 3 different types of errors that you observe with an example for each.

Part Two: Empirical Evaluation

We now want to objectively evaluate our performance. Part Two has you compute a score for your system. Note that the lab directory also has the files hashtags-dev.txt and hashtags-dev-gold.txt. These contain more tags, but also the correct answers (gold tags). You will now compare your system's output to gold answers.

The textbook describes the minimum edit distance score to compare how different two strings are (by counting the number of changes to transform one to the other). Lucky for you, I'm providing code that computes this for you free of charge. Your task is to use this distance metric to compute a Word Error Rate (WER) between your guesses and the gold answers. WER is just the length normalized minimum edit distance (i.e, minimum edit distance divided by the length of the correct segmentation string). Use my provided code Levenshtein.java to compute the edit distance score (create a Levenshtein object first, then call score):

Levenshtein.score(String a, String b);

Your task is to take my completed Levenshtein scorer and create a working WER that takes your output with gold answers and returns the average WER across that test set. Do this in Evaluate.java in the avgWER(guessPath, goldPath) and WER(guess, gold) methods. Note that avgWER() needs to read two files: (1) your output from Part One, and (2) my provided gold answers. So you'll just want to pipe Part One's output to a file. Then run the code: java Evaluate <guesses-file> <gold-file>

  1. What is the average WER of MaxMatch on your test data? (this is one number: the average WER across all test hashtags) Your number will be small, as in, far less than 1.0. An average error of 1.0 would mean you made one mistake on every character you print. For a correct MaxMatch implementation, look for something around 0.07.

Part Three: Improve MaxMatch

Using your analysis from Parts 1 and 2, you will now improve the algorithm. Your goal is to improve the WER score, and the sky's the limit on what you should try. Points will be given to the number of different improvements, creativity, and overall WER improvement. Do not change your maxMatch() method. You should copy that code into improvedMaxMatch() and then make your improvements there. You can run the system now: java better <hashtags-path>

  1. List the improvements you made.
  2. What is the final WER of your improved algorithm?

What to turn in

  1. A single text file called readme.txt with answers to the four questions in Parts 1,2,3. Save it in your lab1 directory.
  2. Completed Segment.java and Evaluate.java files. You must complete the WER, avgWER, maxMatch, improvedMaxMatch functions. Do not change the behavior of main(). Segment.java takes two command-line arguments: (1) a token ("max" or "better") that causes maxMatch or improvedMaxMatch to be called, and (2) a text file with one hashtag per line. Your code should open and read that file, then output your segmentations, one per line. Do not print anything except the segmentations. The number of lines output must equal the number of lines input.

How to turn in

Use the submit script: /courses/nchamber/submit

Create a directory for your lab called lab1 (all lowercase). Thee should be just three things in this directory: readme.txt, java/, data/ . When you are ready to turn in your code, execute the following command from the directory one level above lab1:
    /courses/nchamber/submit  lab1

Double-check the output from this script to make sure your lab was submitted. It is your responsibility if the script fails and you did not notice. The script will print out "Submission successful." at the very end of its output. If you don't see this, your lab was not submitted.

This works in the lab machines. It has not been tested on your VM mounts. Please email me (Chambers) if you have problems.

Competition!

Your submissions will be automatically tested. It is thus very important that you follow the above output instructions correctly, or your final WER score will be incorrect. I will post the final results on our page here. The top two performing submissions will receive extra credit points.

Grading

Part One: 20 pts

  1. MaxMatch correctly implemented: 14 pts
  2. Failure analysis: 6 pts
Part Two: 10 pts
Part Three: 10 pts
  1. Multiple improvements made: 6 pts
  2. Final WER reported and performance improved: 4 pts (0 pts if WER does not significantly improve)
Compiling Penalty: your code does not compile: -10 pts
Output Penalty: your output does not conform to the requirements: -5 pts
Extra Credit: best class performance, or creative/lots of MaxMatch improvements

Total: 40 pts