SI485i, Fall 2012
Due date: the start of class, Nov 1
Syntactic parsing is used in most advanced NLP applications today. From machine translation to information extraction and document classification, having the syntactic structure of a sentence gives your learning algorithms much more useful information than bare words alone. You will write the standard CKY algorithm to parse sentences.
Most of the code to create the PCFG itself is already written for you: you read in a corpus of sentences with their parse trees and binarize the trees (binarizing code is provided), and the PCFG is built from the result. Given this PCFG, you write the CKY algorithm.
The input to your program is a list of training trees and a list of test trees. Your algorithm will be evaluated on how similar its output (your guessed trees) is to the gold answers (hand-created trees). The CKYParser.java class is where your code will go. There are two functions to implement: train(), which receives the list of training trees, and a parse function that returns your guessed tree for each test sentence.
Baseline Parser. There is a class called BaselineParser that creates naive trees for each sentence. It is meant to help you understand how to use the code infrastructure. The baseline takes a sentence, tags each word with its most likely tag (i.e., a unigram tagger), then looks for occurrences of that tag sequence in the training set. If it finds an exact match, it answers with the training parse of the matching training sentence. If no match is found, it constructs a right-branching tree whose node labels are chosen independently, conditioned only on the length of each node's span. If this sounds like a strange (and terrible) way to parse, it should.
PCFG. Look at Grammar.java. It is your friend. This computes the rule probabilities in your PCFG. It simply sets each probability as follows:
P(N -> XX) = C(N -> XX) / Sum_YY C(N -> YY)
That is, the count of the rule divided by the total count of all rules with N on the left-hand side.
Your train() function will simply create a Grammar instance and initialize that object with the training trees. It will compute the above probabilities for you.
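For example, if VP -> VBD NP occurs 30 times in the binarized training trees and rules with VP on the left-hand side occur 100 times in total, the grammar sets P(VP -> VBD NP) = 30/100 = 0.3. A minimal sketch of train() along these lines is shown below; the grammar field, the exact Grammar constructor argument, and the Lexicon setup are assumptions based on the class notes later in this handout, so check the starter code for the real names.

    // Sketch of train() inside CKYParser.java (needs java.util.List / ArrayList imports).
    public void train(List<Tree<String>> trainingTrees) {
        List<Tree<String>> binarized = new ArrayList<Tree<String>>();
        for (Tree<String> tree : trainingTrees) {
            // annotateTree() converts an n-ary tree into an equivalent binary-branching tree.
            binarized.add(TreeAnnotations.annotateTree(tree));
        }
        // Grammar counts every rule in the binarized trees and normalizes the counts
        // into the probabilities shown above.  The constructor argument follows the
        // Grammar.java note below; adjust if the actual constructor differs.
        grammar = new Grammar(binarized);
        // Initialize the Lexicon from the same trees so scoreTagging(word, tag)
        // has counts to work with (see Lexicon.java for its constructor).
    }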
CKY. You must implement the CKY algorithm correctly, but you must also implement it smartly: your parser should not take minutes to run on each sentence. Think carefully about what data structures you will use before starting. Some hints (a sketch of the chart-filling loop follows them):
miniTest: Get your parser working on the miniTest dataset before you attempt the treebank datasets. The miniTest data set consists of 3 training sentences and 1 test sentence from a toy grammar. The training set contains just enough examples to produce a PP-attachment ambiguity in the test sentence.
Binarize rules: As we discussed in class, most parsers require grammars to have at most binary branching rules. You can binarize and unbinarize trees using the TreeAnnotations class: call TreeAnnotations.annotateTree(tree) to binarize a tree, and TreeAnnotations.unannotateTree(tree) to convert from binary back to normal. Call both and print the before/after trees to see what they do. After binarizing your training trees, you can use Grammar.java to build the PCFG.
Training Data: Look in some of the text files in data/genia and data/bioie to get an idea of the complexity of these sentences and their parse trees. Something you’ll notice is that the grammar has relatively few non-terminal symbols (27 plus part-of-speech tags) but thousands of rules, many ternary-branching or longer.
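To make these hints concrete, here is a sketch of the chart-filling loop. The method name getBestParse, the grammar/lexicon fields, and the rule accessors getParent(), getRightChild(), and getScore() are assumptions about the starter classes (only getBinaryRulesByLeftChild(), getAllTags(), and scoreTagging() are named in the class notes below), so adjust the names to match the provided code. It is a sketch of the technique, not a complete solution:

    // Sketch only: fill a probability chart bottom-up over all spans of the sentence.
    // score[i][j] maps a nonterminal label to the best probability of deriving words i..j-1.
    public Tree<String> getBestParse(List<String> sentence) {
        int n = sentence.size();
        Map<String, Double>[][] score = new HashMap[n + 1][n + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= n; j++)
                score[i][j] = new HashMap<String, Double>();

        // Width-1 spans: score every POS tag for each word using the Lexicon.
        for (int i = 0; i < n; i++) {
            String word = sentence.get(i);
            for (String tag : lexicon.getAllTags())
                score[i][i + 1].put(tag, lexicon.scoreTagging(word, tag));
            // Also apply unary rules (e.g. NP -> NN) to this cell.
        }

        // Wider spans: for every split point, combine a left constituent with a
        // right constituent using each binary rule whose left child matches.
        for (int width = 2; width <= n; width++) {
            for (int begin = 0; begin + width <= n; begin++) {
                int end = begin + width;
                for (int split = begin + 1; split < end; split++) {
                    for (String left : score[begin][split].keySet()) {
                        for (BinaryRule rule : grammar.getBinaryRulesByLeftChild(left)) {
                            Double rightProb = score[split][end].get(rule.getRightChild());
                            if (rightProb == null) continue;
                            double prob = rule.getScore() * score[begin][split].get(left) * rightProb;
                            Double best = score[begin][end].get(rule.getParent());
                            if (best == null || prob > best) {
                                score[begin][end].put(rule.getParent(), prob);
                                // Record a back-pointer (split point + rule) for this entry.
                            }
                        }
                    }
                }
                // Also apply unary rules to the finished (begin, end) cell.
            }
        }

        // TODO: follow the back-pointers from the root symbol in score[0][n] to build
        // the binarized Tree<String>, then return TreeAnnotations.unannotateTree() of it.
        return null;
    }

Multiplying many small probabilities can underflow on long sentences, so you may prefer to store log probabilities and add them instead.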
CKYParser.java: put your code here.
BaselineParser.java: a dumb parser, but its code may help you.
Lexicon.java: call getAllTags() to retrieve all POS tags seen in training. Call scoreTagging(word, POStag) to get P(word | tag).
Grammar.java: Create a new grammar, fully trained with probabilities, by simply saying new Grammar(binaryRules). Then call getBinaryRulesByLeftChild("DT") to retrieve a List of BinaryRule objects whose right-hand side has "DT" as its left child.
TreeAnnotations.java: Call TreeAnnotations.annotateTree(tree) to binarize a tree, and TreeAnnotations.unannotateTree(tree) to convert from binary back to normal.
Tree.java: Look in this file for its useful functions. Note that you can create your own trees by calling new Tree directly (see the example after this list).
UnaryRule: has a parent ("NP") and a single child ("NN") and a probability score.
BinaryRule: has a parent ("NP"), a left child ("DT"), a right child ("NN"), and a probability score.
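For illustration, here is a small example of building a tree by hand, which you will need when you reconstruct the best parse from your back-pointers. It assumes a Tree constructor that takes a label and a list of children, plus a leaf constructor that takes only a label; both are assumptions, so double-check Tree.java's constructors:

    // Build the fragment (NP (DT the) (NN dog)) by hand (needs java.util imports).
    Tree<String> the = new Tree<String>("the");                        // leaf word
    Tree<String> dt  = new Tree<String>("DT", Collections.singletonList(the));
    Tree<String> dog = new Tree<String>("dog");                        // leaf word
    Tree<String> nn  = new Tree<String>("NN", Collections.singletonList(dog));
    Tree<String> np  = new Tree<String>("NP", Arrays.asList(dt, nn));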
Starter Java code is provided, as well as training and test data. Make sure you can access the following directories:
/courses/nchamber/nlp/lab6/java/ : the Java code provided for this course
/courses/nchamber/nlp/lab6/data/ : the data sets used in this assignment
Create a lab6 directory in your local space, and copy lab6/java/ to it (cp -R /courses/nchamber/nlp/lab6/java lab6/). There is a build.xml file included, so just type ant in the java/ directory and make sure it compiles without error. Ant compiles all .java files in this directory structure, so you shouldn't have to change build.xml. Make sure you can run the code; there is a run script which does this for you!
Eclipse setup: Click New->Project->"Java Project from Existing Ant Buildfile". Browse to find the build.xml file in your new lab6 directory. You are ready to go! Open up CKYParser.java to see where you will place your code.
Use the run script; it makes running your code easy. There are three modes:
run -data miniTest : runs your code on the single miniTest test sentence. The -data flag also accepts bioie, genia, and combo.
run -parser usna.parser.CKYParser : runs your parser instead of the baseline.
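Presumably the two flags can be combined to run your parser on one of the treebank datasets, e.g. run -data genia -parser usna.parser.CKYParser (check the run script if that is not the case).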
Use the submit script: /courses/nchamber/submit
Create a directory for your lab called lab6 (all lowercase). When you are ready to turn in your code, execute the following command from the directory one level above lab6:
/courses/nchamber/submit lab6
Double-check the output from this script to make sure your lab was submitted; it is your responsibility to notice if the script fails. The script prints "Submission successful." at the very end of its output. If you don't see this, your lab was not submitted.
CKY Algorithm: 40 pts
Code is commented: 2 pts
Compiling Penalty: your code does not compile: -10 pts
Extra Credit: Implement vertical markovization: 8 pts!
Total: 42 pts