SI425, Fall 2017

# Lab 6: CKY Parser

Due date: the start of class, Nov 9
Milestone: Oct 26, Diagonal w/ unary rules
Milestone: Nov 2, Steps 1-4 completed ("Steps for the Lost" below)

CKY Algorithm Psuedocode

## Motivation

Syntactic parsing is used in most advanced NLP applications today. From machine translation, to information extraction and document classication, having the syntactic structure of a sentence gives your learning algorithms much more useful information than bare words alone. You will write the standard CKY algorithm to parse sentences.

## Objective

You will write the CKY algorithm. Most of the code to create the PCFG itself is already written for you. You just need to read in a corpus of sentences with their parse trees, binarize the trees (I even give you binarizing code), and the PCFG is created for you. Given this PCFG, write the CKY algorithm.

## CKY Algorithm

The input to your program is a list of training trees, and a list of test trees. Your algorithm will be evaluated based on how similar its output (your guessed trees) is to the gold answer (hand-created trees). The CKYParser.java class is where your code will go. There are two functions: train(List) and getBestParse(List). Your implementation of CKY goes in getBestParse. You build the grammar and its probabilities in train().

Baseline Parser. There is a class called BaselineParser that creates naive trees for each sentence. This is meant to help you understand how to use the code infrastructure. This basline takes a sentence, tags each word with its most likely tag (i.e., a unigram tagger), then looks for occurrences of the tag sequence in the training set. If it finds an exact match, it answers with the training parse of the matching training sentence. If no match is found, it constructs a right-branching tree with node labels chosen independently, conditioned only on the length of the span of a node. If this sounds like a strange (and terrible) way to parse, it should.

Lexicon. Look at Lexicon.java. This stores all Part of Speech tag unary rules with words: NNS->cats. Use it to lookup those rules.

PCFG. Look at Grammar.java. It is your friend. This computes the rule probabilities in your PCFG. It simply sets each probability as follows:
P(N -> XX) = C(N -> XX) / Sum_YY(C(N->YY))
Your train() function will simply create a Grammar instance and initialize that object with the training trees. It will compute the above probabilities for you.

CKY. You must implement the CKY algorithm correctly, but you must also implement it smartly. Your parser should not take minutes to run on each sentence. Think carefully about what data structures you will use before starting. Some hints:

1. I have provided you CKY pseudocode. That link is your friend. Do not loop over all non-terminals like this pseudocode. It's pseudocode for a reason...don't just follow it blindly. The logic is correct, so follow that, but be smart about data structures and method calls.
2. The CKY table is an n x n matrix. Think about using a double array. A double array of what? Create a java class to represent a single cell in the table. Each cell in the array needs to store non-terminals and their probabilities...as well as back pointers to the cells that the right-hand sides of the rules matched. Have your custom java class do it all.
3. The util/ directory contains some useful classes, such as Pair.java and Triplet.java. If you need to store two things together (rule name and its probability?), these might work for you. You don't have to use them, though.
4. Unary rules. Don't forget that after you've filled a table cell in the CKY algorithm, you still need to look for unary rules that might apply to what you have in there. Write a function called handleUnaries(cell) to do this for you. Again, see the pseudocode above.

miniTest: Get your parser working on the miniTest dataset before you attempt the treebank datasets. The miniTest data set consists of 3 training sentences and 1 test sentence from a toy grammar. The training set contains just enough examples to produce a PP-attachment ambiguity in the test sentence. You should score 100% on this single test tree.

Binarize rules: As we discussed in class, most parsers require grammars to have at most binary branching rules. You can binarize and unbinarize trees using the TreeAnnotations class. Call TreeAnnotations.binarizeTree(tree) to binarize a tree, and TreeAnnotations.unBinarizeTree(tree) to convert from binary back to normal. Call that function and output the before/after trees to see what it does. After binarizing your training trees, you can use Grammar.java to build the PCFG.

Training Data: Look in some of the text files in data/genia and data/bioie to get an idea of the complexity of these sentences and their parse trees. Something you'll notice is that the grammar has relatively few non-terminal symbols (27 plus part-of-speech tags) but thousands of rules, many ternary-branching or longer.

## Steps for the Lost

1. Write train()
2. Write the first part of getBestParse() to only fill the diagonal of the CKY table. Take each word, find its POS tag rules in the lexicon (using Lexicon.java), and fill in the diagonal. Write print statements to verify it is correct.
3. Write handleUnaries() from the pseudocode. Use Grammar.java Add it to the diagonal code you just finished above. Print statements. Is it correct?
4. Write the rest of the pseudocode, filling in the entire table.
5. Write buildTree() to create the Tree structure from the filled in table.

## Milestone 1 Goal: miniTest diagonal

For the milestone, this is what the diagonal of your table should look like for "cats scratch walls with claws". Each cell has 6 rules with correct probabilities.

 @PP->_P=0.311, V=0.064, P=0.048, N=0.350, NP=0.311, @VP->_V=0.207 @PP->_P=0.057, V=0.619, P=0.051, N=0.064, NP=0.057, @VP->_V=0.038 @PP->_P=0.167, V=0.077, P=0.058, N=0.188, NP=0.167, @VP->_V=0.111 @PP->_P=0.057, V=0.068, P=0.876, N=0.064, NP=0.057, @VP->_V=0.038 @PP->_P=0.240, V=0.068, P=0.051, N=0.270, NP=0.240, @VP->_V=0.160

## Code Details

Lexicon.java: call getAllTags() to retrieve all POS tags seen in training. Call getRuleProbability(POStag, word) to get the P(word | tag).

Grammar.java: Create a new grammar, fully trained with probabilities, by simply saying new Grammar(binaryRules). Call getBinaryRulesByLeftChild("DT") to retrieve a List of BinaryRule objects where the left child of their right-hand-side was a "DT". Call getUnaryRulesByOnlyChild("NN") to get all unary rules that generate the tag "NN".

TreeAnnotations.java: Call TreeAnnotations.binarizeTree(tree) to binarize a tree, and TreeAnnotations.unBinarizeTree(tree) to convert from binary back to normal.

Tree.java: Look in this file for its useful functions. Note that you can create your own tree:
new Tree(nonTerminal, children), where nonTerminal is a rule name like "NP" and children is a List of other trees, List<Tree<String>>. You will need to do this when you rebuild your tree after filling in the CKY table.

UnaryRule: has a parent ("NP") and a single child ("NN") and a probability score.

BinaryRule: has a parent ("NP"), a left child ("DT"), a right child ("NN"), and a probability score.

## Code Setup

Starter Java code is provided, as well as training and test data. Make sure you can access the following directories:
/courses/nchamber/nlp/lab6/java/ : the Java code provided for this course
/courses/nchamber/nlp/lab6/data/ : the data sets used in this assignment

Create a lab6 directory in your local space, and copy lab6/java/ to it (cp -R /courses/nchamber/nlp/lab6/java lab6/). There is a build.xml file included, so just type ant in the java/ directory. Make sure it compiles without error. Ant compiles all .java files in this directory structure, so you shouldn't have to change build.xml otherwise. Make sure you can run the code. There is a run script which does this for you!

Eclipse setup: Click New->Project->"Java Project from Existing Ant Buildfile". Browse to find the build.xml file in your new lab6 directory. You are ready to go! Open up CKYParser.java to see where you will place your code.

## How to Run the Code

Use the run script and it is very easy. There are three modes:
run -data miniTest Runs your code on the single test sentence. Also takes bioie,genia,combo
run -parser usna.parser.CKYParser Runs your parser, not the baseline.

## What to turn in

1. Two files out-genia.txt and out-bioie.txt from running your parser twice and saving the output:
run -data genia -parser usna.parser.CKYParser |& tee out-genia.txt
run -data bioie -parser usna.parser.CKYParser |& tee out-bioie.txt
2. Your lab6/ directory containing just your java/ subdirectory. The code should of course compile and run with the run script provided to you.

## How to turn in

Use the submit script: /courses/nchamber/submit

Create a directory for your lab called lab6 (all lowercase). When you are ready to turn in your code, execute the following command from the directory one level above lab6:
/courses/nchamber/submit  lab6

Double-check the output from this script to make sure your lab was submitted. It is your responsibility if the script fails and you did not notice. The script will print out "Submission successful." at the very end of its output. If you don't see this, your lab was not submitted.