Lab 7: CKY Parsing

Milestone Due: next week Nov 3
Lab Due: next week Nov 13

Motivation

We have studied syntax and parse trees for a bit, as well as the complexities behind choosing the correct human interpretation. Today you will implement the CKY algorithm to instead have your computer automatically parse sentences for you. You'll see they are much quicker and often better than a linguistic novice like yourself.

Starter Code

Download the starter code and data and extract it.

data/: lots of parse trees -- open a file and look!
ptbreader.py: utility program that reads parse trees from text files
grammar.py: all the functions you need to lookup grammar rules/probabilities
tablecell.py: empty class that you will fill in
ckyparser.py: the meat of the program where you will complete get_best_parse(sentence)

Your only job is to fill in tablecell.py and ckyparser.py.

Nutshell Task

Complete get_best_parse() in ckyparser.py. As you write the CKY algorithm, you'll need to complete tablecell.py as the data structure for the CKY table.

Pseudocode

You can do this lab with the lectures and textbook, but CKY pseudocode is extremely extremely helpful (that's two extremelys). The only way that pseudocode backfires is if you blindly try to mimic it without understanding why it does what it does. Debugging is then impossible. Understand the pseudocode before using it, please.

Important Object Types

I provide you two classes: Lexicon and Grammar. These store the rules, words, and probabilities from training on the corpus of parse trees. Use them to write your CKY algorithm. There are also two classes from the NLTK package that we use in this lab. The first is nltk.Tree which you'll use at the end to build your tree from your completed CKY table. The second is nltk.grammar.Production which stores a single rule like 'NP -> JJ NN'. See below for all:

Lexicon (custom to this class)

get_all_tags()        # returns a list of all POS tags
get_rule_probability('NNS','cats')   # gives you the probability of NNS->cats
word_exists('cats')   # True if it exists in the Lexicon
tag_exists('NNS')     # True if it exists in the Lexicon

Grammar (custom to this class)

get_binaryrules_by_left_child('NNS')    # returns a list of nltk.Production rules with 'NNS' as first child
get_binaryrules_by_right_child('NNS')   # returns a list of nltk.Production rules with 'NNS' as second child
get_unaryrules_by_child('NNS')          # returns a list of nltk.Production rules with 'NNS' as the only child

nltk.Production (e.g., NP -> JJ NN)

prod.lhs()   # the LHS non-terminal: wrap it in str(prod.lhs()) to just get a string like 'NP'
prod.rhs()   # a tuple pair of children like (JJ, NN). Again use str(prod.rhs()[0]) for strings

nltk.Tree (only for when your table is completed!)

t = nltk.Tree('NNS', ['cats'])      # makes a tree with a unary rule NNS->cats
s = nltk.Tree('VBZ', ['scratch'])   # makes a tree with a unary rule VBZ->scratch
p = nltk.Tree('S', [t,s])           # makes a tree with an S root and two children subtrees
p.label()   # the LHS non-terminal label as a string
p[0]        # the first child		
p[1]        # the second child
len(p)      # the number of children in this tree

Milestone (due Nov 3)

The first milestone is simply about understanding the code and setting up your double array table. Your task is to fill the table's diagonal with POS tags and unary rules.

You will do this by starting in the get_best_parse(sentence) function in ckyparser.py. As a helper, you'll need to write code in tablecell.py as this is intended for you to create a class that represents each table cell. You will then use the Lexicon object in grammar.py to lookup lexicon rules to fill in the diagonal.

You can run the code on a short test example like this:

python3 ckyparser.py -sub miniTest     # simple run of the program on toy "cat" sentence

Fill in the table's diagonal with POS tags. You must call the print_my_table(table) function at the end of get_best_parse() to prove that you've filled it in. Your output should mirror the following:

python3 ckyparser.py -sub miniTest
Training...
Testing...
CKY: ['cats', 'scratch', 'walls', 'with', 'claws']
0 1
{'NP': 0.4027763060987999, 'P': 0.06249977163602068, 'V': 0.08333302884802758, 'N': 0.4531233443611499}
0 2
{}
0 3
{}
0 4
{}
0 5
{}
1 2
{'NP': 0.0694442485767039, 'P': 0.06249982371903352, 'V': 0.7499978846284022, 'N': 0.0781247796487919}
1 3
{}
1 4
{}
1 5
{}
2 3
{'NP': 0.18055534722350425, 'P': 0.06249992788505917, 'V': 0.08333323718007889, 'N': 0.2031247656264423}
2 4
{}
2 5
{}
3 4
{'NP': 0.0694442485767039, 'P': 1.0624970032235699, 'V': 0.08333309829204469, 'N': 0.0781247796487919}
3 5
{}
4 5
{'NP': 0.2916658440221564, 'P': 0.06249982371903352, 'V': 0.08333309829204469, 'N': 0.328124074524926}

Full Lab

Complete the CKY algorithm in the get_best_parse(sentence) function. Your solution will return an nltk.Tree object containing the parse tree for the given sentence.

Different ways to run the program:

python3 ckyparser.py -sub miniTest        # simple "cats scratch" test for debugging
python3 ckyparser.py                      # medium-sized training run
python3 ckyparser.py -test 1              # parse just one sentence
python3 ckyparser.py -train 20 -test 50   # trains on 20 files instead of 5, then tests on 50 sentences

Table filled for the mini test:

python3 ckyparser.py -sub miniTest
CKY: ['cats', 'scratch', 'walls', 'with', 'claws']
# PRINTING TABLE ONLY FOR DEBUGGING	
0 1
{'N': 0.4531233443611499, 'V': 0.08333302884802758, 'P': 0.06249977163602068, 'NP': 0.4027763060987999}
0 2
{'VP': 0.0038579997133147102, 'PP': 0.004340249677479049}
0 3
{'S': 0.036361605342502434, 'NP': 0.000505022296423645, 'VP|<NP-PP>': 0.004545200667812805}
0 4
{'VP': 8.372361936514405e-06, 'PP': 2.0930904841286013e-06}
0 5
{'S': 0.0056341232123039015, 'NP': 1.7389269173777474e-05, 'VP|<NP-PP>': 0.00015650342256399728}
1 2
{'N': 0.0781247796487919, 'V': 0.7499978846284022, 'P': 0.06249982371903352, 'NP': 0.0694442485767039}
1 3
{'VP': 0.09027741898398323, 'PP': 0.011284677372997905}
1 4
{'S': 0.00026791656088833246, 'NP': 3.348957011104156e-05, 'VP|<NP-PP>': 0.00030140613099937405}
1 5
{'VP': 0.013988219085861191, 'PP': 0.000388561641273922}
2 3
{'N': 0.2031247656264423, 'V': 0.08333323718007889, 'P': 0.06249992788505917, 'NP': 0.18055534722350425}
2 4
{'VP': 0.0038580093582898816, 'PP': 0.004340260528076117}
2 5
{'S': 0.0029256488449005903, 'NP': 0.006217003795413754, 'VP|<NP-PP>': 0.05595303415872379}
3 4
{'N': 0.0781247796487919, 'V': 0.08333309829204469, 'P': 1.0624970032235699, 'NP': 0.0694442485767039}
3 5
{'VP': 0.016203612298887022, 'PP': 0.3098940852162143}
4 5
{'N': 0.328124074524926, 'V': 0.08333309829204469, 'P': 0.06249982371903352, 'NP': 0.2916658440221564}
(S
  (NP (N cats))
  (VP (V scratch) (NP (N walls)) (PP (P with) (NP (N claws)))))
Prec=1.000 Recall=1.000 F1=1.000
OVERALL: Prec=1.000	Recall=1.000	F1=1.000

Expected output for the first real sentence (don't print the table anymore!):

python3 ckyparser.py
Loaded 9648 trees!
Loaded 2012 trees!
Training...
Testing...
CKY: ['Rockwell', 'said', '0', 'the', 'agreement', 'calls', 'for', 'it', 'to', 'supply', '200', 'additional', 'so-called', 'shipsets', 'for', 'the', 'planes', '.']
(S
  (NP (NNP Rockwell))
  (VP
    (VBD said)
    (SBAR
      (-NONE- 0)
      (S
        (NP (DT the) (NN agreement))
        (VP
          (VBZ calls)
          (SBAR
            (IN for)
            (S
              (NP (PRP it))
              (VP
                (TO to)
                (VP
                  (VB supply)
                  (NP
                    (CD 200)
                    (JJ additional)
                    (JJ so-called)
                    (NNS shipsets))
                  (PP (IN for) (NP (DT the) (NNS planes)))))))))))
  (. .))
Prec=1.000 Recall=1.000 F1=1.000

Another full run...

python3 ckyparser.py -train 20 -test 50
...
...
OVERALL: Prec=0.832	Recall=0.827	F1=0.829

Extra Credit (5%)

Implement vertical markovization.

What to turn in

ckyparser.py and tablecell.py

How to turn in

Upload all to our external submission webpage.