Lab 8: CKY Parsing

Milestone Due: next week Apr 18
Lab Due: Apr 25

Motivation

We have studied syntax and parse trees for a bit, as well as the complexities behind choosing the correct human interpretation. Today you will implement the CKY algorithm to instead have your computer automatically parse sentences for you. You'll see they are much quicker and often better than a linguistic novice like yourself.

Starter Code

Download the starter code and data and extract it.

data/: lots of parse trees -- open a file and look!
ptbreader.py: utility program that reads parse trees from text files
grammar.py: all the functions you need to lookup grammar rules/probabilities
tablecell.py: empty class that you will fill in
ckyparser.py: the meat of the program where you will complete get_best_parse(sentence)

Your only job is to fill in tablecell.py and ckyparser.py.

Nutshell Task

Complete get_best_parse() in ckyparser.py. As you write the CKY algorithm, you'll need to complete tablecell.py as the data structure for the CKY table.

Pseudocode

You can do this lab with the lectures and textbook, but CKY pseudocode is extremely extremely helpful (that's two extremelys). The only way that pseudocode backfires is if you blindly try to mimic it without understanding why it does what it does. Debugging is then impossible. Understand the pseudocode before using it, please.

Important Object Types

I provide you two classes: Lexicon and Grammar. These store the rules, words, and probabilities from training on the corpus of parse trees. Use them to write your CKY algorithm. There are also two classes from the NLTK package that we use in this lab. The first is nltk.Tree which you'll use at the end to build your tree from your completed CKY table. The second is nltk.grammar.Production which stores a single rule like 'NP -> JJ NN'. See below for all:

Lexicon (custom to this class)

get_all_tags()        # returns a list of all POS tags
get_rule_probability('NNS','cats')   # gives you the probability of NNS->cats
word_exists('cats')   # True if it exists in the Lexicon
tag_exists('NNS')     # True if it exists in the Lexicon

Grammar (custom to this class)

# returns a list pairs: (nltk.Production rule with 'NNS' as first child, its rule probability)
get_binaryrules_by_left_child('NNS')    
get_binaryrules_by_right_child('NNS')   # nltk.Production rules with 'NNS' as second child
get_unaryrules_by_child('NNS')          # returns a list of nltk.Production rules with 'NNS' as the only child

nltk.Production (e.g., NP -> JJ NN)

prod.lhs()   # the LHS non-terminal: wrap it in str(prod.lhs()) to just get a string like 'NP'
prod.rhs()   # a tuple pair of children like (JJ, NN). Again use str(prod.rhs()[0]) for strings

nltk.Tree (only for when your table is completed!)

t = nltk.Tree('NNS', ['cats'])      # makes a tree with a unary rule NNS->cats
s = nltk.Tree('VBZ', ['scratch'])   # makes a tree with a unary rule VBZ->scratch
p = nltk.Tree('S', [t,s])           # makes a tree with an S root and two children subtrees
p.label()   # the LHS non-terminal label as a string
p[0]        # the first child		
p[1]        # the second child
len(p)      # the number of children in this tree

Milestone (due Apr 18)

Setup: Copy the ckyparser.py file to a temporary milestone1.py file until you complete this first step. Submit this temp program when finished for milestone credit. Then copy back to ckyparser.py to complete the lab.

The first milestone is simply about understanding the code and setting up your double array table. Your task is to fill the table's diagonal with POS tags and unary rules.

You will do this by starting in the get_best_parse(sentence) function in ckyparser.py. As a helper, you'll need to write code in tablecell.py as this is intended for you to create a class that represents each table cell. You will then use the Lexicon object in grammar.py to lookup lexicon rules to fill in the diagonal.

You can run the code on a short test example like this:

python3 ckyparser.py -sub miniTest     # simple run of the program on toy "cat" sentence

Fill in the table's diagonal with POS tags. You must call the print_my_table(table) function at the end of get_best_parse() to prove that you've filled it in. Your output should mirror the following:

python3 ckyparser.py -sub miniTest
Training...
Testing...
CKY: ['cats', 'scratch', 'walls', 'with', 'claws']
0 0
{}
0 1
{'N': 0.374999723557831, 'V': 8.333309829111877e-07, 'P': 6.249982371833908e-07, 'NP': 0.3333330876069609}
0 2
{}
0 3
{}
0 4
{}
0 5
{}
1 1
{}
1 2
{'N': 7.812490985437086e-07, 'V': 0.6666667307563363, 'P': 6.249992788349668e-07, 'NP': 6.944436431499632e-07}
1 3
{}
1 4
{}
1 5
{}
2 2
{}
2 3
{'N': 0.1250012620067771, 'V': 8.333365383584828e-07, 'P': 6.250024037688622e-07, 'NP': 0.11111223289491298}
2 4
{}
2 5
{}
3 3
{}
3 4
{'N': 7.812490985437086e-07, 'V': 8.333323717799558e-07, 'P': 0.9999994711352257, 'NP': 6.944436431499632e-07}
3 5
{}
4 4
{}
4 5
{'N': 0.2500004927830853, 'V': 8.333323717799558e-07, 'P': 6.249992788349668e-07, 'NP': 0.22222266025163134}

Submit milestone1.py and tablecell.py to the submit system.

Full Lab

Complete the CKY algorithm in the get_best_parse(sentence) function. Your solution will return an nltk.Tree object containing the parse tree for the given sentence.

Different ways to run the program:

python3 ckyparser.py -sub miniTest        # simple "cats scratch" test for debugging
python3 ckyparser.py                      # medium-sized training run
python3 ckyparser.py -test 1              # parse just one sentence
python3 ckyparser.py -train 20 -test 50   # trains on 20 files instead of 5, then tests on 50 sentences

Table filled for the mini test:

python3 ckyparser.py -sub miniTest
Training...
Testing...
CKY: ['cats', 'scratch', 'walls', 'with', 'claws']
# PRINTING TABLE ONLY FOR DEBUGGING
0 0
{}
0 1
{'N': 0.374999723557831, 'V': 8.333309829111877e-07, 'P': 6.249982371833908e-07, 'NP': 0.3333330876069609}
0 2
{'VP': 3.858009358150566e-13, 'PP': 4.3402605279193865e-13}
0 3
{'S': 0.016461060987704706, 'NP': 2.5720375642819054e-09, 'VP|<NP-PP>': 2.314833807853715e-08}
0 4
{'VP': 8.372438681717783e-26, 'PP': 2.0931096704294457e-26}
0 5
{'S': 0.0018290094143268807, 'NP': 6.350719194680454e-11, 'VP|<NP-PP>': 5.715647275212408e-10}
1 1
{}
1 2
{'N': 7.812490985437086e-07, 'V': 0.6666667307563363, 'P': 6.249992788349668e-07, 'NP': 6.944436431499632e-07}
1 3
{'VP': 0.049383219367392185, 'PP': 6.944506542906349e-08}
1 4
{'S': 2.679187934816875e-19, 'NP': 3.3489849185210943e-20, 'VP|<NP-PP>': 3.014086426668985e-19}
1 5
{'VP': 0.005487032287906261, 'PP': 1.7146954466013983e-09}
2 2
{}
2 3
{'N': 0.1250012620067771, 'V': 8.333365383584828e-07, 'P': 6.250024037688622e-07, 'NP': 0.11111223289491298}
2 4
{'VP': 3.8580350777842917e-13, 'PP': 4.340289462507329e-13}
2 5
{'S': 1.3717570827551186e-08, 'NP': 0.0027435158802065905, 'VP|<NP-PP>': 0.024691642921859318}
3 3
{}
3 4
{'N': 7.812490985437086e-07, 'V': 8.333323717799558e-07, 'P': 0.9999994711352257, 'NP': 6.944436431499632e-07}
3 5
{'VP': 1.234568910204955e-07, 'PP': 0.22222254272589428}
4 4
{}
4 5
{'N': 0.2500004927830853, 'V': 8.333323717799558e-07, 'P': 6.249992788349668e-07, 'NP': 0.22222266025163134}
(S
  (NP (N cats))
  (VP (V scratch) (NP (N walls)) (PP (P with) (NP (N claws)))))
Prec=1.000 Recall=1.000 F1=1.000
OVERALL: Prec=1.000	Recall=1.000	F1=1.000

Expected output for the first real sentence (don't print the table anymore!):

python3 ckyparser.py
Loaded 9648 trees!
Loaded 2012 trees!
Training...
Testing...
CKY: ['Rockwell', 'said', '0', 'the', 'agreement', 'calls', 'for', 'it', 'to', 'supply', '200', 'additional', 'so-called', 'shipsets', 'for', 'the', 'planes', '.']
(S
  (NP (NNP Rockwell))
  (VP
    (VBD said)
    (SBAR
      (-NONE- 0)
      (S
        (NP (DT the) (NN agreement))
        (VP
          (VBZ calls)
          (SBAR
            (IN for)
            (S
              (NP (PRP it))
              (VP
                (TO to)
                (VP
                  (VB supply)
                  (NP
                    (CD 200)
                    (JJ additional)
                    (JJ so-called)
                    (NNS shipsets))
                  (PP (IN for) (NP (DT the) (NNS planes)))))))))))
  (. .))
Prec=1.000 Recall=1.000 F1=1.000

Another full run...

python3 ckyparser.py -train 20 -test 50
...
...
OVERALL: Prec=0.833	Recall=0.849	F1=0.846

PREVIOUSLY - Prec and Recall were 0.83 before finding a bug in add_unaries!

Extra Credit (5%)

Implement vertical markovization.

What to turn in

milestone1.py, ckyparser.py and tablecell.py

How to turn in

Upload all to our submission webpage.