Milestone Due: next week Nov 3
Lab Due: next week Nov 13
We have studied syntax and parse trees for a bit, as well as the complexities behind choosing the correct human interpretation. Today you will implement the CKY algorithm to instead have your computer automatically parse sentences for you. You'll see they are much quicker and often better than a linguistic novice like yourself.
Download the starter code and data and extract it.
data/: lots of parse trees -- open a file and look! ptbreader.py: utility program that reads parse trees from text files grammar.py: all the functions you need to lookup grammar rules/probabilities tablecell.py: empty class that you will fill in ckyparser.py: the meat of the program where you will complete get_best_parse(sentence) Your only job is to fill in tablecell.py and ckyparser.py.
Complete get_best_parse() in ckyparser.py. As you write the CKY algorithm, you'll need to complete tablecell.py as the data structure for the CKY table.
You can do this lab with the lectures and textbook, but CKY pseudocode is extremely extremely helpful (that's two extremelys). The only way that pseudocode backfires is if you blindly try to mimic it without understanding why it does what it does. Debugging is then impossible. Understand the pseudocode before using it, please.
I provide you two classes: Lexicon and Grammar. These store the rules, words, and probabilities from training on the corpus of parse trees. Use them to write your CKY algorithm. There are also two classes from the NLTK package that we use in this lab. The first is nltk.Tree which you'll use at the end to build your tree from your completed CKY table. The second is nltk.grammar.Production which stores a single rule like 'NP -> JJ NN'. See below for all:
Lexicon (custom to this class)
get_all_tags() # returns a list of all POS tags
get_rule_probability('NNS','cats') # gives you the probability of NNS->cats
word_exists('cats') # True if it exists in the Lexicon
tag_exists('NNS') # True if it exists in the Lexicon
Grammar (custom to this class)
get_binaryrules_by_left_child('NNS') # returns a list of nltk.Production rules with 'NNS' as first child
get_binaryrules_by_right_child('NNS') # returns a list of nltk.Production rules with 'NNS' as second child
get_unaryrules_by_child('NNS') # returns a list of nltk.Production rules with 'NNS' as the only child
nltk.Production (e.g., NP -> JJ NN)
prod.lhs() # the LHS non-terminal: wrap it in str(prod.lhs()) to just get a string like 'NP' prod.rhs() # a tuple pair of children like (JJ, NN). Again use str(prod.rhs()[0]) for strings
nltk.Tree (only for when your table is completed!)
t = nltk.Tree('NNS', ['cats']) # makes a tree with a unary rule NNS->cats
s = nltk.Tree('VBZ', ['scratch']) # makes a tree with a unary rule VBZ->scratch
p = nltk.Tree('S', [t,s]) # makes a tree with an S root and two children subtrees
p.label() # the LHS non-terminal label as a string
p[0] # the first child
p[1] # the second child
len(p) # the number of children in this tree
The first milestone is simply about understanding the code and setting up your double array table. Your task is to fill the table's diagonal with POS tags and unary rules.
You will do this by starting in the get_best_parse(sentence) function in ckyparser.py. As a helper, you'll need to write code in tablecell.py as this is intended for you to create a class that represents each table cell. You will then use the Lexicon object in grammar.py to lookup lexicon rules to fill in the diagonal.
You can run the code on a short test example like this:
python3 ckyparser.py -sub miniTest # simple run of the program on toy "cat" sentence
Fill in the table's diagonal with POS tags. You must call the print_my_table(table) function at the end of get_best_parse() to prove that you've filled it in. Your output should mirror the following:
python3 ckyparser.py -sub miniTest
Training...
Testing...
CKY: ['cats', 'scratch', 'walls', 'with', 'claws']
0 1
{'NP': 0.4027763060987999, 'P': 0.06249977163602068, 'V': 0.08333302884802758, 'N': 0.4531233443611499}
0 2
{}
0 3
{}
0 4
{}
0 5
{}
1 2
{'NP': 0.0694442485767039, 'P': 0.06249982371903352, 'V': 0.7499978846284022, 'N': 0.0781247796487919}
1 3
{}
1 4
{}
1 5
{}
2 3
{'NP': 0.18055534722350425, 'P': 0.06249992788505917, 'V': 0.08333323718007889, 'N': 0.2031247656264423}
2 4
{}
2 5
{}
3 4
{'NP': 0.0694442485767039, 'P': 1.0624970032235699, 'V': 0.08333309829204469, 'N': 0.0781247796487919}
3 5
{}
4 5
{'NP': 0.2916658440221564, 'P': 0.06249982371903352, 'V': 0.08333309829204469, 'N': 0.328124074524926}
Complete the CKY algorithm in the get_best_parse(sentence) function. Your solution will return an nltk.Tree object containing the parse tree for the given sentence.
Different ways to run the program:
python3 ckyparser.py -sub miniTest # simple "cats scratch" test for debugging python3 ckyparser.py # medium-sized training run python3 ckyparser.py -test 1 # parse just one sentence python3 ckyparser.py -train 20 -test 50 # trains on 20 files instead of 5, then tests on 50 sentences
Table filled for the mini test:
python3 ckyparser.py -sub miniTest
CKY: ['cats', 'scratch', 'walls', 'with', 'claws']
# PRINTING TABLE ONLY FOR DEBUGGING
0 1
{'N': 0.4531233443611499, 'V': 0.08333302884802758, 'P': 0.06249977163602068, 'NP': 0.4027763060987999}
0 2
{'VP': 0.0038579997133147102, 'PP': 0.004340249677479049}
0 3
{'S': 0.036361605342502434, 'NP': 0.000505022296423645, 'VP|<NP-PP>': 0.004545200667812805}
0 4
{'VP': 8.372361936514405e-06, 'PP': 2.0930904841286013e-06}
0 5
{'S': 0.0056341232123039015, 'NP': 1.7389269173777474e-05, 'VP|<NP-PP>': 0.00015650342256399728}
1 2
{'N': 0.0781247796487919, 'V': 0.7499978846284022, 'P': 0.06249982371903352, 'NP': 0.0694442485767039}
1 3
{'VP': 0.09027741898398323, 'PP': 0.011284677372997905}
1 4
{'S': 0.00026791656088833246, 'NP': 3.348957011104156e-05, 'VP|<NP-PP>': 0.00030140613099937405}
1 5
{'VP': 0.013988219085861191, 'PP': 0.000388561641273922}
2 3
{'N': 0.2031247656264423, 'V': 0.08333323718007889, 'P': 0.06249992788505917, 'NP': 0.18055534722350425}
2 4
{'VP': 0.0038580093582898816, 'PP': 0.004340260528076117}
2 5
{'S': 0.0029256488449005903, 'NP': 0.006217003795413754, 'VP|<NP-PP>': 0.05595303415872379}
3 4
{'N': 0.0781247796487919, 'V': 0.08333309829204469, 'P': 1.0624970032235699, 'NP': 0.0694442485767039}
3 5
{'VP': 0.016203612298887022, 'PP': 0.3098940852162143}
4 5
{'N': 0.328124074524926, 'V': 0.08333309829204469, 'P': 0.06249982371903352, 'NP': 0.2916658440221564}
(S
(NP (N cats))
(VP (V scratch) (NP (N walls)) (PP (P with) (NP (N claws)))))
Prec=1.000 Recall=1.000 F1=1.000
OVERALL: Prec=1.000 Recall=1.000 F1=1.000
Expected output for the first real sentence (don't print the table anymore!):
python3 ckyparser.py
Loaded 9648 trees!
Loaded 2012 trees!
Training...
Testing...
CKY: ['Rockwell', 'said', '0', 'the', 'agreement', 'calls', 'for', 'it', 'to', 'supply', '200', 'additional', 'so-called', 'shipsets', 'for', 'the', 'planes', '.']
(S
(NP (NNP Rockwell))
(VP
(VBD said)
(SBAR
(-NONE- 0)
(S
(NP (DT the) (NN agreement))
(VP
(VBZ calls)
(SBAR
(IN for)
(S
(NP (PRP it))
(VP
(TO to)
(VP
(VB supply)
(NP
(CD 200)
(JJ additional)
(JJ so-called)
(NNS shipsets))
(PP (IN for) (NP (DT the) (NNS planes)))))))))))
(. .))
Prec=1.000 Recall=1.000 F1=1.000
Another full run...
python3 ckyparser.py -train 20 -test 50
...
...
OVERALL: Prec=0.832 Recall=0.827 F1=0.829
Implement vertical markovization.
ckyparser.py and tablecell.py
Upload all to our external submission webpage.
Login, select SI425 and Lab07. Click your name and use the 'Upload Submission' option in the menu.