Lab 6: Introduction to PCFG

Due: next week, Oct 27

Motivation

Before we build full PCFG parsers, this lab will introduce you to some important aspects of English syntax, and you will build basic probabilities over a few rules that we will loosely define. Next week, you will work with an actual parser and learn real PCFGs.

1. Build a Grammar: Verb Tense and Aspect

Your first task is to write some CFG rules for Verb Phrases (VP). Verbs come in many forms, and your job is to focus on a single verb, leave. You must write grammar rules (e.g., VP -> VBG NP) for each of the following tenses and grammatical aspects using the Penn Treebank POS tagset.

Present tense: "leave", "leaves"
Present perfect: "has left", "have left"
Present progressive: "is leaving", "are leaving", "am leaving"
Past tense: "left"
Past perfect: "had left"
Past progressive: "was leaving", "were leaving"
Future: "will leave"

Each of the above 7 verb categories should have its own unique VP -> X X rule! Open a text file and write the 7 VP grammar rules. You will be graded on how accurately your rules capture the forms, and on how well they keep other forms from matching. IMPORTANT: assume the verb leave takes one noun argument, so each rule should include the appropriate NP!
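If it helps to think about what "capture the form and nothing else" means, here is a minimal Python sketch. It only uses the example rule from above (VP -> VBG NP); the helper names parse_rule and matches are made up for illustration, and matching a flat list of category labels is a simplification of what a real parser does.

```python
def parse_rule(text):
    """Split a rule like 'VP -> VBG NP' into ('VP', ('VBG', 'NP'))."""
    lhs, rhs = text.split("->")
    return lhs.strip(), tuple(rhs.split())

def matches(rule, categories):
    """True only if the category sequence is exactly the rule's right-hand side."""
    return rule[1] == tuple(categories)

rule = parse_rule("VP -> VBG NP")
ok = matches(rule, ["VBG", "NP"])      # the form the rule is meant to capture
wrong = matches(rule, ["VBD", "NP"])   # a different form must not match
print(ok, wrong)
```

A rule that is too loose would return True for both sequences, which is exactly what the grading criterion penalizes.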

2. Add Probability to your Rules

Now estimate the probability of each rule (e.g., P(VP->VBG NP)), using one file of tweets from the sentiment Lab 4 data that you downloaded previously (lab4/data/tweets/20111020.txt.gz).

Step 1: Use zgrep (recommended, see below for zgrep help) or write Python code (not recommended) to search and count VP occurrences, and output the count of each. Write the count next to each of your 7 grammar rules. Note that the future and present forms both contain "leave", so take care not to double count across tenses. For example:

zgrep -i "leaving" /courses/nchamber/nlp/lab4/data/tweets/20111020.txt.gz > leaving.txt

The -i flag ignores case, and the command searches for "leaving" as a string in the gzipped file, redirecting the matching lines to a file. Even better, search for your word with word boundaries to avoid matching substrings of bigger words:

zgrep -i "\bleaving\b" /courses/nchamber/nlp/lab4/data/tweets/20111020.txt.gz > leaving.txt

You can then use a tool like wc to count the matching lines (a tweet that contains the word twice still counts once):

wc -l leaving.txt
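If you go the Python route instead, the same count can be sketched roughly as follows. This is only an illustration: the function name count_matching_lines is made up, and the sample tweets are invented stand-ins written to a temporary gzipped file so the snippet runs on its own. Point it at the real data file to get real counts.

```python
import gzip
import os
import re
import tempfile

def count_matching_lines(gz_path, word):
    """Count lines containing `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as f:
        return sum(1 for line in f if pattern.search(line))

# Invented sample tweets standing in for the real data file.
sample = [
    "I am LEAVING the office now",
    "she was leaving, then stayed",
    "no match on this line",
    "unleavingness is not a word",
]
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("\n".join(sample) + "\n")

leaving_count = count_matching_lines(sample_path, "leaving")
print(leaving_count)
```

Like zgrep piped to wc -l, this counts lines rather than individual occurrences.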

Step 2: Calculate the probability of each rule. Remember that P(VP->VBG NP) = P(VBG NP | VP), so each rule's probability is its count divided by the total count over all 7 rules. Your probabilities should of course sum to one! Write the probabilities next to each of your 7 grammar rules.
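The arithmetic for Step 2 can be sketched in a few lines of Python. Everything numeric here is hypothetical (the counts are not from the tweet data), the rule labels are placeholders rather than real rules, and rule_probabilities is a made-up helper name. The subtraction shows one simple way to handle the "leave" vs. "will leave" overlap mentioned in Step 1; it is an approximation, since it works on line counts.

```python
def rule_probabilities(counts):
    """Relative frequency: P(rule) = count(rule) / total count of all rules."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no rule occurrences counted")
    return {rule: n / total for rule, n in counts.items()}

# Hypothetical raw line counts (NOT taken from the tweet data).
raw_leave = 500        # lines matching "leave" as a word
raw_will_leave = 100   # lines matching "will leave"

# Lines with "will leave" also match "leave", so take them out of the present count.
counts = {
    "present rule (placeholder)": raw_leave - raw_will_leave,
    "future rule (placeholder)": raw_will_leave,
    "past rule (placeholder)": 500,
}
probs = rule_probabilities(counts)

# Print in the Rule / Count / Prob layout, tab-separated.
print("Rule\tCount\tProb")
for rule, p in probs.items():
    print(f"{rule}\t{counts[rule]}\t{p:.2f}")
```

With only three placeholder rules the numbers are easy to check by hand; with your seven real rules the same function applies unchanged.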

EXPECTED FORMAT:

After doing the above two steps for one VP rule, you should have something written down along the lines of...

Rule	Count	Prob
VP -> VBD NP	1354	0.16

...and do this for each of your seven rules.

3. Repeat for a second Verb

After you finish the verb leave, pick a different verb that is frequent in English, and compute the probabilities based on that verb alone. The VP rules should be unchanged from 'leave' if you did this correctly (except to substitute your new verb in, of course)! Put these in the same text file below your 'leave' rules, along with the new verb's counts and probabilities.

4. Wh-Question Syntax

You will print and fill out 5 of these: PDF template or Word template.

Asking questions in English is somewhat straightforward. There are relatively well-defined rules to transform a normal English sentence into a wh-question. Take this sentence as an example:

"I ate the bread" -> "What did I eat?"

Below are several sentences with a phrase in bold. Your task is to remove that phrase and ask about it using a wh-word. Step one is to rewrite the sentence as a question. Step two is to draw a parse tree for the question. Step three is to come up with the transformation rules to morph the sentence into the question (e.g., "remove the NP, then put 'what' at the beginning of the sentence"). Step four is to find examples in the Twitter data that start with the same wh-words. List 3 examples each, and see if they match your transformation rules. If not, fix your rules before handing it in!

John picked up the chair.
I am going to the big mall tomorrow.
Susan decided to leave John.
Susan decided to leave John.
We thought about eating the burritos. (This one is not a wh-word question; make it a yes/no question.)
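To make the idea of a "transformation rule" concrete, here is a toy Python sketch for the worked example only ("I ate the bread" -> "What did I eat?"). It is not a general solution and not the expected answer format: the chunk labels (NP-SBJ, NP-OBJ), the BASE_FORM lookup, and the function name are all assumptions made for illustration, and the rule it encodes (drop the object NP, front "What", insert "did", use the base verb) is just one reading of that example.

```python
# Hypothetical toy lookup from a past-tense form to its base form.
BASE_FORM = {"ate": "eat"}

def object_question(chunks):
    """Turn a simple past-tense sentence into a 'What did ...?' question.

    chunks is a list of (label, text) pairs; the chunk labeled NP-OBJ is
    the phrase being asked about.
    """
    kept = [(label, text) for label, text in chunks if label != "NP-OBJ"]
    words = [BASE_FORM[text] if label == "VBD" else text for label, text in kept]
    return "What did " + " ".join(words) + "?"

chunks = [("NP-SBJ", "I"), ("VBD", "ate"), ("NP-OBJ", "the bread")]
question = object_question(chunks)
print(question)  # What did I eat?
```

Your own rules for the five sentences above will need to cover cases this toy ignores, such as asking about the subject or about a time or place.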

Helpful POS Tags

VBZ - 3rd person present (leaves, is)
VBP - non-3rd person present (leave, am, are)
VBG - gerund or present participle (leaving)
VBD - past tense verb (left, was, were, had)
VBN - past participle (left)
VB - base form (leave, as in "to leave" or "will leave")
MD - modal (will)

Helpful grep

Use zgrep on the zipped files. Don't unzip them.

zgrep " apple ": This searches for 'apple' with spaces on either side.

zgrep "\bapple\b": Even better. Searches for 'apple' with word boundaries on both sides. In other words, if it starts or ends a sentence, or has punctuation, apple will still match!

zgrep "pattern" file.gz | wc -l: This searches for your pattern in the gzipped file (file.gz here stands for your data file), and then pipes the matching lines to wc, which counts them for you. It couldn't be easier!
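The difference between the space-delimited pattern and the word-boundary pattern is easy to see on a few lines of text. The sketch below uses Python's re module as a stand-in for grep (the \b syntax behaves the same way for these cases), and the sample lines are invented.

```python
import re

# Invented sample lines.
lines = [
    "apple pie for dessert",   # word starts the line
    "I ate an apple.",         # word is followed by punctuation
    "pineapple juice",         # 'apple' is only a substring
    "an apple a day",          # word has spaces on both sides
]

space_hits = [s for s in lines if re.search(" apple ", s)]
boundary_hits = [s for s in lines if re.search(r"\bapple\b", s)]

print(len(space_hits), len(boundary_hits))
```

The space version finds only the last line, while the word-boundary version also catches the first two and still skips "pineapple".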

What to turn in

One printed page containing two grammars of VP rules: one for the verb leave and one for your verb of choice. Both should have probabilities attached to the rules, as well as the raw counts of each rule, to show how you computed your probabilities (and for partial credit).
Printouts of your Wh-Question answers (5 sheets). Use the template linked to above for easiest formatting.

How to turn in

No auto-submit. Print it out, staple, and bring to class on the due date.

Grading

Verb Phrases 60%

"leave" Grammar: 10%
"leave" Counts/Probabilities: 30%
Second Verb and Counts/Probabilities: 20%

Wh-Questions 40%

Parse trees: 20%
Transformation rules: 10%
Data Examples: 10%

Total: 100%