Lab Due: Nov 21
We spent the past week looking at parts of speech and syntactic trees, as well as the complexities of choosing the correct interpretation. Today you will explore how parts of speech and syntax can help extract concepts from large amounts of text. This lab is a case study in when you might want to use these concepts: even in today's age of LLMs, sometimes you want a simpler solution that runs quickly and doesn't require an expensive model.
You will download tweets from US Senators during the first year of President Biden's presidency. Your objective is to identify the Top 20 topics that Republicans and Democrats liked to tweet about, but that the other party did not. You will identify the noun phrases used by each party, then filter them based on relative frequency differences. Finally, you'll print out the best noun phrases for each party!
Generative AI: GenAI tools are permitted as code assistants on this lab (but you are not required or expected to use them). Your grade on this lab is based on your ability to code within the given frameworks, and to achieve the lab outcomes. All GenAI usage must be cited in the code through clear code comments.
This lab just uses the nltk and datasets libraries. To save time, you can reuse your BERT environment from lab 6. But run the install command just to make sure both libraries are installed.
conda activate bert
conda install nltk datasets
If this does not work or you get errors, just create a new environment! That's the beauty of conda/mamba environments:
conda create -n politics
conda activate politics
conda install nltk datasets
You may use GenAI tools, if you desire. However, this lab will be graded differently. If I see lines of code that are irrelevant to your solution, or lines of code that are overly complex with no clear benefit, you will lose software design points.
This new policy is based on some recent submissions in the past two labs. Some of you are blindly pasting sloppy code, and even if it does not lead to immediate bugs, it is a bad habit to fall into. Imagine pasting code that you don't fully understand into an actual secure system, and how you could put the entire organization at risk. Understand what you use before you use it.
Below is a section on "tools you need" for part of speech tagging. Using these tools, your task is to count all noun phrases in the republican party, and separately count all noun phrases in the democratic party. You should use a dictionary for each party where the keys are noun phrases, and the values are their frequency counts.
After counting, trim each dictionary by removing all phrases that were only seen once in that party. This removes a large chunk of the overall counts.
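Trimming can be done in one pass with a dictionary comprehension. A minimal sketch, using a tiny hypothetical counts dictionary (the phrases below are made up, not real lab data):

```python
# Hypothetical noun-phrase counts for one party
counts = {"border": 5, "infrastructure week": 1, "gas prices": 3, "chits": 1}

# Keep only phrases seen more than once in that party
trimmed = {phrase: c for phrase, c in counts.items() if c > 1}
print(trimmed)  # {'border': 5, 'gas prices': 3}
```

You'll do this once per party, on that party's own dictionary.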
After trimming, convert each party's dictionary of counts into a probability distribution over that party's overall counts. You will then have noun phrases as keys, and their values are probabilities. After this step, you should have one dictionary for republican probabilities, and one dictionary for democratic probabilities.
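Converting counts to probabilities just means dividing each count by the party's total. A sketch with hypothetical counts (not real lab data):

```python
# Hypothetical trimmed counts for one party
counts = {"border": 5, "gas prices": 3, "classes": 2}

# Normalize: each phrase's count divided by the total count
total = sum(counts.values())
probs = {phrase: c / total for phrase, c in counts.items()}
print(probs["border"])  # 0.5
```

After this step the values in each party's dictionary should sum to 1.0.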
Finally, print out the Top 20 noun phrases for republicans based on highest probability, but only if their probability is 2 times greater than the democratic party's probability for that phrase. After printing republicans, do the inverse: print the Top 20 phrases for democrats that are also 2 times greater than the republican probability.
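Putting the filter and the sort together might look like the sketch below. The two dictionaries here are tiny hypothetical examples (rep_probs and dem_probs are placeholder names, not part of the starter code):

```python
# Hypothetical probability dictionaries for each party
rep_probs = {"border": 0.5, "classes": 0.3, "gas prices": 0.2}
dem_probs = {"border": 0.1, "classes": 0.4, "gas prices": 0.5}

# Keep phrases at least 2x more probable for republicans;
# .get(p, 0.0) handles phrases the other party never used
distinct = [(p, pr) for p, pr in rep_probs.items()
            if pr > 2 * dem_probs.get(p, 0.0)]

# Sort by probability, highest first, and print the Top 20
for phrase, prob in sorted(distinct, key=lambda x: x[1], reverse=True)[:20]:
    print(phrase, prob)
```

Note the .get() call: a phrase one party used but the other never did should still be eligible, since 2 times zero is zero.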
When running your program, it should ONLY print the following in this exact format. You will lose points if you print other things before/after this, such as debugging print statements:
> python3 chunkanalysis.py
[nltk_data] Downloading package punkt_tab to /Users/nate/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/nate/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]     date!
***** Republicans *****
parade practice 0.0047514
classes 0.0045234
homework assignments 0.003614
...
***** Democrats *****
dress blues 0.0068234
leave policies 0.005243
chits 0.004343
...
Create a file called readme.txt and write a detailed paragraph of analysis of the key noun phrases that your program discovers. Do they make sense? Do you see groups of phrases? Are any of them confusing?
# Import the datasets library
from datasets import load_dataset
...

# Downloads the tweets into a dataset of rows with 'text' and 'party' fields
tweets = load_dataset("m-newhauser/senator-tweets")['train']

# Loop over each row, grabbing the tweet text and party affiliation
for row in tweets:
    tweet = row['text']
    party = row['party']
# Import nltk and download needed software
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
...

# Tokenize and POS-tag a sentence
tokens = nltk.word_tokenize("This is a sentence that will be tagged.")
tagged = nltk.pos_tag(tokens)
print(tagged)  # debugging print() to see what it is!
Chunking is the process of taking POS tags, and merging the ones that are phrases. You do this with a basic grammar written as regex expressions. Notice this is NOT full syntactic parsing with the CKY algorithm. This is just a quick regex matching of tags, chunking phrases together.
# A grammar with one rule (an NP rule)
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}
"""
chunk_parser = nltk.RegexpParser(grammar)

# Chunk the tags! Loop and print.
chunked = chunk_parser.parse(tagged)  # 'tagged' is the variable from tagging above
for subtree in chunked.subtrees():
    if subtree.label() == "NP":
        for wordtag in subtree.leaves():
            print(wordtag)  # debug print to see what it is!
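Each NP subtree's leaves are (word, tag) tuples, but your counting dictionary needs a single phrase string as its key. A sketch of that conversion, using a hypothetical leaves list in place of the real subtree.leaves() output:

```python
# Hypothetical leaves from one NP subtree: (word, POS-tag) pairs
leaves = [("The", "DT"), ("gas", "NN"), ("prices", "NNS")]

# Drop the tags and join the words into one lowercase phrase string
phrase = " ".join(word for word, tag in leaves).lower()
print(phrase)  # the gas prices

# Count the phrase in the appropriate party's dictionary
counts = {}
counts[phrase] = counts.get(phrase, 0) + 1
```

Lowercasing is a judgment call, but without it "Gas Prices" and "gas prices" would be counted as different phrases.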
Run the same analysis for verb phrases. Submit topverbs.txt with your output.
Copy your program output to a file named topnouns.txt
Submit chunkanalysis.py and readme.txt and topnouns.txt
Upload all to our submission webpage.
Login, select SI425 and Lab08. Click your name and use the 'Upload Submission' option in the menu.