Your lab today explores one of the most buzzworthy sources of data -- social media. Whereas many data formats are numeric or organized nicely into columns and rows, social media has its data encoded in natural language. The "data" is expressed with words and phrases. The "meaning" is hidden in this human format full of ambiguity, sarcasm, metaphor, and even emoticons. We'll look at some basic tools to process language and attempt to discover insights from it.
In today's lab, you will perform Data Processing, Data Storage, and Data Visualization.
Create a new folder lab07 inside your SD211 folder for this lab. Create an initial Python program called tweets.py.
Download usna-tweets.tgz into your lab folder. Extract it ("tar -xvf usna-tweets.tgz").
Open one of the files. You'll see that each row contains one tweet with its timestamp, separated by a tab character '\t'.
2021-09-02T13:40:28.000Z The US Naval Academy Chapel this morning just before dawn. A beautiful, dry, cool morning after a day of torrential thunderstorms and a tornado. https://t.co/tWzLzav3iT
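Since each row has exactly one tab between the timestamp and the tweet text, you can pull the two fields apart with Python's string split. A minimal sketch (the sample line below is shortened from the file):

```python
# One line from the file: timestamp, a tab, then the tweet text.
line = "2021-09-02T13:40:28.000Z\tThe US Naval Academy Chapel this morning."

# Split on the first tab only, so tabs inside the tweet (if any) survive.
timestamp, text = line.rstrip("\n").split("\t", maxsplit=1)
print(timestamp)  # 2021-09-02T13:40:28.000Z
print(text)       # The US Naval Academy Chapel this morning.
```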
These contain all tweets that mentioned "Naval Academy", "midshipman", or "USNA" in two academic year time periods:
Sep 1,2019-May 15,2020 (AY2020) and Sep 1,2021-May 15,2022 (AY2022)
Today's lab uses the wordcloud library along with the matplotlib visualization library. Install them from your terminal:
conda install -c conda-forge matplotlib wordcloud
Write your first program, clean.py. Your program must ask the user for a filename, open the file, loop over its lines, and print each tweet's text out cleaned. This means you will alter each tweet string to remove punctuation, URLs, and other noise.
Below are a few tweets from the sample file, followed by your expected program output:
2021-09-25T16:04:32.000Z Online Notary Spicer Shimmies Into Federal Court Demanding Attention And Spot On Naval Academy Board https://t.co/ufdkt5tCyV onlinenotaryexperts
2022-02-03T20:22:26.000Z @thehill @RepCawthorn Hey Mandy, did you learn about “fighters” at the Naval Academy…oh that’s right?! Haha 😂
2021-11-12T00:04:41.000Z Rep. Elaine Luria, #VA02, Democrat US Navy 🇺🇸Graduated from the US Naval Academy 🇺🇸Engineer, nuclear reactors 🇺🇸1st female US sailor to spend entire career on combat ships 🇺🇸Retired rank: Commander #wtpBLUE #wtp1119 #DemsDeliver https://t.co/LQyBuu9EFQ
2022-03-29T00:44:01.000Z @CawthornforNC Omg did he lie about getting into the Naval Academy?
2022-03-03T11:16:33.000Z Airdrops worth 100,000$ #NFTs and 500 #whitelist come and participate https://t.co/5wj0MgWFx8 @menunggubigwiin @missubigwinner @NFTethh @saffcody @pickme_nailucky @bigwin_fatimah @jaheecityx @usna_bigwin @favparahost @akubigwinper1
2021-10-28T05:28:26.000Z @CawthornforNC You getting into the Naval Academy ...
Your program output:
> python3 clean.py
Filename? sample.tsv
online notary spicer shimmies into federal court demanding attention and spot on board onlinenotaryexperts
@thehill @repcawthorn hey mandy did you learn about fighters at the academyoh that’s right haha 😂
rep elaine luria #va02 democrat us 🇺🇸graduated from the us 🇺🇸engineer nuclear reactors 🇺🇸1st female us sailor to spend entire career on combat ships 🇺🇸retired rank commander #wtpblue #wtp1119 #demsdeliver
@cawthornfornc omg did he lie about getting into the
airdrops worth 100000$ #nfts and 500 #whitelist come and participate @menunggubigwiin @missubigwinner @nftethh @saffcody @pickme_nailucky @bigwin_fatimah @jaheecityx @usna_bigwin @favparahost @akubigwinper1
@cawthornfornc you getting into the ...
For this program, you must write and use a function called clean_tweet(string) that accepts one string argument, and returns a cleaned copy of it. The main part of your program will ask the user for a filename, read the lines from the file, and simply call your clean_tweet function on each line to get a cleaned version.
You must do the following to get full credit for cleaning in your clean_tweet function:
HINT: to do #4 and #5 above, you need to look at each individual word -- so split the string into words, and then build a new string by concatenating the words back together again, skipping things you don't want.
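Here is one way that hint can be sketched. The exact cleaning rules are the ones in the list above; the two specific rules shown here (dropping URLs and trimming punctuation) are just illustrative examples, not the full required list:

```python
import string

def clean_tweet(tweet):
    """Return a cleaned, lowercased copy of one tweet string (illustrative sketch)."""
    tweet = tweet.lower()
    words = tweet.split()                      # break the tweet into words
    kept = []
    for word in words:
        if word.startswith("http"):            # example skip rule: drop URLs
            continue
        word = word.strip(string.punctuation)  # trim leading/trailing punctuation
        if word:                               # drop words that were all punctuation
            kept.append(word)
    return " ".join(kept)                      # rebuild one cleaned string

print(clean_tweet("Check THIS out! https://t.co/abc123"))  # check this out
```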
You used string functions in the chatbot lab, but here is a reminder of functions that you'll need.
Copy clean.py to tweets.py
This step will stop printing tweets to the terminal, and instead call a popular word cloud library to visualize the discussion. Your job is to look at how the word cloud library can be used, and then adjust your program to create a cloud using all the cleaned tweets!
I'm showing you an example piece of code on how to use the library. Look for the doc variable. In this example, that's the string with all the words in it:
# Don't forget your library imports!
import matplotlib.pyplot as plt # we had this one before
from wordcloud import WordCloud # new for WordCloud
# The cloud! 'doc' is a long string of words
doc = "This is a long string with happy words to put in a visual word cloud...blah blah...it makes repeated words bigger than single occurrence words. It splits all the words for you, easy peasy."
cloud = WordCloud(width=480, height=480, margin=0).generate(doc)
# Now popup the display of our generated cloud image.
# You probably don't need to adjust these details.
plt.imshow(cloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()
Write a function create_cloud(lines) that takes a List of strings from your tweets file, and draws the word cloud. You should give this function the raw list of strings straight from the file. Make sure you clean those strings inside this function before making the word cloud!
After you generate a word cloud for this step, spend a minute looking at it. What are the biggest words that are shown? Are they useful or uninformative? Add the big uninformative tokens/words/letters to your stop word list until the cloud only shows meaningful words. For instance, "US" isn't useful in understanding this, right? Of course "US" appears here because people write "us naval academy". That doesn't help us understand the topics that are discussed. What else do you see that should be hidden? You should see a few that can be added to your stop word list. If you do this right, then better words and phrases should be easier to see.
Copy tweets.py to tweetstime.py
Your word cloud from the prior step shows word usage over a 9 month period. There is clearly one topic that dominates everything, but perhaps that just spiked during a short time frame, and we can't see what else was discussed about the Naval Academy.
Change your program (now in tweetstime.py) to be more useful by asking the user for a start and end time. You will then make a word cloud that only uses tweets in that time period. You've been ignoring the dates of tweets in this lab, but now you'll need to use that information when you split your lines from the file.
Change your create_cloud(lines) function to now require create_cloud(lines, start_date, end_date). The date parameters here must be datetime objects (from Python's datetime module), which I'll describe below.
But in short, a working program must follow this exact input/output:
> python3 tweetstime.py
Filename? tweets.tsv
Start date (yyyy-mm-dd)? 2021-12-20
End date (yyyy-mm-dd)? 2022-02-20
How about those datetime variables? Here the user types in a string version of a date. Let's convert it to a Python date/time object like so:
from datetime import datetime
x = datetime.strptime('2021-12-20', '%Y-%m-%d')
y = datetime.strptime('2021-12-25', '%Y-%m-%d')
Once you do this, Python can do the date comparisons for you!
if y > x:
    print('The second date is later!')
else:
    print('The first date is later (or they are the same)!')
The user's dates don't have times, but sometimes you'll have both, like in our tweets data file. You just need to change the format string (the second argument) to match, like so:
z = datetime.strptime('2022-03-31T23:42:25.000Z', '%Y-%m-%dT%H:%M:%S.000Z')
Given these tools, you can now write your create_cloud(lines, start_date, end_date) function and complete this step.
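Here is a sketch of just the date-filtering piece, assuming each line starts with the timestamp, then a tab, then the tweet text, as in the sample above (in_range is an illustrative helper name):

```python
from datetime import datetime

def in_range(line, start_date, end_date):
    """Return True if this tweet line's timestamp falls in [start_date, end_date]."""
    stamp, _text = line.split("\t", maxsplit=1)
    when = datetime.strptime(stamp, '%Y-%m-%dT%H:%M:%S.000Z')
    return start_date <= when <= end_date

start = datetime.strptime('2021-12-20', '%Y-%m-%d')
end = datetime.strptime('2022-02-20', '%Y-%m-%d')
line = "2022-01-15T08:30:00.000Z\tGo Navy, beat Army!"
print(in_range(line, start, end))  # True
```

Inside create_cloud(lines, start_date, end_date), you would keep only the lines for which this test is True before cleaning them and building the cloud.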
This is a small but useful change to Step 3. Keep the same functionality, but make your program loop until the user enters "quit" for the start date. Your terminal should look like the following, with a new word cloud appearing after each start/end date is entered. You may assume the user closes the word cloud when ready to enter a new date range.
> python3 tweetstime.py
Filename? tweets.tsv
Start date (yyyy-mm-dd)? 2021-12-20
End date (yyyy-mm-dd)? 2022-02-20
Start date (yyyy-mm-dd)? 2022-03-01
End date (yyyy-mm-dd)? 2022-04-01
Start date (yyyy-mm-dd)? 2021-10-01
End date (yyyy-mm-dd)? 2022-10-14
Start date (yyyy-mm-dd)? quit
Use your program to find trending topics about USNA during these time periods. Find unique periods of time whose word cloud is distinct from most other periods, and use the Web to look up what it was about if you don't immediately know.
Create a Word doc and save it as a pdf (README.pdf) in which you copy/paste at least FOUR different word clouds, and a brief written description of what happened during that time to generate the Twitter discussion around each. Each description should be a normal paragraph in length, accompanied by the word cloud. Make sure to include the date range from which you generated it.
Your three programs (clean.py, tweets.py, and tweetstime.py) and your README.pdf with your cloud analysis.
Use the command-line to submit your files:
submit -c=sd211 -p=lab07 clean.py tweets.py tweetstime.py README.pdf
...or if you're in the lab rather than on your laptop, you can visit the submit website and upload the files.