Due: Oct 20 (next week)
You've seen the power of word embeddings in the last couple of lectures. Today's purpose is simply to give you hands-on experience with them. The goal is for you to internalize these concepts by manipulating word embeddings yourself with some short programs.
Download the starter code and extract it.
python3 calc.py
I give you a working calculator (3 + 5 - 2) and you must transform it into a word calculator (king - man + woman). Below is some basic code to load pre-trained embeddings and retrieve some word vectors. You can also review my working word lookup program. The returned embeddings are stored in NumPy's efficient ndarray type. All you need to know for now is that ndarrays support the normal + and - operators! Here's an example:
import gensim.downloader as api

# Download (once) and load the pre-trained GloVe vectors (~66 MB).
embeds = api.load("glove-wiki-gigaword-50")

v1 = embeds['good']
v2 = embeds['morning']
gm = v1 + v2                           # vector arithmetic on ndarrays
sims = embeds.similar_by_vector(gm)    # list of (word, similarity) pairs
print(sims)
You'll notice that the first item in "sims" is sometimes one of the original words in your equation, so let's print both the first and second similar words. You'll also need to know how to initialize a numpy vector of zeros:
import numpy
X = numpy.zeros(50)
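To show how the zeros vector fits in, here is a minimal sketch of the accumulation idea behind the word calculator. It uses a tiny made-up dictionary of 2-dimensional vectors in place of the real embeddings, and the names `toy_embeds` and `eval_equation` are my own, not part of the starter code:

```python
import numpy

# Toy stand-in for the real GloVe embeddings (vectors here are made up).
toy_embeds = {
    'king':  numpy.array([1.0, 2.0]),
    'man':   numpy.array([1.0, 0.0]),
    'woman': numpy.array([0.0, 1.0]),
}

def eval_equation(tokens, embeds, dim=2):
    """Accumulate word vectors: words are added by default,
    and a word following '-' is subtracted."""
    total = numpy.zeros(dim)   # start from a zero vector
    sign = 1.0
    for tok in tokens:
        if tok == '+':
            sign = 1.0
        elif tok == '-':
            sign = -1.0
        else:
            total += sign * embeds[tok]
            sign = 1.0
    return total

result = eval_equation('king - man + woman'.split(), toy_embeds)
print(result)   # [0. 3.]
```

In your real calc.py you would pass the result to similar_by_vector and print the top two words instead.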
The above is all you need. I gave you a working number calculator in calc.py; now change it to operate on words. Your output must match the following:
calc> hello
hello goodbye
calc> hi
hi ho
calc> north
north south
calc> king - man + woman
king queen
calc> father - man + woman
mother daughter
calc> mother - woman + man
father friend
calc> programmer - bad + good
programmer versatile
calc> programmer - good + bad
programmer glitch
For the curious, this page has some great examples of word arithmetic that you can try. The Python examples use a different library called PyTorch, but you can ignore the code and just look at the word arithmetic for inspiration.
python3 search.py -data ../lab4/data
In this part, you will build a robust Twitter search engine! Instead of searching by keywords (boring), we'll search by word embeddings (thrilling!) so that we can match strings that use different but related word choices.
search: hello how are you
0.997 @Shockhound35 hey Anson, I'm good, how are you doing?
0.993 @mdmcQ1209 you are the best!!! Forever!!! Sorry you are!!
0.993 @Saulaanator how are you hun?
0.992 you are welcome @sarahshah
0.992 @ImmaMonsterhehe Why are you crying? :(
0.991 dude, how can you not love jongkey? >:
...
search: sleepy time
0.964 I'm sleepy but it's early so I'm up
0.958 Can't sleep
0.958 Sleep deprivation!
0.958 Sleep
0.958 Sleep >>>
0.954 @iamjasher go bes. Nap nap time
0.949 Time to bed really exhausted #earlybirds tomorrow
search: quit
You must fill in search.py so its output matches the above. Print the top 10 matching tweets with their scores in the exact format shown. I gave you starter code to load 50k tweets to search. I also gave you a cosine(v1,v2) function for measuring the similarity between two vectors.
The last piece you might need is sorting a bunch of similarity scores. You can store tweets and their scores in a dictionary or as pairs in a list [ (score,tweet), (score2,tweet2), ... ]. The latter is probably easiest, and can be sorted like this:
scored = list()
scored.append( (5, 'some tweet') )
scored.append( (2, 'another tweet') )
scored.append( (9, 'yet another tweet') )
scored.sort(reverse=True, key=lambda x:x[0])
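Putting the pieces together, here is a hedged sketch of the scoring loop. The cosine function below is a plain NumPy reimplementation standing in for the provided cosine(v1,v2) helper, and tweet_vector averages a tweet's word vectors, which is one common choice, not the only one. The toy embeddings and tweets are made up for illustration:

```python
import numpy

def cosine(v1, v2):
    # Stand-in for the provided cosine(v1, v2) helper.
    return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))

def tweet_vector(tweet, embeds, dim=2):
    """Average the embeddings of a tweet's known words."""
    total = numpy.zeros(dim)
    words = [w for w in tweet.split() if w in embeds]
    for w in words:
        total += embeds[w]
    return total / max(len(words), 1)

# Toy embeddings and tweets (made up for illustration).
toy_embeds = {'sleepy': numpy.array([1.0, 0.0]),
              'time':   numpy.array([0.0, 1.0]),
              'hello':  numpy.array([-1.0, 0.0])}
tweets = ['sleepy time', 'hello']

query_vec = tweet_vector('sleepy time', toy_embeds)
scored = [(cosine(query_vec, tweet_vector(t, toy_embeds)), t) for t in tweets]
scored.sort(reverse=True, key=lambda x: x[0])
for score, tweet in scored[:10]:
    print('%.3f %s' % (score, tweet))
```

Your real search.py will do the same thing with the 50k loaded tweets and the GloVe embeddings.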
Play around with the 50k once you're finished and have fun searching! Don't forget to clean your tweet strings!
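One simple way to "clean" a tweet before looking up its words, assuming you want lowercasing and punctuation stripping (the exact choices are up to you, and the clean name is my own):

```python
import string

def clean(tweet):
    """Lowercase and drop punctuation so tokens match embedding keys."""
    tweet = tweet.lower()
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    return tweet

print(clean("Can't sleep!!! @iamjasher"))   # cant sleep iamjasher
```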
calc.py, search.py, and readme.txt
Answer the following questions and put them in readme.txt:
Upload everything to our external submission webpage.
Log in, select SI425 and Lab 5, click your name, and use the 'Upload Submission' option in the menu.