Due: Oct 20 (next week)
You've seen the power of word embeddings in the last couple of lectures. Today's purpose is simply to give you hands-on experience with them. The goal is for you to internalize these concepts by manipulating word embeddings yourself with some short programs.
Download the starter code and extract it.
python3 calc.py
I give you a working calculator (3 + 5 - 2) and you must transform it into a word calculator (king - man + woman). Below is some basic code to load pre-trained embeddings and retrieve some word vectors. You can also review my working word lookup program. The returned embeddings are stored in NumPy's efficient ndarray type. All you need to know for now is that ndarrays support the normal + and - operators! Here's an example:
import gensim.downloader as api

# Download (once) and load the pre-trained GloVe vectors (~66 MB).
embeds = api.load("glove-wiki-gigaword-50")

v1 = embeds['good']
v2 = embeds['morning']
gm = v1 + v2                           # vector arithmetic on ndarrays
sims = embeds.similar_by_vector(gm)    # list of (word, similarity) pairs
print(sims)
You'll notice that the first item in "sims" is sometimes one of the original words in your equation, so let's print both the first and second similar words. You'll also need to know how to initialize a numpy vector of zeros:
import numpy
X = numpy.zeros(50)
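To show how the zeros vector fits in, here is a minimal sketch of the accumulation idea behind the word calculator. It uses a tiny made-up dictionary of 2-dimensional vectors in place of the real embeddings, and the names `toy_embeds` and `eval_equation` are my own, not part of the starter code:

```python
import numpy

# Toy stand-in for the real GloVe embeddings (vectors here are made up).
toy_embeds = {
    'king':  numpy.array([1.0, 2.0]),
    'man':   numpy.array([1.0, 0.0]),
    'woman': numpy.array([0.0, 1.0]),
}

def eval_equation(tokens, embeds, dim=2):
    """Accumulate word vectors: words are added by default,
    and a word following '-' is subtracted."""
    total = numpy.zeros(dim)   # start from a zero vector
    sign = 1.0
    for tok in tokens:
        if tok == '+':
            sign = 1.0
        elif tok == '-':
            sign = -1.0
        else:
            total += sign * embeds[tok]
            sign = 1.0
    return total

result = eval_equation('king - man + woman'.split(), toy_embeds)
print(result)   # [0. 3.]
```

In your real calc.py you would pass the result to similar_by_vector and print the top two words instead.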
The above is all you need. I gave you a working number calculator in calc.py; now change it to operate on words. Your output must match the following:
calc> hello
hello goodbye
calc> hi
hi ho
calc> north
north south
calc> king - man + woman
king queen
calc> father - man + woman
mother daughter
calc> mother - woman + man
father friend
calc> programmer - bad + good
programmer versatile
calc> programmer - good + bad
programmer glitch
For the curious, this page has some great examples of word arithmetic that you can try. The Python examples use a different library called PyTorch, but you can ignore the code and just look at the word arithmetic for inspiration.
python3 search.py -data ../lab4/data
In this part, you will build a robust Twitter search engine! Instead of searching by keywords (boring), we'll search by word embeddings (thrilling!) so that we can match strings that use different but related word choices.
search: hello how are you
0.997 @Shockhound35 hey Anson, I'm good, how are you doing?
0.993 @mdmcQ1209 you are the best!!! Forever!!! Sorry you are!!
0.993 @Saulaanator how are you hun?
0.992 you are welcome @sarahshah
0.992 @ImmaMonsterhehe Why are you crying? :(
0.991 dude, how can you not love jongkey? >:
...
search: sleepy time
0.964 I'm sleepy but it's early so I'm up
0.958 Can't sleep
0.958 Sleep deprivation!
0.958 Sleep
0.958 Sleep >>>
0.954 @iamjasher go bes. Nap nap time
0.949 Time to bed really exhausted #earlybirds tomorrow
search: quit
You must fill in search.py so its output matches the above. Print the top 10 matching tweets with their scores in the exact format shown. I gave you starter code to load 50k tweets to search. I also gave you a cosine(v1,v2) function for measuring the similarity between two vectors.
The last piece you might need is sorting a bunch of similarity scores. You can store tweets and their scores in a dictionary or as pairs in a list [ (score,tweet), (score2,tweet2), ... ]. The latter is probably easiest, and can be sorted like this:
scored = list()
scored.append( (5, 'some tweet') )
scored.append( (2, 'another tweet') )
scored.append( (9, 'yet another tweet') )
scored.sort(reverse=True, key=lambda x:x[0])
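Putting the pieces together, here is a hedged sketch of the scoring loop. The cosine function below is a plain NumPy reimplementation standing in for the provided cosine(v1,v2) helper, and tweet_vector averages a tweet's word vectors, which is one common choice, not the only one. The toy embeddings and tweets are made up for illustration:

```python
import numpy

def cosine(v1, v2):
    # Stand-in for the provided cosine(v1, v2) helper.
    return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))

def tweet_vector(tweet, embeds, dim=2):
    """Average the embeddings of a tweet's known words."""
    total = numpy.zeros(dim)
    words = [w for w in tweet.split() if w in embeds]
    for w in words:
        total += embeds[w]
    return total / max(len(words), 1)

# Toy embeddings and tweets (made up for illustration).
toy_embeds = {'sleepy': numpy.array([1.0, 0.0]),
              'time':   numpy.array([0.0, 1.0]),
              'hello':  numpy.array([-1.0, 0.0])}
tweets = ['sleepy time', 'hello']

query_vec = tweet_vector('sleepy time', toy_embeds)
scored = [(cosine(query_vec, tweet_vector(t, toy_embeds)), t) for t in tweets]
scored.sort(reverse=True, key=lambda x: x[0])
for score, tweet in scored[:10]:
    print('%.3f %s' % (score, tweet))
```

Your real search.py will do the same thing with the 50k loaded tweets and the GloVe embeddings.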
Play around with the 50k once you're finished and have fun searching! Don't forget to clean your tweet strings!
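One simple way to "clean" a tweet before looking up its words, assuming you want lowercasing and punctuation stripping (the exact choices are up to you, and the clean name is my own):

```python
import string

def clean(tweet):
    """Lowercase and drop punctuation so tokens match embedding keys."""
    tweet = tweet.lower()
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    return tweet

print(clean("Can't sleep!!! @iamjasher"))   # cant sleep iamjasher
```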
calc.py, search.py, and readme.txt
Answer the following questions and put them in readme.txt:
Upload everything to our external submission webpage.
Log in, select SI425 and Lab 5, click your name, and use the 'Upload Submission' option in the menu.