Lab 5: word2vec

Due: Oct 17

Motivation

You've seen the power of word embeddings in the last couple of lectures. Today's lab simply gives you hands-on experience with them: the goal is to internalize these concepts by manipulating word embeddings yourself in some short programs.

Generative AI: GenAI tools are permitted as code assistants on this lab (but you are not required or expected to use them). Your grade on this lab is based on your ability to code within the given frameworks, and to achieve the lab outcomes. All GenAI usage must be cited in the code through clear code comments.

Starter Code

Download the starter code and extract it.

You might also find my working word lookup tool useful as an example of using Word2Vec.

Install gensim

You'll need to install the gensim library. Mamba or Conda works here:

conda create -n word2vec     # create an environment for this
conda activate word2vec	
conda install -c conda-forge gensim

If you don't have conda, please install it locally and then run the above:

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh

~/miniconda3/bin/conda init

Or if you're not using conda and want to risk a direct pip install:

pip3 install --user --upgrade gensim

Part 1

python3 calc.py

I give you a working number calculator (3 + 5 - 2), and you must transform it into a word calculator (king - man + woman). Below is some basic code to load pre-trained embeddings and retrieve word vectors. You can also review my working word lookup program. The returned embeddings are stored in numpy's efficient ndarray type. All you need to know for now is that ndarrays support the normal + and - operators! Here's an example:

import gensim.downloader as api

embeds = api.load("glove-wiki-gigaword-50")  # 64MB download on first use

v1 = embeds['good']
v2 = embeds['morning']
gm = v1 + v2                         # vector arithmetic on ndarrays
sims = embeds.similar_by_vector(gm)  # nearest words to the summed vector

print(sims)

You'll notice that the first item in "sims" is sometimes one of the original words in your equation, so let's print both the first and second similar words. You'll also need to know how to initialize a numpy vector of zeros:

import numpy
X = numpy.zeros(50)

The above is all you need. I gave you a working number calculator calc.py. Now change it to do words. Your output must match the following:

calc> hello
hello goodbye
calc> hi
hi ho
calc> north
north south
calc> king - man + woman
king queen
calc> father - man + woman
mother daughter
calc> mother - woman + man
father friend
calc> programmer - bad + good
programmer versatile
calc> programmer - good + bad
programmer glitch

For the curious, this page has some great examples of word arithmetic that you can try. The Python examples there use a different library, PyTorch, but you can ignore the code and just look at the word arithmetic for inspiration.

Part 2

python3 search.py -data ../lab4/data

In this part, you will build a robust Twitter search engine! Instead of searching by keywords (boring), we'll search by word embeddings (thrilling!) so that we can match tweets that use different but related word choices.

search: hello how are you
0.997	@Shockhound35 hey Anson, I'm good, how are you doing?
0.993	@mdmcQ1209 you are the best!!! Forever!!! Sorry you are!!
0.993	@Saulaanator how are you hun?
0.992	you are welcome @sarahshah
0.992	@ImmaMonsterhehe Why are you crying? :(
0.991	dude, how can you not love jongkey? >:
...
search: sleepy time
0.964	I'm sleepy but it's early so I'm up
0.958	Can't sleep
0.958	Sleep deprivation!
0.958	Sleep
0.958	Sleep >>>
0.954	@iamjasher go bes. Nap nap time
0.949	Time to bed really exhausted #earlybirds tomorrow
search: quit

You must fill in search.py so the output is like the above. Print the top 10 matching tweets with their scores, in the exact format above. I gave you starter code to load 50k tweets to search. I also gave you a cosine(v1,v2) function for measuring the similarity between two vectors.
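For reference, the provided cosine(v1,v2) helper most likely computes standard cosine similarity; a minimal numpy version looks like the sketch below (this is my own sketch, not the starter code's implementation). A score near 1 means the vectors point in nearly the same direction.

```python
import numpy

def cosine(v1, v2):
    # Cosine similarity: dot product scaled by both vector lengths.
    # Ranges from -1 (opposite direction) to 1 (same direction).
    return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))

print(cosine(numpy.array([1.0, 2.0]), numpy.array([1.0, 2.0])))  # close to 1.0
```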

The last piece you might need is sorting a bunch of similarity scores. You can store tweets and their scores in a dictionary or as pairs in a list [ (score,tweet), (score2,tweet2), ... ]. The latter is probably easiest: Python sorts tuples by their first element, so a list of (score, tweet) pairs orders itself by score. It can be sorted like this:

scored = list()
scored.append( (5, 'some tweet') )
scored.append( (2, 'another tweet') )
scored.append( (9, 'yet another tweet') )      
scored.sort(reverse=True)
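One way the whole pipeline can fit together is sketched below. The names embed_text and search, and the choice of averaging word vectors to embed a tweet, are my illustrative assumptions, not requirements of search.py; the inline cosine computation stands in for the provided cosine(v1,v2) helper.

```python
import numpy

def embed_text(embeds, text, dim=50):
    # Average the embeddings of the in-vocabulary words in the text.
    total = numpy.zeros(dim)
    count = 0
    for word in text.lower().split():
        if word in embeds:
            total = total + embeds[word]
            count += 1
    return total / count if count > 0 else total

def search(embeds, tweets, query, topn=10, dim=50):
    qvec = embed_text(embeds, query, dim)
    scored = []
    for tweet in tweets:
        tvec = embed_text(embeds, tweet, dim)
        # Cosine similarity, guarding against all-zero vectors
        # (e.g. a tweet with no in-vocabulary words).
        denom = numpy.linalg.norm(qvec) * numpy.linalg.norm(tvec)
        score = numpy.dot(qvec, tvec) / denom if denom > 0 else 0.0
        scored.append((score, tweet))
    scored.sort(reverse=True)
    for score, tweet in scored[:topn]:
        print('%.3f\t%s' % (score, tweet))
```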

Play around with the 50k tweets once you're finished and have fun searching! Don't forget to clean your tweet strings!

What to turn in

calc.py and search.py and readme.txt

Answer the following questions and put them in readme.txt:

  1. Discover 3 interesting (and new) word arithmetic equations for Part 1. Copy/paste them + output into readme.txt. Yes, this will take you some time to explore. You can't do it in 2 minutes. No credit for examples you find on the link I gave you above.
  2. "doctor - man + woman" produces a result that has been used as an example of how AI might learn to stereotype employment choices (there are examples along other dimensions of our shared humanity too). You might try experimenting with other combinations. The question here that you must answer for me: what aspects of the wikipedia/news dataset might have helped to "teach" the learner so that it produces this biased result? Hypothesize what in the language of the training data guided it to this behavior (hint: stating "more women are nurses" is not an answer because that doesn't explain what it sees in the text. What does word2vec actually see/do that causes it?).
  3. Discover two interesting multi-word searches for Part 2 that work well (they match different but related words), and find two searches that don't work well (irrelevant matches). Show me both and explain why the poor results might have been returned/learned.

How to turn in

Upload all to our submission webpage.

submit -c=SI425 -p=lab05 readme.txt calc.py search.py