Word2Vec Example Code

Install gensim

You'll need to install the gensim library:

conda create -n word2vec     # create an environment for this
conda activate word2vec
conda install -c conda-forge gensim

Or if you're not using conda and want to risk a direct pip install:

pip3 install --user --upgrade gensim

Example Program to Query Word2Vec

This program loads 50-dimensional GloVe word embeddings trained on Wikipedia and the Gigaword Corpus (news articles), accessed through gensim's word2vec-style interface. It then lets the user issue command-line queries for word similarity, nearest neighbors, and raw embedding vectors.

Cmd: sim obama bush
Sim between obama and bush = 0.96
Cmd: sim obama ginsburg
Sim between obama and ginsburg = 0.20
Cmd: mostsim ginsburg
Most similar [('breyer', 0.9350610971450806), ('souter', 0.8723862767219543), ('rehnquist', 0.8637943267822266), ('bader', 0.8579131960868835), ('scalia', 0.8438905477523804), ('dissented', 0.7967802882194519), ('justices', 0.7810587286949158), ("o'connor", 0.7615455389022827), ('alito', 0.7479578852653503), ('stevens', 0.7473843693733215)]
Cmd: sim computer laptop
Sim between computer and laptop = 0.77
Cmd: vec computer
computer embedding: [ 0.079084 -0.81504   1.7901    0.91653   0.10797  -0.55628  -0.84427
 -1.4951    0.13418   0.63627   0.35146   0.25813  -0.55029   0.51056
  0.37409   0.12092  -1.6166    0.83653   0.14202  -0.52348   0.73453
  0.12207  -0.49079   0.32533   0.45306  -1.585    -0.63848  -1.0053
  0.10454  -0.42984   3.181    -0.62187   0.16819  -1.0139    0.064058
  0.57844  -0.4556    0.73783   0.37203  -0.57722   0.66441   0.055129
  0.037891  1.3275    0.30991   0.50697   1.2357    0.1274   -0.11434
  0.20709 ]
Cmd: quit
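
The similarity numbers above are cosine similarities between the two embedding vectors. A minimal sketch of that computation in plain Python (the 3-dimensional vectors here are invented for illustration, not real 50-dimensional GloVe embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" -- made up, for intuition only
v_cat = [0.2, 0.9, 0.1]
v_dog = [0.3, 0.8, 0.2]
v_car = [0.9, 0.1, 0.0]

print('cat~dog = %.2f' % cosine_similarity(v_cat, v_dog))
print('cat~car = %.2f' % cosine_similarity(v_cat, v_car))
```

Words that appear in similar contexts end up with vectors pointing in similar directions, so the cat/dog pair scores higher than cat/car.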
  
import gensim.downloader as api

# Download pre-trained embeddings (100- and 200-dimensional versions also available)
print('Loading the embeddings...')
embeds = api.load("glove-wiki-gigaword-50") # 66MB
# embeds = api.load("glove-twitter-25")  # 105MB, if you want Twitter-trained embeddings

while True:

    words = input('Cmd: ')
    tokens = words.strip().split()
    if not tokens:
        continue
    cmd, *parts = tokens

    try:
        # 'sim obama bush'
        if cmd == 'sim':
            sim = embeds.similarity(parts[0], parts[1])
            print('Sim between %s and %s = %.2f' % (parts[0], parts[1], sim))
        # 'mostsim obama'
        elif cmd == 'mostsim':
            print('Most similar', embeds.most_similar(parts[0], topn=10))
        # 'vec obama'
        elif cmd == 'vec':
            print('%s embedding: %s' % (parts[0], str(embeds[parts[0]])))
        elif cmd == 'exit' or cmd == 'quit':
            break
        else:
            print('Commands: sim|mostsim|vec')
    except KeyError:
        print('Word not in the vocabulary')
    except IndexError:
        print('Missing word argument(s)')
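
For intuition, `most_similar` effectively ranks every other word in the vocabulary by cosine similarity to the query word's vector and returns the top hits. A toy sketch over an invented 4-word vocabulary (the 3-d vectors are made up; real GloVe vectors are 50-d and the vocabulary has hundreds of thousands of words):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Invented toy embeddings
vocab = {
    'breyer': [0.1, 0.9, 0.3],
    'scalia': [0.2, 0.8, 0.4],
    'laptop': [0.9, 0.1, 0.2],
    'pizza':  [0.7, 0.2, 0.9],
}

def most_similar(word, topn=3):
    """Rank all other vocabulary words by cosine similarity to `word`."""
    query = vocab[word]
    scores = [(w, cosine(query, v)) for w, v in vocab.items() if w != word]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]

print(most_similar('breyer'))
```

The real gensim implementation does the same ranking with normalized vectors and a single matrix-vector product, which is why queries over a large vocabulary are still fast.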