You'll need to install the gensim library:
conda create -n word2vec        # create an environment for this
conda activate word2vec
conda install -c conda-forge gensim
Or if you're not using conda and want to risk a direct pip install:
pip3 install --user --upgrade gensim
This program loads 50-dimensional vector embeddings for words observed in Wikipedia and the Gigaword Corpus (news articles). It then lets the user issue command-line queries for word similarity and for the raw embedding of a word.
Cmd: sim obama bush
Sim between obama and bush = 0.96
Cmd: sim obama ginsburg
Sim between obama and ginsburg = 0.20
Cmd: mostsim ginsburg
Most similar [('breyer', 0.9350610971450806), ('souter', 0.8723862767219543), ('rehnquist', 0.8637943267822266), ('bader', 0.8579131960868835), ('scalia', 0.8438905477523804), ('dissented', 0.7967802882194519), ('justices', 0.7810587286949158), ("o'connor", 0.7615455389022827), ('alito', 0.7479578852653503), ('stevens', 0.7473843693733215)]
Cmd: sim computer laptop
Sim between computer and laptop = 0.77
Cmd: vec computer
computer embedding: [ 0.079084 -0.81504 1.7901 0.91653 0.10797 -0.55628 -0.84427 -1.4951 0.13418 0.63627 0.35146 0.25813 -0.55029 0.51056 0.37409 0.12092 -1.6166 0.83653 0.14202 -0.52348 0.73453 0.12207 -0.49079 0.32533 0.45306 -1.585 -0.63848 -1.0053 0.10454 -0.42984 3.181 -0.62187 0.16819 -1.0139 0.064058 0.57844 -0.4556 0.73783 0.37203 -0.57722 0.66441 0.055129 0.037891 1.3275 0.30991 0.50697 1.2357 0.1274 -0.11434 0.20709 ]
Cmd: quit
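The `sim` scores in the transcript above are cosine similarities between the two words' 50-dimensional vectors. As a minimal sketch (plain Python, no gensim needed) of what that computation looks like:

```python
import math

def cosine(u, v):
    # cosine similarity: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0
print(cosine([1.0, 2.0], [2.0, 4.0]))   # → 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))   # → 0.0
```

gensim's `similarity()` does this (vectorized) over the stored embeddings, which is why scores fall in [-1, 1] and near-synonyms like "computer"/"laptop" land close to 1.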
import gensim.downloader as api
# Download a pre-trained model (100-, 200-, and 300-dimensional versions also available)
print('Loading the embeddings...')
embeds = api.load("glove-wiki-gigaword-50") # 66MB
# embeds = api.load("glove-twitter-25") # 105MB, if you want Twitter-trained embeddings
while True:
    words = input('Cmd: ')
    if not words.strip():
        continue  # ignore blank input rather than crashing on the unpack below
    (cmd, *parts) = words.strip().split()
    # 'sim obama bush'
    if cmd == 'sim':
        sim = embeds.similarity(parts[0], parts[1])
        print('Sim between %s and %s = %.2f' % (parts[0], parts[1], sim))
    # 'mostsim obama'
    elif cmd == 'mostsim':
        print('Most similar', embeds.most_similar(parts[0], topn=10))
    # 'vec obama'
    elif cmd == 'vec':
        print('%s embedding: %s' % (parts[0], str(embeds[parts[0]])))
    elif cmd == 'exit' or cmd == 'quit':
        break
    else:
        print('Commands: sim|mostsim|vec')
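One caveat: all three commands raise KeyError if a word is not in the model's vocabulary. A sketch of a guard, using a toy dict as a stand-in for the loaded model (real gensim KeyedVectors also support the `in` test):

```python
# Toy stand-in for the loaded embeddings: word -> vector (hypothetical data)
toy_embeds = {'computer': [0.1, -0.8], 'laptop': [0.2, -0.7]}

def lookup(embeds, word):
    # Check membership instead of letting embeds[word] raise KeyError
    if word not in embeds:
        print('%s not in vocabulary' % word)
        return None
    return embeds[word]

print(lookup(toy_embeds, 'computer'))   # → [0.1, -0.8]
print(lookup(toy_embeds, 'qwerty123'))  # prints the warning, returns None
```

The same pattern drops into the command loop: check `parts[0] in embeds` (and `parts[1]` for `sim`) before calling `similarity`, `most_similar`, or indexing.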