Due: next week Oct 26
Now that we've covered basic models for neural NLP architectures, this lab will give you a chance to implement one of the more interesting models based on the Transformer network, called BERT. Most of today's NLP applications boil down to representing text in some high-dimensional space, so the challenge is to convert a passage of text into a single vector that somehow represents its meaning. BERT does just that.
Install the neural network libraries, PyTorch and Transformers, along with other helpful tools:
pip3 install --user transformers[tf-cpu] torchvision pandas seaborn scikit-learn gdown nltk
Set up your PATH to point to these newly installed tools:
echo "export PATH=\$PATH:$HOME/.local/bin" >> ~/.bashrc
source ~/.bashrc
Download the reviews dataset that this lab uses:
gdown --id 1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv
You should now see reviews.csv in your current directory.
Start a new program and name it mybert.py
Your only goal in this part is to figure out how to load a pre-trained BERT model and push an example sentence through BERT to output a sentence embedding. Let's recall the BERT pipeline:
Each sentence is prefixed with a CLS token, and that token's final-layer representation ultimately represents the entire sentence's meaning. We want to grab that final CLS embedding from the output. See the picture to the right for the short sentence input "hey you".
First off, BERT was trained on a large corpus of text, so it obviously has its own tokenizer that converts tokens into unique IDs. You need to make sure you convert your own sentence(s) into IDs so that BERT knows what you're inputting. We just load the tokenizer from disk like this:
from transformers import BertTokenizer, BertModel, BertConfig
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
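To see what the tokenizer produces, here is a toy illustration of the token-to-ID mapping. The vocabulary below is made up for this sketch; the real bert-base-uncased tokenizer uses a WordPiece vocabulary of roughly 30,000 entries and also handles subword splitting, which this sketch ignores:

```python
# Toy illustration only: a tiny invented vocabulary standing in for
# BERT's real WordPiece vocab of ~30,000 entries.
toy_vocab = {"[CLS]": 101, "[SEP]": 102,
             "this": 2023, "is": 2003, "a": 1037, "sentence": 6251}

def toy_encode(sentence):
    # BERT's tokenizer wraps every input in [CLS] ... [SEP] before lookup.
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    return [toy_vocab[t] for t in tokens]

print(toy_encode("this is a sentence"))
# -> [101, 2023, 2003, 1037, 6251, 102]
```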
Next, we want to load the BERT model itself with all its glorious pre-trained parameters:
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
model = BertModel.from_pretrained('bert-base-uncased', config=config)
Now that we have a tokenizer and the model itself, we just have to know how to use the API to extract the things that we need. Notice above how we set the configuration to "output_hidden_states" for us. See the picture above again; we want to grab that first vector from the final layer of hidden states.
In the following code, I tokenize a single sentence (the tokenizer also accepts a list of sentences). I then call the model on that input, which pushes the sentence(s) through BERT and returns all hidden states in the outputs variable.
inputs = tokenizer("this is a sentence", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
The above is all you need. The last_hidden_states variable holds the final BERT layer for each input sentence. If you input a list of N sentences, then you have N final layers here. If you just did one sentence like my example here, then there is only one, and the first element is the final layer of that one sentence. What is in this final layer? It's the output states from each token at the top level.
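The indexing logic can be illustrated with a numpy stand-in of the same shape. Random values here, purely to show the dimensions; in reality last_hidden_state is a torch tensor filled in by BERT:

```python
import numpy as np

# Stand-in for outputs.last_hidden_state: in reality a torch tensor of
# shape (num_sentences, num_tokens, 768). Random values, indexing only.
num_sentences, num_tokens, hidden_size = 1, 6, 768
last_hidden_states = np.random.rand(num_sentences, num_tokens, hidden_size)

# Sentence 0, token 0 (the CLS position) -> one 768-length vector.
cls_embedding = last_hidden_states[0][0]
print(cls_embedding.shape)  # -> (768,)
```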
Your input tokens come in the bottom, and the same number are output at the top. That's your final layer, where each element of the layer is an embedding of the input token, but contextualized with the entire sentence. Which part of the final layer do we want for our sentence? There are a couple options:
For this lab, take Option 1. Remember that every sentence you tokenize gets an additional "CLS" token prepended by BERT's tokenizer (you don't have to input it) and a "SEP" token appended at the end. With Option 1, your sentence embedding is just the CLS output. Below is the start of the 768-length output of my CLS embedding for "Hello I am delighted":
tensor([ 3.0961e-02, 4.6253e-01, 8.3626e-02, -2.9063e-01, -2.6032e-01,
-4.6076e-01, 1.8929e-01, 5.3516e-01, -9.3427e-03, -3.0911e-01,
-1.0907e-01, -2.5538e-01, 3.4611e-01, 4.9897e-01, 1.0202e-01,
-1.9622e-01, -3.2630e-01, 3.8475e-01, 3.2745e-01, -2.2837e-02,
-2.0363e-01, -4.8284e-01, -1.5192e-01, 5.3714e-02, -1.5668e-01, ...
Write a function that takes one sentence, and returns the above vector. For this part, also print it out. Run a test for "Hello I am delighted" and match my above output.
After Part 1, you should now have a simple function -- input a sentence, output an embedding. And now that you can convert sentences into BERT embeddings, you can make a useful data exploration program.
Create a new program called cluster.py
Your task is to take the short online reviews about a phone app, and cluster them to discover the main topics that reviewers discuss. Pretend you're an analyst for a startup company, and you need to know how your app is doing.
Download my data.py file that contains two functions:
reviews = get_review_sentences(N=100) # returns 100 review sentences as a List of strings
# OPTION 1: Runs k-means cluster with hard-coded k-clusters
labels = kmeans(your_numpy_matrix, k=10) # labels each row with a cluster ID
# OPTION 2: Runs agglomerative clustering that discovers the # of clusters
agg = sklearn.cluster.AgglomerativeClustering(n_clusters=None, distance_threshold=0.15, affinity="cosine", linkage="average", compute_distances=True)
labels = agg.fit(your_numpy_matrix).labels_
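Here is a minimal end-to-end run on synthetic data to show the fit-then-labels_ flow. It uses simplified settings (a fixed n_clusters with the default euclidean/ward distance) rather than the cosine/threshold configuration above, and the 2-D points are stand-ins for your BERT embeddings:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two obvious synthetic groups of 2-D points standing in for embeddings.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Simplified settings just to show the API; the lab's call above instead
# discovers the number of clusters from a distance threshold.
agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit(X).labels_
print(labels)  # one cluster ID per row; the first three rows share one ID
```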
You have everything you need to create a clustering program. Your Part 1 converts text into vectors, and I just showed you two clustering algorithms. The only missing piece is the technical bit of creating that numpy matrix for the clustering functions:
# Create an empty matrix NxM where M is the length of your embeddings and you have N samples.
your_numpy_matrix = np.empty(shape=(N,M))
# Now fill in all the rows with your BERT embeddings
for i in range(N):
your_numpy_matrix[i] = ______
Given the above, put the pieces together and write a program that clusters the review text. Your program then needs to print all clusters in order like this:
***** Cluster 7 *****
"The app felt good enough for me to buy the yearly subscription
"This App is useless
"Thanks to the developer for your quick response
"This was originally a 5 star app
"Hey ...
All clusters should print out, and you should pipe your output to a file called clusters.txt.
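One way to produce that grouped printout is to collect sentences by label first. The sentences and labels below are hypothetical stand-ins for your get_review_sentences() output and your clustering labels, aligned by index:

```python
from collections import defaultdict

# Hypothetical stand-ins: 'sentences' from get_review_sentences(),
# 'labels' from your clustering call, aligned by index.
sentences = ["Love this app", "Crashes constantly", "Great design"]
labels = [0, 1, 0]

# Group each sentence under its cluster ID, then print clusters in order.
clusters = defaultdict(list)
for sentence, label in zip(sentences, labels):
    clusters[label].append(sentence)

for cluster_id in sorted(clusters):
    print(f"***** Cluster {cluster_id} *****")
    for sentence in clusters[cluster_id]:
        print(sentence)
```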
Run your program on 1000 reviews with either agglomerative clustering or kmeans using 10 clusters.
After clustering and printing them all out, prompt the user asking for a cluster ID. Upon entry, create a word cloud that shows the words from the reviews in their cluster. Make this a loop, so after they close the word cloud, you prompt again and they can continually look at clouds.
You just need the WordCloud library:
pip3 install --user wordcloud
Here is how to make and show a word cloud with Python:
# Don't forget your library imports!
import matplotlib.pyplot as plt # we had this one before
from wordcloud import WordCloud # new for WordCloud
# The cloud!
doc = "This is a long string with happy words to put in a visual word cloud...blah blah...it makes repeated words bigger than single occurrence words. It splits all the words for you, easy peasy."
cloud = WordCloud(width=480, height=480, margin=0).generate(doc) # 'doc' is the constructed tweet string
# Now popup the display of our generated cloud image.
plt.imshow(cloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()
Tip! Make your clouds more useful by removing a few of the common words that appear in all these reviews and clusters, like "app" and "task".
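A simple way to do that is to filter the review text before handing it to WordCloud. The stopword choices and the doc string here are just examples:

```python
# Example stopwords to drop before building the cloud; pick yours from
# whatever dominates your own clusters.
extra_stopwords = {"app", "task"}

doc = "this app is great but the app crashes on every task"
filtered_doc = " ".join(w for w in doc.split() if w not in extra_stopwords)
print(filtered_doc)  # -> "this is great but the crashes on every"
```

WordCloud also accepts a stopwords= set argument if you prefer to let the library do the filtering.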
mybert.py, cluster.py, clusters.txt
Did you run this on 1000 reviews and 10 clusters?
Upload all to our submission webpage.
Login, select SI425 and Lab 6. Click your name and use the 'Upload Submission' option in the menu.