Due: next week Oct 26
Now that we've covered basic models for neural NLP architectures, this lab will give you a chance to implement one of the more interesting models based on the Transformer network, called BERT. Most of today's NLP applications boil down to representing text in some high-dimensional space, so the challenge is to convert a passage of text into a single vector that somehow represents its meaning. BERT does just that.
Install the neural network libraries, PyTorch and Transformers, along with other helpful tools:
pip3 install --user transformers[tf-cpu] torchvision pandas seaborn scikit-learn gdown nltk
Set up your PATH to point to these newly installed tools:
echo "export PATH=\$PATH:$HOME/.local/bin" >> ~/.bashrc
source ~/.bashrc
Download the reviews dataset that this lab uses:
gdown --id 1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv
You should now see reviews.csv in your current directory.
Start a new program and name it mybert.py
Your only goal in this part is to figure out how to load a pre-trained BERT model and push an example sentence through BERT to output a sentence embedding. Let's recall the BERT pipeline:
Each sentence is prefixed with a CLS token, and that token's final-layer representation ultimately represents the entire sentence's meaning. We want to grab that final CLS embedding from the output. See the picture to the right for the short sentence input "hey you".
First off, BERT was trained on a large corpus of text, so it obviously has its own tokenizer that converts tokens into unique IDs. You need to make sure you convert your own sentence(s) into IDs so that BERT knows what you're inputting. We just load the tokenizer from disk like this:
from transformers import BertTokenizer, BertModel, BertConfig
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
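To see what the tokenizer produces, here is a toy illustration of the token-to-ID mapping. The vocabulary below is made up for this sketch; the real bert-base-uncased tokenizer uses a WordPiece vocabulary of roughly 30,000 entries and also handles subword splitting, which this sketch ignores:

```python
# Toy illustration only: a tiny invented vocabulary standing in for
# BERT's real WordPiece vocab of ~30,000 entries.
toy_vocab = {"[CLS]": 101, "[SEP]": 102,
             "this": 2023, "is": 2003, "a": 1037, "sentence": 6251}

def toy_encode(sentence):
    # BERT's tokenizer wraps every input in [CLS] ... [SEP] before lookup.
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    return [toy_vocab[t] for t in tokens]

print(toy_encode("this is a sentence"))
# -> [101, 2023, 2003, 1037, 6251, 102]
```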
Next, we want to load the BERT model itself with all its glorious pre-trained parameters:
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
model = BertModel.from_pretrained('bert-base-uncased', config=config)
Now that we have a tokenizer and the model itself, we just have to know how to use the API to extract the things that we need. Notice above how we set the configuration to "output_hidden_states" for us. See the picture above again; we want to grab that first vector from the final layer of hidden states.
In the following code, I tokenize a single sentence (the tokenizer also accepts a list of sentences). I then call the model on that input, which pushes the sentence(s) through BERT and returns all hidden states in the outputs variable.
inputs = tokenizer("this is a sentence", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
The above is all you need. The last_hidden_states variable holds the final BERT layer for each input sentence. If you input a list of N sentences, then you have N final layers here. If you just did one sentence like my example here, then there is only one, and the first element is the final layer of that one sentence. What is in this final layer? It's the output states from each token at the top level.
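The indexing logic can be illustrated with a numpy stand-in of the same shape. Random values here, purely to show the dimensions; in reality last_hidden_state is a torch tensor filled in by BERT:

```python
import numpy as np

# Stand-in for outputs.last_hidden_state: in reality a torch tensor of
# shape (num_sentences, num_tokens, 768). Random values, indexing only.
num_sentences, num_tokens, hidden_size = 1, 6, 768
last_hidden_states = np.random.rand(num_sentences, num_tokens, hidden_size)

# Sentence 0, token 0 (the CLS position) -> one 768-length vector.
cls_embedding = last_hidden_states[0][0]
print(cls_embedding.shape)  # -> (768,)
```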
Your input tokens come in the bottom, and the same number are output at the top. That's your final layer, where each element of the layer is an embedding of the input token, but contextualized with the entire sentence. Which part of the final layer do we want for our sentence? There are a couple options:
For this lab, take Option 1. Remember that every sentence you tokenize gets an additional "CLS" token prepended by BERT's tokenizer (you don't have to input it) and a "SEP" token appended at the end. With Option 1, your sentence embedding is just the CLS output. Below is the start of the 768-length output of my CLS embedding for "Hello I am delighted":
tensor([ 3.0961e-02, 4.6253e-01, 8.3626e-02, -2.9063e-01, -2.6032e-01,
-4.6076e-01, 1.8929e-01, 5.3516e-01, -9.3427e-03, -3.0911e-01,
-1.0907e-01, -2.5538e-01, 3.4611e-01, 4.9897e-01, 1.0202e-01,
-1.9622e-01, -3.2630e-01, 3.8475e-01, 3.2745e-01, -2.2837e-02,
-2.0363e-01, -4.8284e-01, -1.5192e-01, 5.3714e-02, -1.5668e-01, ...
Write a function that takes one sentence, and returns the above vector. For this part, also print it out. Run a test for "Hello I am delighted" and match my above output.
After Part 1, you should now have a simple function -- input a sentence, output an embedding. And now that you can convert sentences into BERT embeddings, you can make a useful data exploration program.
Create a new program called cluster.py
Your task is to take the short online reviews about a phone app, and cluster them to discover the main topics that reviewers discuss. Pretend you're an analyst for a startup company, and you need to know how your app is doing.
Download my data.py file that contains two functions:
reviews = get_review_sentences(N=100) # returns 100 review sentences as a List of strings
# OPTION 1: Runs k-means cluster with hard-coded k-clusters
labels = kmeans(your_numpy_matrix, k=10) # labels each row with a cluster ID
# OPTION 2: Runs agglomerative clustering that discovers the # of clusters
agg = sklearn.cluster.AgglomerativeClustering(n_clusters=None, distance_threshold=0.15, affinity="cosine", linkage="average", compute_distances=True)
labels = agg.fit(your_numpy_matrix).labels_
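Here is a minimal end-to-end run on synthetic data to show the fit-then-labels_ flow. It uses simplified settings (a fixed n_clusters with the default euclidean/ward distance) rather than the cosine/threshold configuration above, and the 2-D points are stand-ins for your BERT embeddings:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two obvious synthetic groups of 2-D points standing in for embeddings.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Simplified settings just to show the API; the lab's call above instead
# discovers the number of clusters from a distance threshold.
agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit(X).labels_
print(labels)  # one cluster ID per row; the first three rows share one ID
```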
You have everything you need to create a clustering program. Your Part 1 converts text into vectors, and I just showed you two clustering algorithms. The only missing piece is the technical bit of creating that numpy matrix for the clustering functions:
# Create an empty matrix NxM where M is the length of your embeddings and you have N samples.
your_numpy_matrix = np.empty(shape=(N,M))
# Now fill in all the rows with your BERT embeddings
for i in range(N):
your_numpy_matrix[i] = ______
Given the above, put the pieces together and write a program that clusters the review text. Your program then needs to print all clusters in order like this:
***** Cluster 7 *****
"The app felt good enough for me to buy the yearly subscription
"This App is useless
"Thanks to the developer for your quick response
"This was originally a 5 star app
"Hey ...
All clusters should print out, and you should pipe your output to a file called clusters.txt.
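One way to produce that grouped printout is to collect sentences by label first. The sentences and labels below are hypothetical stand-ins for your get_review_sentences() output and your clustering labels, aligned by index:

```python
from collections import defaultdict

# Hypothetical stand-ins: 'sentences' from get_review_sentences(),
# 'labels' from your clustering call, aligned by index.
sentences = ["Love this app", "Crashes constantly", "Great design"]
labels = [0, 1, 0]

# Group each sentence under its cluster ID, then print clusters in order.
clusters = defaultdict(list)
for sentence, label in zip(sentences, labels):
    clusters[label].append(sentence)

for cluster_id in sorted(clusters):
    print(f"***** Cluster {cluster_id} *****")
    for sentence in clusters[cluster_id]:
        print(sentence)
```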
Run your program on 1000 reviews with either agglomerative clustering or kmeans using 10 clusters.
After clustering and printing them all out, prompt the user asking for a cluster ID. Upon entry, create a word cloud that shows the words from the reviews in their cluster. Make this a loop, so after they close the word cloud, you prompt again and they can continually look at clouds.
You just need the WordCloud library:
pip3 install --user wordcloud
Here is how to make and show a word cloud with Python:
# Don't forget your library imports!
import matplotlib.pyplot as plt # we had this one before
from wordcloud import WordCloud # new for WordCloud
# The cloud!
doc = "This is a long string with happy words to put in a visual word cloud...blah blah...it makes repeated words bigger than single occurrence words. It splits all the words for you, easy peasy."
cloud = WordCloud(width=480, height=480, margin=0).generate(doc) # 'doc' is the constructed tweet string
# Now popup the display of our generated cloud image.
plt.imshow(cloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()
Tip! Make your clouds more useful by removing a few of the common words that appear in all these reviews and clusters, like "app" and "task".
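A simple way to do that is to filter the review text before handing it to WordCloud. The stopword choices and the doc string here are just examples:

```python
# Example stopwords to drop before building the cloud; pick yours from
# whatever dominates your own clusters.
extra_stopwords = {"app", "task"}

doc = "this app is great but the app crashes on every task"
filtered_doc = " ".join(w for w in doc.split() if w not in extra_stopwords)
print(filtered_doc)  # -> "this is great but the crashes on every"
```

WordCloud also accepts a stopwords= set argument if you prefer to let the library do the filtering.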
mybert.py, cluster.py, clusters.txt
Did you run this on 1000 reviews and 10 clusters?
Upload all to our submission webpage.
Login, select SI425 and Lab 6. Click your name and use the 'Upload Submission' option in the menu.