Lab 6: BERT

Due: next week, Oct 24

Motivation

Now that we've covered basic models for neural NLP architectures, this lab will give you a chance to implement one of the more interesting models based on the Transformer network, called BERT. Most of today's NLP applications boil down to representing text in some high-dimensional space, so the challenge is to convert a passage of text into a single vector that somehow represents its meaning. BERT does just that.

Generative AI: GenAI tools are permitted as code assistants on this lab (but you are not required or expected to use them). Your grade on this lab is based on your ability to code within the given frameworks, and to achieve the lab outcomes. All GenAI usage must be cited in the code through clear code comments.

Installation and Starter Code

Install the neural network libraries, PyTorch and Transformers, along with other helpful tools:

conda create -n bert
conda activate bert	  
conda install pytorch torchvision transformers pandas seaborn scikit-learn nltk conda-forge::gdown

Set up your PATH to point to these new tools:

echo "export PATH=\$PATH:$HOME/.local/bin" >> ~/.bashrc
source ~/.bashrc

Download the reviews dataset that this lab uses:

gdown --id 1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv

You should now see reviews.csv in your current directory.

Part 1: Create a sentence embedding (50%)

Start a new program and name it mybert.py.

Your only goal in this part is to figure out how to load a pre-trained BERT model and push an example sentence through BERT to output a sentence embedding. Let's recall the BERT pipeline:

Each sentence is prefixed with a CLS token, and that final token's transformed representation ultimately represents the entire sentence's meaning. We want to grab that final CLS embedding output. See the picture to the right for the short sentence input "hey you".

First off, BERT was trained on a large corpus of text, so it has its own tokenizer that converts tokens into unique IDs. You need to convert your own sentence(s) into those IDs so that BERT knows what you're inputting. We just load the tokenizer from disk like this:

from transformers import BertTokenizer, BertModel, BertConfig
	    
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Next, we want to load the BERT model itself with all its glorious pre-trained parameters:

config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
model = BertModel.from_pretrained('bert-base-uncased', config=config)

Now that we have a tokenizer and the model itself, we just have to know how to use the API to extract the things we need. Notice above how we set output_hidden_states in the configuration. See the picture above again; we want to grab that first vector from the final layer of hidden states.

In the following code, I tokenize a single sentence (the tokenizer also accepts a list of sentences). I then call the model on that input; the sentence(s) are pushed through BERT, and all hidden states are returned in the outputs variable.

inputs = tokenizer(["this is a sentence"], padding=True, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state[0] # [0] grabs the first sentence in the given list

The above is all you need. Notice the input list around "this is a sentence": you can give the tokenizer multiple sentences at once, and it efficiently processes all of them in one batch. The outputs.last_hidden_state variable contains one final BERT layer per input sentence, so indexing with [0] grabs the final layer of the first sentence. (If you give just one sentence, as in my example, the batch has size one, and that first element is the final layer of your one sentence.) What is in this final layer? The output states from each token at the top level.
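If the shapes are confusing, here is a tiny sketch using a random tensor as a stand-in for BERT's real output on a six-token input (six tokens would correspond to "[CLS] this is a sentence [SEP]"):

```python
import torch

# Stand-in for outputs.last_hidden_state: (batch of 1 sentence, 6 tokens, hidden size 768).
fake_last_hidden_state = torch.randn(1, 6, 768)

# [0] grabs the first sentence's final layer: one 768-length vector per token.
sentence_layer = fake_last_hidden_state[0]
print(sentence_layer.shape)   # torch.Size([6, 768])

# The CLS embedding is the first token's vector in that layer.
cls_embedding = sentence_layer[0]
print(cls_embedding.shape)    # torch.Size([768])
```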

Your input tokens come in the bottom, and the same number are output at the top. That's your final layer, where each element of the layer is an embedding of the input token, but contextualized with the entire sentence. Which part of the final layer do we want for our sentence? There are a couple options:

  1. Use just the first embedding that represents the input's 'CLS' token. The original BERT paper suggested this.
  2. Average all token embeddings together (mean pooling). This is often done now.

For this lab, take Option 1. Remember that the tokenizer automatically prefixes every sentence with an additional "CLS" token and ends it with a "SEP" token (you don't have to input them yourself), so your sentence embedding is just the CLS output. Below is the start of the 768-length vector of my CLS embedding for "Hello I am delighted":

tensor([ 3.0961e-02,  4.6253e-01,  8.3626e-02, -2.9063e-01, -2.6032e-01,
        -4.6076e-01,  1.8929e-01,  5.3516e-01, -9.3427e-03, -3.0911e-01,
        -1.0907e-01, -2.5538e-01,  3.4611e-01,  4.9897e-01,  1.0202e-01,
        -1.9622e-01, -3.2630e-01,  3.8475e-01,  3.2745e-01, -2.2837e-02,
        -2.0363e-01, -4.8284e-01, -1.5192e-01,  5.3714e-02, -1.5668e-01, ...

Write a function that takes one sentence, and returns the above vector. For this part, also print it out. Run a test for "Hello I am delighted" and match my above output. Notice my output is one tensor [] not multiple brackets [[ ]]. You want to grab just the CLS token from the front.

Part 2: Clustering sentences with BERT (95%)

After Part 1, you should now have a simple function -- input a sentence, output an embedding. And now that you can convert sentences into BERT embeddings, you can make a useful data exploration program.

Create a new program called cluster.py.

Your task is to take the short online reviews about a phone app, and cluster them to discover the main topics that reviewers discuss. Pretend you're an analyst for a startup company, and you need to know how your app is doing.

Download my data.py file that contains two functions:

reviews = get_review_sentences(N=100)    # returns 100 review sentences as a List of strings

# OPTION 1: Runs k-means cluster with hard-coded k-clusters
labels = kmeans(your_numpy_matrix, k=10)   # labels each row with a cluster ID

# OPTION 2: Runs agglomerative clustering that discovers the # of clusters
# (note: scikit-learn 1.2+ renamed affinity to metric, so pass metric="cosine" there)
import sklearn.cluster
agg = sklearn.cluster.AgglomerativeClustering(n_clusters=None, distance_threshold=0.15, affinity="cosine", linkage="average", compute_distances=True)
labels = agg.fit(your_numpy_matrix).labels_
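As a quick sanity check before wiring in BERT, you can try clustering on a toy matrix. This sketch calls scikit-learn's KMeans directly rather than the kmeans helper in my data.py; the two well-separated groups of points should land in different clusters:

```python
import numpy as np
import sklearn.cluster

# Two obvious groups of points: one near the origin, one near (10, 10).
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])

km = sklearn.cluster.KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit(toy).labels_
print(labels)   # points 0-2 share one label, points 3-5 the other
```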

You have everything you need to create a clustering program. Your Part 1 converts text into vectors, and I just showed you two clustering algorithms. The only missing piece is the technical bit of creating that numpy matrix for the clustering functions:

import numpy as np

# Create an empty matrix NxM where M is the length of your embeddings and you have N samples.
your_numpy_matrix = np.empty(shape=(N, M))
# Now fill in all the rows with your BERT embeddings.
for i in range(N):
  your_numpy_matrix[i] = ______

You're building a numpy matrix above, so each vector you put in must be a numpy vector. One last detail is that your BERT vectors are coming out with PyTorch, so you'll get a type mismatch. You can convert PyTorch to numpy like:

numpyvec = vec.detach().numpy()
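Put together, the matrix-filling loop might look like this sketch, where torch.randn stands in for your real CLS embeddings from Part 1:

```python
import numpy as np
import torch

N, M = 5, 768                       # 5 sentences, 768-length BERT embeddings

your_numpy_matrix = np.empty(shape=(N, M))
for i in range(N):
    vec = torch.randn(M)            # stand-in for your BERT CLS embedding
    your_numpy_matrix[i] = vec.detach().numpy()

print(your_numpy_matrix.shape)      # (5, 768)
```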

Given the above, put the pieces together and write a program that clusters the review text. Your program then needs to print all clusters in order like this:

***** Cluster 7 *****
content
"The app felt good enough for me to buy the yearly subscription
"This App is useless 
"Thanks to the developer for your quick response
"This was originally a 5 star app
"Hey 
..."

All clusters should print out, and you should pipe your output to a file called clusters.txt.

Run your program on 1000 reviews with either agglomerative clustering or kmeans using 10 clusters.

Part 3: Word cloud (100%)

After clustering and printing them all out, prompt the user for a cluster ID. Upon entry, create a word cloud that shows the words from the reviews in that cluster. Make this a loop: after the user closes the word cloud, prompt again so they can continually look at clouds. The interaction should look like this:

python3 cluster.py
***** Cluster 0 *****
...
***** Cluster 9 *****
...

Enter a cluster number [0-9] to view as a cloud: 3
Enter a cluster number [0-9] to view as a cloud: 7
Enter a cluster number [0-9] to view as a cloud: quit
Goodbye!
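The prompt loop can be sketched like this (parse_choice and prompt_loop are my own names; show_cloud stands for whatever function wraps the WordCloud code shown below). Here any input that isn't a valid cluster number, including "quit", ends the loop:

```python
def parse_choice(text, n_clusters):
    """Return a valid cluster ID, or None if the user wants to quit."""
    text = text.strip()
    if not text.isdigit() or not (0 <= int(text) < n_clusters):
        return None
    return int(text)

def prompt_loop(n_clusters, show_cloud):
    while True:
        text = input("Enter a cluster number [0-%d] to view as a cloud: " % (n_clusters - 1))
        choice = parse_choice(text, n_clusters)
        if choice is None:
            print("Goodbye!")
            break
        show_cloud(choice)
```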

You just need the WordCloud library:

pip3 install --user wordcloud

Here is how you create and show one word cloud with Python:

# Don't forget your library imports!
import matplotlib.pyplot as plt   # we had this one before
from wordcloud import WordCloud   # new for WordCloud

# The cloud!
doc = "This is a long string with happy words to put in a visual word cloud...blah blah...it makes repeated words bigger than single occurrence words. It splits all the words for you, easy peasy."
cloud = WordCloud(width=480, height=480, margin=0).generate(doc)    # 'doc' is the constructed review string
          
# Now popup the display of our generated cloud image.
plt.imshow(cloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()

Tip! Make your clouds more useful by removing a few of the common words that appear in all these reviews and clusters, like "app" and "task".

What to turn in

clusters.txt mybert.py cluster.py

Did you run this on 1000 reviews and 10 clusters?

readme.txt: look at your clusters. For many of them it is difficult to tell what the theme is, but others are clearer. Find a few decent clusters (at least 2), copy a few sentences from those clusters into your readme, and give a title to each cluster. If you can't find any good ones and you used kmeans clustering, increase k beyond 10 to get more fine-grained clusters.

How to turn in

Upload all to our submission webpage.

submit -c=SI425 -p=lab06 readme.txt clusters.txt mybert.py cluster.py