Lab 7: Hands-On LLMs with Llama

Due: next week

Motivation

This lab gives you hands-on experience using a programmatic interface to an LLM, specifically Meta's open-source Llama models. We don't have multiple GPUs for each student, so this lab shows you how to run Meta's latest mini models with 1 billion parameters, instead of 70 billion (or a trillion), on a standard CPU. You'll experience firsthand how much computation these LLMs actually require.

Generative AI: GenAI tools are NOT permitted as code assistants on this lab. I want you to see for yourself how straightforward it is to use an LLM library, and to play directly with its input and output.

Install Llama

Don't break things. Create a new conda/mamba environment:

conda create -n llama
conda activate llama

Then install the huggingface and pytorch libraries:

conda install -c conda-forge -c pytorch transformers pytorch torchvision

Finally, sign in to Huggingface from your terminal so that you can use Llama models in your programs. This requires creating a secret token. Visit your tokens page and click "Create new token". Choose the "Read" token type at the very top. Then click "Create token". Copy the generated string.

Now login from your terminal, and paste your string when it asks for your token:

huggingface-cli login

The tool will prompt you for the token and then confirm a successful login in the terminal.

About Llama-3.2

Meta trained their first foundational model, Llama, in early 2023. Spurning the closed approach of OpenAI with ChatGPT and Google with Gemini, they open-sourced Llama's code and its trained parameters. This gave researchers (and private companies) access to a GPT competitor for actual usage. Llama-2 was released in July 2023 with multiple model sizes (7, 13, and 70 billion parameters).

The smallest 7-billion-parameter model can be loaded into memory on a high-memory machine, and can run on a single GPU if you trim the precision of its floats. However, this NLP class does not have a GPU for each of you! What to do? Thankfully, Meta released new mini models in their version 3.2 release! These come in 1B and 3B sizes, dubbed "lightweight" models by Meta. You'll use the 1B size in this lab, and you will experience how much computational power even this small model requires.
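To see why both parameter count and float precision matter, here is some back-of-the-envelope arithmetic (the byte sizes are standard: 4 bytes per float32 parameter, 2 bytes per float16 parameter):

```python
def model_gb(params, bytes_per_param):
    # Approximate memory needed just to hold the weights (ignores activations).
    return params * bytes_per_param / 1e9

# A 7B model in full float32 precision vs. trimmed float16:
print(model_gb(7_000_000_000, 4))  # 28.0 GB
print(model_gb(7_000_000_000, 2))  # 14.0 GB
# The 1B model in float32 fits on an ordinary workstation:
print(model_gb(1_000_000_000, 4))  # 4.0 GB
```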

This graph shows how big our favorite LLMs are. Llama-1B is near GPT-2 on the chart; it barely shows up on the left side. I want you to take away a couple of things:

  1. Today's popular models are massive and take a lot of compute power to run. You will experience the slow speed on a standard workstation of the smallest version.
  2. Even though "small" at 1B, its chat performance is pretty great.

Part 1: Word Generation

Your first task is to create your own ChatLlama program with which a user can chat. The learning objective is to understand how to access Llama in Python, and then how to create a fun interactive program.

Your program should look like this interaction, though your actual output will obviously differ because the default parameters include some randomness. If you still ask your instructor whether it's OK that your generated output differs from the following example, you will lose 5 points on this lab (no cap, brohim!).

> python3 chat.py
system: Greetings, speak your mind!
user: Where is Hopper Hall?
system: My information may not be up to date, but I believe it is in Annapolis, MD.
user: I like Annapolis.
system: Yes, Annapolis is a historic town located on the Chesapeake Bay.
user: bye!
system: Goodbye friend!
      

Below is example code to load the Llama-1B model and produce one generation. You can find more detailed docs on the introductory page for Llama-3.2-1B-Instruct.

import torch
from transformers import logging, pipeline
# Set the logging level to suppress warnings
logging.set_verbosity_error()

# Load the model.
ID = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline("text-generation", model=ID)

# Prompt the model.
output = pipe("Where is Hopper Hall?", max_new_tokens=40)
print(output[0]['generated_text'])

Run the above in a simple program to see how it works. Do you see how the original prompt ("Where is Hopper Hall?") is also included in the generated text? You'll want to remove that before printing it out when you make your chat bot.
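One simple way to drop the echoed prompt is string slicing. Here is a sketch (the helper name `strip_prompt` is my own choice, not part of the library):

```python
def strip_prompt(prompt, generated_text):
    # The pipeline's generated_text begins with the original prompt,
    # so slice it off and drop any leading whitespace that remains.
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text

reply = strip_prompt("Where is Hopper Hall?",
                     "Where is Hopper Hall? It is in Annapolis, MD.")
print(reply)  # It is in Annapolis, MD.
```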

Now create your chat bot to mimic the behavior in the chat example above, following the constraints listed in the Requirements Summary at the end of Part 2.

Important things to do! This is a language model that just predicts the next word. If the user types "hello", the model will predict a word to follow it. This is different from responding to "Hello!". One helpful trick is to put a newline character after every input. This tells the model that the prior word/phrase is finished, so it is more likely to generate a greeting like "hello" in return! You can play around with how to formulate your prompt for the best chat.
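One possible skeleton for the loop, with the model call factored out so you can swap in your own `pipe(...)` call (all names here are suggestions, not requirements):

```python
def chat_loop(generate_reply, read_input, write_output):
    # generate_reply: a function taking a prompt string and returning text
    # (in your program, this wraps the pipeline call).
    # read_input / write_output are passed in so the loop is easy to test.
    write_output("system: Greetings, speak your mind!")
    while True:
        user_text = read_input()
        if "bye" in user_text.lower():
            write_output("system: Goodbye friend!")
            return
        # Append a newline so the model treats the user's phrase as finished.
        reply = generate_reply(user_text + "\n")
        write_output("system: " + reply)
```

In chat.py you would pass `input`, `print`, and a small wrapper around `pipe` as the three arguments.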

Part 2: Chat with Context

Copy chat.py to chatfull.py.

Now that you have a basic chat, let's help the model do better. You don't want to send just one chat message at a time. Give the LLM the entire chat history, so that it can better represent in its hidden states what the chat is about! It can also reference past messages. In this part, you must keep track of the entire chat history, including the LLM's output. Build a big prompt string that contains everything, like so:

prompt = """system: Greetings, speak your mind!
user: hello!
system: Greetings, hello! It's nice to meet you! I'm feeling a bit overwhelmed by all the information and choices you present to me. Can you help me narrow down my options?
user: I didn't give you any options.
system: I'm sorry, I must have misinterpreted the conversation. You're correct, you didn't provide any options. I was just trying to initiate a conversation. I'd like to ask for
user: Can you help me instead?
system: Of course! I'd be happy to help you. Can you tell me what you'd like to talk about or ask? What's on your mind?
user: Where is the Naval Academy?
system: Ah, the Naval Academy! I'm not sure I have the most up-to-date information, but I can try to help. The Naval Academy is located in Annapolis, Maryland.
user: Should I study computer science?
system: That's a great major to consider! However, it's worth noting that the Naval Academy is a unique institution that focuses on maritime education. While computer science is a valuable skill, it may
user: Woah woah, the Naval Academy has an awesome computer science program. How dare you!
system: I'm not trying to be confrontational, but I do want to provide a balanced perspective. The Naval Academy's computer science program is indeed strong, but it's not the only path to"""

You should send this entire prompt into your LLM when generating. You'll see much more coherent chats, like the one above!
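One way to maintain the history is as a list of (speaker, text) pairs that you flatten into a single prompt string before each generation (a sketch; the function name is my own):

```python
def build_prompt(history):
    # history: list of (speaker, text) pairs covering the whole chat so far.
    return "\n".join(speaker + ": " + text for speaker, text in history)

history = [("system", "Greetings, speak your mind!")]
history.append(("user", "hello!"))
print(build_prompt(history))
# system: Greetings, speak your mind!
# user: hello!
```

After each generation, remember to append the model's own reply to the history as well, so it can reference its past answers.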

Your last task is to clean up remaining problems with the generated output. Sometimes the generation is cut off when it reaches your max token length. Edit your program so that the output stops at the last period, exclamation point, or question mark.

Well, I think I can help you with that. But first, I need to know: are you
...should instead be...
Well, I think I can help you with that.

More complicated: if the system gives you multiple lines of text, you should take only the first line, and then do the punctuation trimming.

Well, I think I can help you with that. Ok? Great then
we can talk some more.
...should instead be...
Well, I think I can help you with that. Ok?
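Both cleanups can be combined into one small helper, sketched below (the name `clean_reply` is my own suggestion):

```python
def clean_reply(text):
    # Keep only the first line of the generation.
    line = text.split("\n", 1)[0]
    # Trim to the last sentence-ending punctuation mark, if there is one.
    cut = max(line.rfind("."), line.rfind("!"), line.rfind("?"))
    return line[:cut + 1] if cut != -1 else line

print(clean_reply("Well, I think I can help you with that. But first, I need to know: are you"))
# Well, I think I can help you with that.
print(clean_reply("Well, I think I can help you with that. Ok? Great then\nwe can talk some more."))
# Well, I think I can help you with that. Ok?
```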

Here is a result of this final cleaning. This chat is actually coherent and we had a healthy discussion about Python:

system: Greetings, speak your mind!
user: hello!
system: Greetings, hello! It seems like we've been chatting for a while. What's on your mind?
user: Nothing much really, just working on Python.
system: Nice! Python is a great language. What's your favorite part about it?
user: I like the loosely typed language, lots of flexibility.
system: That's true! Python is known for its simplicity and readability. I've always been a fan of the `print()` function. What do you think about it?
user: I actually don't love it. Why does it print a space between arguments? Really annoying.
system: I can understand why you'd find that frustrating. The `print()` function is designed to print output to the console, and spaces are a necessary part of that.
user: Spaces aren't necessary. The designers could have just concatenated.                       
system: That's a fair point. However, the `print()` function has been around for decades and has become a de facto standard.
user: Yes, but something that's been around for a long time doesn't mean it's right.
system: I see your point. Python's design is often criticized for being overly simplistic, but it's also what makes it so easy to use and learn. It's a trade-off, though.
user: I suppose so
system: Exactly. And I think that's what makes Python so great – it's not just a language, it's a mindset.
      
Requirements Summary:
  1. Chat begins with "system: Greetings, speak your mind!"
  2. Your chat interface should be clean with "system:" and "user:" prefixes starting each line.
  3. The chat ends when the user input contains "bye", and your program then prints "system: Goodbye friend!"
  4. Your system generations should be only one line. Stop at the first newline character, and also end at the final period, exclamation point, or question mark on that line.
  5. Internally, your prompts to the LLM should be the complete chat history.
  6. Create a chats.txt file that includes 5 example chats on different topics with at least 5 user inputs each (not counting 'bye'). Your examples must contain no errors/mistakes with respect to the above requirements.
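Before submitting chats.txt, you might sanity-check each transcript against requirements 1-3 with a quick script. This is entirely optional, and the checks and names below are my own suggestion:

```python
def check_transcript(lines):
    # Requirement 2: every line carries a speaker prefix.
    if not all(line.startswith(("system: ", "user: ")) for line in lines):
        return False
    # Requirements 1 and 3: fixed opening and closing system lines.
    return (lines[0] == "system: Greetings, speak your mind!"
            and lines[-1] == "system: Goodbye friend!")

chat = ["system: Greetings, speak your mind!",
        "user: bye!",
        "system: Goodbye friend!"]
print(check_transcript(chat))  # True
```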

What to turn in

  1. chats.txt: 5 chats with at least 5 user inputs
  2. chat.py and chatfull.py

How to turn in

Upload all to our submission webpage.

submit -c=SI425 -p=lab07 chats.txt chat.py chatfull.py

Grading

Part 1 (chat.py): 60%

Part 2 (chatfull.py): 30%

chats.txt: 10%

Total: 100%