Lab 7: Hands-On LLMs with Llama

Due: next week

Motivation

This lab gives you hands-on experience using a programmatic interface to an LLM, specifically Meta's open-source Llama models. Most people can't run these models on their laptops due to their size, but since you're in the amazing computer science department, you are currently in the only AI GPU lab on the Yard! Each workstation has a decent small GPU that can handle 7B and 13B parameter models. This lab will show you how to run the 7 billion parameter version. The "medium" models have 70 billion parameters, so you'll get a good sense of how much computation LLMs require.

Generative AI: GenAI tools are NOT permitted as code assistants on this lab. I want you to understand how straightforward it is to use an LLM library, and to play directly with its input/output.

Plus, GenAI code for LLM usage tends to be obtuse and overly complex. What we show you in this lab is the simplest way to do it.

GPU Needed!

This lab loads a 13 GB model from disk and requires a GPU to run the transformer network. Your lab machine has a decent GPU for this. You cannot run this on your laptop, nor through VSCode logged into ssh.cs.usna.edu (your normal setup). The ssh server does not have a GPU for you.

Before leaving lab, type hostname into your terminal and save the address that is shown (e.g., lnx1472832govt). If you work on this outside the lab, open a terminal and type ssh lnx1472832govt to connect directly to your GPU lab machine. Run your program from that terminal. You can still open VSCode and edit your code there; just don't run it from your default VSCode setup unless you ssh directly.

Install Llama

Create a new mamba environment:

mamba create -n llama
mamba activate llama

Install pytorch and GPU libraries:

mamba install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Run this Python test to make sure the GPU libraries installed successfully:

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
# You should see it print: True 12.1

Install Hugging Face transformers now that the GPU is set up:

mamba install -c conda-forge transformers datasets accelerate
pip install bitsandbytes

Finally, sign in to Hugging Face from your terminal so that you can use Llama models in your programs. This requires creating a secret token. Visit your tokens page (huggingface.co/settings/tokens) and click "Create new token". Choose the "Read" token type at the very top. Then click "Create token". Copy the generated string.

Now login from your terminal, and paste your string when it asks for your token:

huggingface-cli login


About Llama-2

Meta trained their first foundation model, Llama, in early 2023. Spurning the closed approach of OpenAI with ChatGPT and Google with Gemini, they open-sourced Llama's code and its trained parameters. This gave researchers (and private companies) access to a GPT competitor for actual usage. Llama-2 was released later in 2023 with multiple model sizes (7, 13, and 70 billion parameters).

The smallest 7-billion-parameter model can be loaded into memory on a high-memory machine, and can run on a single GPU if you trim the precision of its floats. Lucky you, you're in the only AI GPU lab on the Yard! We have a decent GPU in each machine. You must run your code on one of these machines; it will not run in a reasonable amount of time on your laptop.

This graph shows how big our favorite LLMs are. Llama-7B is near GPT-J on the chart; it barely shows up on the left side. I want you to take away a couple of things:

  1. Today's popular models are massive and take a lot of compute power to run.
  2. You will experience slower generation speed on a single GPU, but even though the model is "small" at 7B parameters, its chat performance is decent.

Part 1: Word Generation

Your first task is to create your own ChatLlama program with which a user can chat. The learning objective is to understand how to access Llama in Python, and then how to create a fun interactive program.

Your program should look like this interaction, though your actual output will obviously differ because the default generation parameters include some randomness. If you still ask your instructor whether it's ok that your generated output is different from the following example, you will lose 5 points on this lab (no cap, brohim!).

> python3 chat.py
system: Greetings, speak your mind!
user: Who are you?
system: and why do you do this?
user: Where is Hopper Hall?
system: Hopper Hall, the first residence hall on the north side of the campus, was constructed in 1925. It was
user: I like Annapolis.
system: and I’m glad I got to see it, but I'm also glad to be leaving. sierp 25, 2018
I'm sure most of you have
user: bye!
system: Goodbye friend!

The system's side of the interaction isn't good. We'll fix that in a bit. But first, below is example code to load the Llama-7B model and produce one generation.

import torch
from transformers import pipeline, BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer

# Load the model from our pre-downloaded directory
ID = "/courses/nchamber/nlp/huggingface/Llama-2-7b-hf"

# Define quantization settings (4-bit) to reduce model size when loaded.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # enable 4-bit quantization
    bnb_4bit_quant_type="nf4",          # use "normal float 4" quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in half precision
)

# Create the generation pipeline.
tokenizer = AutoTokenizer.from_pretrained(ID, local_files_only=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ID,
    quantization_config=bnb_config,
    dtype=torch.float16,
    device_map="auto",
    local_files_only=True,
    trust_remote_code=True,
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Prompt the model.
output = pipe("Where is Hopper Hall?", max_new_tokens=40)
print(output[0]['generated_text'])

Run the above in a simple program to see it work. Do you see how the original prompt ("Where is Hopper Hall?") is also included in the generated text? You'll want to remove that before printing it out when you make your chat bot.
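
For example, one minimal way to drop the echoed prompt is to slice it off the front of the generated text. This is just a sketch; the variable names are only illustrative:

prompt = "Where is Hopper Hall?"
output = pipe(prompt, max_new_tokens=40)
reply = output[0]['generated_text'][len(prompt):].strip()  # drop the echoed prompt
print(reply)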

Now create your chat bot to mimic the behavior in the above chat example. Keep the following in mind:

Important things to do! This is a language model that just predicts the next word. If the user types "hello", it will predict a word to follow that. This is different from "Hello!". One helpful trick is to put a newline character after every input. This tells the model that the prior word/phrase is finished, and it is then more likely to generate "hello" in return! You can play around with how to formulate your prompt for the best chat.
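
Here is a rough sketch of what the chat loop might look like, assuming the pipe object from the code above. Treat it as a starting point, not the required structure:

# Sketch of a basic chat loop (assumes `pipe` from the earlier example).
print("system: Greetings, speak your mind!")
while True:
    user_input = input("user: ")
    if "bye" in user_input:
        print("system: Goodbye friend!")
        break
    prompt = user_input + "\n"   # the newline signals the user's phrase is finished
    output = pipe(prompt, max_new_tokens=40)
    reply = output[0]['generated_text'][len(prompt):]  # strip the echoed prompt
    print("system: " + reply)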

Part 1b: Prompting

Your chat output is probably not great. This is a foundation model that just continues your input, so the model isn't necessarily predicting words based on a chat context. For this part, edit your program to build a prompt that frames the user's input as coming from a user in a chat, with the model as the system replying in that chat.

TIP: Think about using "user:" and "system:" prefixes and newlines to encourage the LLM to respond as a system with its own reply, not just continuing the user's utterance.
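
For instance, a prompt shaped like the following (just a sketch; experiment with the exact format) nudges the model to answer as the system rather than continue the user's sentence:

prompt = "user: " + user_input + "\nsystem:"
output = pipe(prompt, max_new_tokens=40)
reply = output[0]['generated_text'][len(prompt):]  # the text after "system:" is the reply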

REQUIRED: Split off newlines in the system reply. If the generated text contains one or more newlines, only show the first line. For instance:

Well, I think I can help you with that. Ok? Great then
we can talk some more.
...should instead be...
Well, I think I can help you with that. Ok? Great then
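
Assuming the generation is in a variable named reply, one line handles this:

reply = reply.strip().split("\n")[0]   # keep only the first line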

After some simple but clear prompting changes, you should now have a more concise and coherent chat like this:

system: Greetings, speak your mind!
user: Who are you?
system: I am a user!
user: Where is Hopper Hall
system: "Hopper Hall is located on the second floor of the west wing of the Student Union."
user: That doesn't sound right, how do you know that?
system: I don't know, but I've always been told that.
user: Well ok, have you been to Annapolis?
system: It's in the same state as DC.
user: Ha no
system: Yes, you can.
user: bye
system: Goodbye friend!

Part 2: Chat with Context

Copy chat.py to chatfull.py

Now that you have a basic chat, let's help the model do better. You don't really want to send just one chat message at a time. Give the LLM the entire chat history, so that it can better represent in its hidden states what the chat is about! It can also then reference past messages. In this part, you must keep track of the entire chat history, including the LLM's output, and build one big prompt string that contains everything, like so:

prompt = "system: Greetings, speak your mind!
user: hello!
system: Greetings, hello! It's nice to meet you! I'm feeling a bit overwhelmed by all the information and choices you present to me. Can you help me narrow down my options?
user: I didn't give you any options.
system: I'm sorry, I must have misinterpreted the conversation. You're correct, you didn't provide any options. I was just trying to initiate a conversation. I'd like to ask for
user: Can you help me instead?
system: Of course! I'd be happy to help you. Can you tell me what you'd like to talk about or ask? What's on your mind?
user: Where is the Naval Academy?
system:"	

Send this entire prompt to your LLM when generating. You'll see much more coherent chats, like the one above!
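
A sketch of how the history might accumulate turn by turn, again assuming the pipe object from Part 1 (the exact prompt format and variable names are up to you):

# Grow one big prompt string across the whole conversation.
history = "system: Greetings, speak your mind!\n"
print("system: Greetings, speak your mind!")
while True:
    user_input = input("user: ")
    if "bye" in user_input:
        print("system: Goodbye friend!")
        break
    history += "user: " + user_input + "\nsystem:"
    output = pipe(history, max_new_tokens=40)
    reply = output[0]['generated_text'][len(history):].strip().split("\n")[0]
    history += " " + reply + "\n"    # keep the model's reply in the history too
    print("system: " + reply)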

Your last piece is to clean up remaining problems with the generated output. Sometimes the generation is cut off when it reaches your max token length. Edit your program to check for really long generations (over 120 characters). If a generation is too long, find the last period, exclamation point, or question mark and cut it off there.

Well, I think I can help you with that because I know lots of things. But first, I need to know why you are here and what you're trying to
...should instead be...
Well, I think I can help you with that because I know lots of things.
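
A sketch of that cut, using the 120-character threshold from above:

if len(reply) > 120:
    # Find the last sentence-ending punctuation and end the reply there.
    cut = max(reply.rfind("."), reply.rfind("!"), reply.rfind("?"))
    if cut != -1:
        reply = reply[:cut + 1]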

Here is a result of this final cleaning. This chat is actually coherent and the context slowly builds to a fun back-and-forth:

system: Greetings, speak your mind!
user: Hello!
system: Hey there!
user: How you doing friend?
system: Pretty good, and yourself?
user: Great thanks, just writing some Python code.
system: What's up?
user: I told you, writing Python
system: Hey, that's cool! What are you making?
user: I'm making a chatbot
system: Neat! What's it called?
user: Good question I didn't think about giving it a name. How about ChatterBox
system: That's a good name. So, what's it do?
user: It uses Meta's LLama model and just does word generation
system: That's really cool! What's it generate?
user: Right now a boring but friendly chat interaction
system: I see. What do you want it to do?
user: I'd like it to ask me interesting questions
system: That sounds like a good idea. What kinds of questions do you want it to ask?
user: Interestingly, it seems to me getting better at asking questions already!
system: That's great!
user: Do you remember the name of the chatbot?
system: ChatterBox
user: Well done friend, what's your name?
system: My name is ChatterBox
user: Brilliant
system: I think it's pretty good.
user: bye
system: Goodbye friend!

Requirements Summary:
  1. Chat begins with "system: Greetings, speak your mind!"
  2. Your chat interface should be clean with "system:" and "user:" prefixes starting each line.
  3. The chat ends when the user input contains "bye", and your program then prints "system: Goodbye friend!"
  4. Your system generations should only be one line. Stop at the first newline character. However, you should also cut off long generations, ending on the final period, exclamation point, or question mark on that line.
  5. Internally, your prompts to the LLM should include the complete chat history.
  6. When you tell the chatbot your name "Jonah", it should respond correctly when you later ask "What is my name?"
  7. Create a chats.txt file that includes 5 example chats on different topics, with at least 5 user inputs in each (not counting 'bye'). Your examples should contain no errors/mistakes with respect to the above requirements.

What to turn in

  1. chats.txt: 5 chats with at least 5 user inputs. One should illustrate giving a name and the system repeating it later in the chat.
  2. chat.py and chatfull.py

How to turn in

Upload all to our submission webpage.

submit -c=SI425 -p=lab7 chats.txt chat.py chatfull.py

Grading

Part 1 chat.py 50%

Part 2 chatfull.py 40%

chats.txt 10%

Total: 100%