- Due before 23:59 on Tuesday, November 1
- Submit using the submit program
301 proj 2.
Keeping passwords secret is a big, and very difficult, part of keeping many systems secure. It would obviously be a very poor idea to store a user's password in plaintext on the system. But then how can the system know that someone's entered password is correct?
The way this normally works is that in place of storing the password, the
system instead stores a hash of the password in a shadow file. This
way, when somebody enters their password, the system merely needs to hash that
string, and compare the hash to the stored hash. For an example of this, on a
linux machine or virtual machine that you have sudo access on, have a look at
The advantage of this approach is that even if that shadow file is compromised, attackers can't reverse-compute the password given the hash, so the password remains safe... or does it?
People don't use random passwords. In fact, there is a great deal of repetition of passwords across users, and across sites. So, suppose you had a shadow file. Couldn't you pre-compute the hashes of common passwords, and see if any of the hashes match the ones in your file? Of course you could.
There are ways to combat this (such as salting your passwords), but unfortunately many websites still don't do that, and a number of recent, high-profile breaches have resulted in users' passwords being revealed from exactly this kind of attack. Since these brute-force tables remain an often-effective technique for discovering passwords, and in order to understand that we want to know exactly how fast such an attack can be!
The following paragraphs refer to the files inside a tarball called
proj2.tar.gz, which you can extract by
tar xzvf proj2.tar.gz.
You can also access some of the files with the links below.
As a warning, the contents contain real-world data, with the associated
profanity, racism, etc.
You can get all these files at once by downloading the file
tar xzvf proj2.tar.gz
RockYou is a Silicon Valley company that deals with web advertising, social
media games, and other techno-buzzwordy-things. What they do isn't so
relevant to us; what is relevant is that they are historically just atrocious
at security. In 2009, their account passwords (which they stored in
plaintext) were stolen and made public; this set of passwords is now one of
the go-to datasets to see the relative frequency of user-defined passwords.
rockyouwithcount.txt contains those passwords, along with the number
of times that password was used.
shortrockyouwithcount.txt are subsets of that list, useful for when
you don't want to deal with 14 million entries. These files are sorted by
I have also gone ahead and precomputed the MD5 hashes for those
passwords. You can see those hashes in
there medium and short versions, corresponding to the *rockyouwithcount
files). These are (almost) sorted by hash value, though there are a few
Finally, you can generate a shadow file using
$ python3 shadowgen.py rockyouwithcount.txt 5000 shadow.txt answers.txt
Once this is done running, shadow.txt will contain 5000 username-hash
pairs, where the usernames are random English words, and the passwords are
randomly selected from the list
rockyouwithcount.txt, in the same
frequency as indicated by their counts. In other words, the hash for
123456 will show up much more frequently than the hash
pookie168. answers.txt will contain the username-password pairs
in plain text.
Your goal on this is to identify the passwords associated with the hashes
in a shadow file. So, what you'll need is a Map, where the keys are hashes,
and the values are plaintext passwords. The faster your Map works, the faster
your program will run. I've started your code for you in
brute.py. All your code must go in the indicated location, so
that it can be timed without taking into account the hard drive access or the
time it takes to read the hashes and print the answers.
It's fine to use any of the built-in Python modules as long as you've built your own data structures and written that part of the code yourself from scratch. And as usual, keep in mind the course honor policy on collaboration for projects.
Step 0: Get the program working using python's built-in map, the dictionary. This will help you understand just what it is you're doing with this project.
Step 1: up to 72 points: Implement your map, and get your program working, using your Treap from Lab 8, or a BST with randomly shuffled input order.
Step 2: up to 95 points: Use anything else we've learned which is faster (runs reasonably quickly on the medium-sized password list). The only exception is Treaps or using a BST with shuffled input (too easy). This data structure must be entirely implemented yourself! Your program should be able to handle the medium-sized list.
Step 3: up to 100 points: Add something we haven't learned. Maybe it's a data structure you found online (you may learn about data structures from anywhere you want, but it must be cited, and the implementation must be your own). Maybe it's some way of leveraging the password counts. Maybe it's parallelism. Try your faster program on the full list and large shadow files (30,000 or so users).
BONUS for everyone: If anyone in the class gets their program to run faster than mine, everyone will get +2 points on this project.
To be explicit, for a program that works correctly, your workflow will look like this:
- Generate a shadow file to be cracked, and the plaintext passwords
associated with that shadow file, using the
shadowgen.pyprogram that I'm giving you.
- Given that shadow file, crack it with brute.py. The output should be username/password pairs, one per line, separated by a tab character, just like the plaintext "answers" file created by shadowgen.py.
README.txt which details which step you've achieved, and what you've
implemented to make it work. What are the runtimes of add and get in your
It's OK to include any other *.py files in your submission, but the
brute.py should represent the best solution you have.