This is the archived website of SI 486H from the Spring 2016 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

# Problem 77

Due: April 12
Points: 6

Write a program that uses a Bloom filter to check whether passwords are any good. Your program should display a prompt for the user to enter a password, and then if the password is either the same as (1) a common English word, or (2) a previously entered password, then you reject the password. If the password is accepted, you add it to the Bloom filter and move on.

Example:

1st password entered: edification
REJECTED (common English word)

ACCEPTED

REJECTED (common English word)

ACCEPTED

REJECTED (false positive in Bloom filter)

REJECTED (previously entered)

ACCEPTED

Of course, in the actual program, the reason a password is rejected would not be listed! But the false positives should be rare - in particular, your Bloom filter should have a probability of false positives less than 10%.

# Specifics

You will need to choose the following for your program to work properly:

• A list of common English words. Use the dictionary that is found in any standard Linux distribution at /usr/share/dict/words. These words are listed one per line, all lower-case.

• A function that maps strings to integers, for the English words. More detailed instructions on how to do this in C++, Java, and Python are below. In any case, your hash function should produce 32-bit integers.

• A cryptographic hash function to hash the passwords to a large integer, before they are hashed again in multiple ways and inserted into the Bloom filter. The idea is, you don't want to use a regular hash function for the passwords because then someone might be able to figure out someone's password based on collisions that occur. You can use any secure cryptographic hash function that you find out about, as long as it's clearly documented.

• A universal family of hash functions. You should use the Carter & Wegman hash functions, which choose a fixed large prime $$p$$, and then choose random $$a$$ and $$c$$ so that

$h(x) = ((ax + c) \bmod p) \bmod s$

Since $$p$$ is "fixed" here, you can choose $$p$$ to be any large prime of your choosing. "Large prime" here means that $$p$$ should be at least $$2^{31}$$.

Your program must randomly choose values for $$a$$ and $$c$$ every time it is run. This means that the false positives should be different every time the program is run.

• A properly-seeded PRNG to choose values $$a$$ and $$c$$ for each of your $$k$$ hash functions. You should be pros at this by now.

• A Bloom filter table size $$s$$ and number of hash functions $$k$$ that will guarantee at most $$1/10$$ probability of getting a false positive. You should try to make $$s$$ and $$k$$ as small as possible to achieve this error probability.

You can run wc /usr/share/dict/words to discover that there are 98569 English words in this dictionary that will go into your Bloom filter. You can assume that there are less than 1000 new passwords that will be entered, and so you can use $$n=100000$$ to do the calculations required to determine $$s$$ and $$k$$.

Explain your choices for $$p$$, $$s$$, and $$k$$ very clearly in comments in your code. Be sure to give references to any sources that you used.

# Strings to integers

• Python: There is a built-in function hash that will do this for you. For example, hash('corned beef') returns 1456869758 on the lab machines.
• Java: The hashCode() method in the String class does the same thing, more or less. You just do like "corned beef".hashCode() to get an int from that string.
• C++: There is no built in hashing in the current C++ standard. You may use this handy-dandy function instead:

int hash(string s) {
int mulbase = 97; // arbitrarily-chosen prime number
int curpow = 1; // this equals 97^i at step i.
int result = 0;
for (int i=0; i<s.length(); ++i) {
curpow *= mulbase;
result += s[i]*curpow;
}
return result;
}

# Submit

Submit your program according to the instructions on the submit page.