Problem 77

This is the archived version of this course from the Spring 2013 semester. You might find something more recent by visitning my teaching page.

Due: April 19
Points: 8

Write a program that uses a Bloom filter to check whether passwords are any good. Your program should display a prompt for the user to enter a password, and then if the password is either the same as (1) a common English word, or (2) a previously entered password, then you reject the password. If the password is accepted, you add it to the Bloom filter and move on.

Example:

1st password entered: edification
REJECTED (common English word)

ACCEPTED

REJECTED (common English word)

ACCEPTED

REJECTED (false positive in Bloom filter)

REJECTED (previously entered)

ACCEPTED

Of course, in the actual program, the reason a password is rejected would not be listed! But the false positives should be rare - in particular, your Bloom filter should have a probability of false positives less than 10%.

Specifics

You will need to choose the following for your program to work properly:

• A function that maps strings to integers. More detailed instructions on how to do this in C++, Java, and Python are below. In any case, your hash function should produce 32-bit integers.

• A list of common English words. Use the dictionary that is found in any standard Linux distribution at /usr/share/dict/words. These words are listed one per line, all lower-case.

• A universal family of hash functions. You should use the Carter & Wegman hash functions, which choose a fixed large prime $$p$$, and then choose random $$a$$ and $$c$$ so that

$h(x) = ((ax + c) \bmod p) \bmod s$

Since $$p$$ is "fixed" here, you can choose $$p$$ to be any large prime of your choosing. "Large prime" here means that $$p$$ should be at least $$2^{31}$$.

Your program must randomly choose values for $$a$$ and $$c$$ every time it is run. This means that the false positives should be different every time the program is run.

• A properly-seeded PRNG to choose values $$a$$ and $$c$$ for each of your $$k$$ hash functions. You should be pros at this by now.

• A Bloom filter table size $$s$$ and number of hash functions $$k$$ that will guarantee at most $$1/10$$ probability of getting a false positive. You should try to make $$s$$ and $$k$$ as small as possible to achieve this error probability.

You can run wc /usr/share/dict/words to discover that there are 98569 English words in this dictionary that will go into your Bloom filter. You can assume that there are less than 1000 new passwords that will be entered, and so you can use $$n=100000$$ to do the calculations required to determine $$s$$ and $$k$$.

Explain your choices for $$p$$, $$s$$, and $$k$$ very clearly in comments in your code. Be sure to give references to any sources that you used.

Strings to integers

• Python: There is a built-in function hash that will do this for you. For example, hash('corned beef') returns 1456869758 on the lab machines.
• Java: The hashCode() method in the String class does the same thing, more or less. You just do like "corned beef".hashCode() to get an int from that string.
• C++: There is no built in hashing in the current C++ standard. You may use this handy-dandy function instead:

int hash(string s) {
int mulbase = 97; // arbitrarily-chosen prime number
int curpow = 1; // this equals 97^i at step i.
int result = 0;
for (int i=0; i<s.length(); ++i) {
curpow *= mulbase;
result += s[i]*curpow;
}
return result;
}

Submit

Submit your program according to the instructions on the submit page.