Problem 90

This is the archived version of this course from the Spring 2013 semester. You might find something more recent by visitning my teaching page.

Multiple substring search

Due: April 26
Points: 5

In class, we talked about the Karp-Rabin algorithm for searching for a substring $$s$$ of length $$m$$ inside a larger text file $$T$$ of length $$n$$.

Now consider an extension of this problem: searching for multiple substrings $$s_1, s_2, s_3,\ldots, s_k$$ inside $$T$$. Assume each substring has the same length, $$m$$ bits. We want to know the first time any of the $$k$$ substrings occurs inside $$T$$.

This is an important problem! For example, you can imagine someone might want to search millions of emails for words like "bomb", "terrorist", "qaeda", "laden", and so on.

First, tell me what the running time would be if we simply repeated the KR algorithm $$k$$ times.

Next, come up with a faster way to do this. The idea I have in mind is to store the fingerprints of all $$k$$ substrings inside a Bloom filter, and then use this to check each substring in the text.

Yes, I said Bloom filter! Describe an improved algorithm, and tell me what the running time of your improved algorithm for multiple substring search is.