Due: Friday, March 10th
Your scheduled presentation time:
Grading Rubric

A: Solution meets the stated requirements and is completely correct. Presentation is clear, confident, and concise.

B: The main idea of the solution is correct, and the presentation was fairly clear. There may be a few small mistakes in the solution, or some faltering or missteps in the explanation.

C: The solution is acceptable, but there are significant flaws or differences from the stated requirements. Group members have difficulty explaining or analyzing their proposed solution.

D: Group members fail to present a solution that correctly solves the problem. However, there is clear evidence of significant work and progress towards a solution.

F: Little to no evidence of progress towards understanding the problem or producing a correct solution.

Problem | Final assessment

Instructions: Review the course honor policy: you may not use any human sources outside your group, and must document anything you used that's not on the course webpage.

This cover sheet must be the front page of what you hand in. Use separate paper for your written solutions outline and make sure the pages are neatly done and in order. Staple the entire packet together.

Comments or suggestions about this problem set:
Comments or suggestions about the course so far:
Citations (be specific about websites):

1 Popularity contest

An ice cream store recently sampled many flavors for its top customers. Each customer then voted for their favorite. Now the ice cream store wants to know which flavors were really popular and received more than one-third of the votes.

So here's your task: given a list of numbers `A` of length $n$, find all numbers `x` that occur more than $n/3$ times in `A`. These are the popular numbers.

Note that there can be 0, 1, or 2 popular numbers. (It's impossible to have three things that all occur *more* than one-third of the time.)
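
One way to see this, using counts $c_1, c_2, c_3$ (notation introduced here only for illustration) for three hypothetical distinct popular numbers: each count would exceed $n/3$, so

\[ c_1 + c_2 + c_3 > \frac{n}{3} + \frac{n}{3} + \frac{n}{3} = n, \]

which is impossible, because the counts of distinct values in a list of length $n$ can sum to at most $n$.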

  1. The first, basic algorithm you think of is the following:
    Algorithm: popular_basic(A)
    Input : array A of n numbers
    Output: list of the "popular numbers"
    apop = []
    for i from 0 to n-1 do
       if count(A,A[i],0,n-1) > n/3
         if A[i] not in apop
           add A[i] to apop
    return apop

    Algorithm: count(A,x,i,j)
    Input : array A of numbers, a value x and a range i..j in A
    Output: number of times x occurs in A[i],...,A[j]
    c = 0
    for each a in A[i],...A[j] do
      if a == x
        c = c+1
    return c
    What's the running time of this `popular_basic` algorithm, in terms of $n$?
  2. Here's a cleverer algorithm for the same thing:
    Algorithm: popular_better(A,i,j)
    Input : array A, and indices i <= j that define an n-element range in A
    Output: list of the "popular numbers"
    apop = []
    if i == j
      add A[i] to apop
    else
      mid = (i+j)/2, n = j - i + 1 ← note: n is the number of elements in A[i],...,A[j]
      L = popular_better(A,i,mid)
      for each x in L do
        if count(A,x,i,j) > n/3 ← note: this is the original range i..j!
          add x to apop
      U = popular_better(A,mid+1,j)
      for each x in U do
        if x not in apop and count(A,x,i,j) > n/3  ← note: this is the original range i..j!
          add x to apop
    return apop
    Convince me that this algorithm is still correct, i.e., it always returns all the popular elements in `A`.
  3. Analyze the worst-case running time of `popular_better`, in terms of $n$, the length of `A`. I would like a big-$\Theta$ bound. [Note: if you end up with a recurrence relation that is the same as one we've analyzed elsewhere, you can just refer to that analysis rather than recreate it.]
  4. Suppose the list `A` were already sorted. Describe how that would allow you to find the popular numbers even faster, in $O(n)$ time.
  5. **CHALLENGE**: Plus 1% to everyone in your section if you can come up with a $\Theta(n)$-time algorithm for this problem, even when `A` is not sorted. I promise that it's possible! Note: you may not use a hash table and simply assume the hash function is good so you get $O(1)$ lookups all the time, nor may you make any assumptions about the sizes of the numbers occurring in `A`.
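
If you want to experiment with the two algorithms above, here is a rough Python transcription. It is only a sketch: the base case of `popular_better` (a one-element range returns that element) and the "add to `apop`" steps are my reading of the pseudocode, and `mid` is computed with integer division.

```python
def count(A, x, i, j):
    """Number of times x occurs in A[i], ..., A[j] (inclusive range)."""
    c = 0
    for a in A[i:j + 1]:
        if a == x:
            c += 1
    return c


def popular_basic(A):
    """Check every element against the whole array."""
    n = len(A)
    apop = []
    for i in range(n):
        if count(A, A[i], 0, n - 1) > n / 3 and A[i] not in apop:
            apop.append(A[i])
    return apop


def popular_better(A, i, j):
    """Recursive version; requires i <= j."""
    apop = []
    if i == j:
        # Assumed base case: a single element is popular in its own range.
        apop.append(A[i])
    else:
        mid = (i + j) // 2
        n = j - i + 1  # number of elements in A[i], ..., A[j]
        for x in popular_better(A, i, mid):
            if count(A, x, i, j) > n / 3:  # counted over the full range i..j
                apop.append(x)
        for x in popular_better(A, mid + 1, j):
            if x not in apop and count(A, x, i, j) > n / 3:
                apop.append(x)
    return apop


A = [1, 2, 1, 2, 3]
print(popular_basic(A))                   # [1, 2]
print(popular_better(A, 0, len(A) - 1))   # [1, 2]
```

Comparing the two functions on small random lists is a quick sanity check, but it is no substitute for the correctness argument and running-time analysis asked for above.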

2 Tug of War

There are $n$ Mids trying out for the varsity tug-of-war team. Some of them are strong and some of them are weak. For simplicity, assume that all the strong Mids have *exactly* the same strength, as do all the weak Mids. You can also assume that there is at least one strong Mid and at least one weak Mid. (In particular, $n$ must be at least two.)

Your task is to determine who the strong Mids are, to decide who will be on the team. And the only tool you have to determine who is strong and weak is running *contests*. A contest involves pitting some Mids against some others in a tug-of-war, and the outcome can be either that one side wins, or the other side wins, or they tie. This pseudocode might help clarify:

# Calling this function represents a single contest.
def winner(group1, group2):
    if group1 wins:
        return group1
    elif group2 wins:
        return group2
    else:
        return 'tie'

There can be any number of Mids in any contest, but it should always be the same number on each side of the contest. The side with more strong Mids (or, equivalently, with fewer weak Mids) wins. For example, here is an algorithm for $n=3$:

def strongOf3(M0, M1, M2):
    w1 = winner({M0}, {M1})
    if w1 == 'tie':
        if winner({M1}, {M2}) == {M1}:
            return {M0, M1}
        else:
            return {M2}
    else:
        if winner(w1, {M2}) == 'tie':
            return w1 + {M2}
        else:
            return w1

And here's an algorithm for $n=4$:

def strongOf4(M0, M1, M2, M3):
    w1 = winner({M0}, {M1})
    w2 = winner({M2}, {M3})
    if w1 == 'tie' and w2 == 'tie':
        return winner({M0,M1}, {M2,M3})
    elif w1 == 'tie':
        # w1 is a tie, but w2 is not
        if winner({M0}, w2) == 'tie':
            return {M0, M1} + w2
        else:
            return w2
    elif w2 == 'tie':
        # the opposite here; w1 is not a tie
        if winner({M2}, w1) == 'tie':
            return w1 + {M2, M3}
        else:
            return w1
    else:
        return w1 + w2
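
To try these out, the pseudocode can be turned into runnable Python by simulating contests against a hidden set of strong Mids. The sketch below makes some assumptions of its own: Mids are represented by integers, groups by `frozenset`s, `+` on groups becomes set union, and the helper names `make_winner` and `check` are hypothetical.

```python
from itertools import product


def make_winner(strong):
    """Build a contest function for a hidden frozenset of strong Mids."""
    def winner(group1, group2):
        assert len(group1) == len(group2), "both sides need the same size"
        s1 = len(group1 & strong)  # strong Mids on side 1
        s2 = len(group2 & strong)  # strong Mids on side 2
        if s1 > s2:
            return group1
        elif s2 > s1:
            return group2
        else:
            return 'tie'
    return winner


def strong_of_3(winner, M0, M1, M2):
    w1 = winner(frozenset({M0}), frozenset({M1}))
    if w1 == 'tie':
        if winner(frozenset({M1}), frozenset({M2})) == frozenset({M1}):
            return frozenset({M0, M1})
        else:
            return frozenset({M2})
    else:
        if winner(w1, frozenset({M2})) == 'tie':
            return w1 | {M2}
        else:
            return w1


def strong_of_4(winner, M0, M1, M2, M3):
    w1 = winner(frozenset({M0}), frozenset({M1}))
    w2 = winner(frozenset({M2}), frozenset({M3}))
    if w1 == 'tie' and w2 == 'tie':
        return winner(frozenset({M0, M1}), frozenset({M2, M3}))
    elif w1 == 'tie':
        if winner(frozenset({M0}), w2) == 'tie':
            return frozenset({M0, M1}) | w2
        else:
            return w2
    elif w2 == 'tie':
        if winner(frozenset({M2}), w1) == 'tie':
            return w1 | frozenset({M2, M3})
        else:
            return w1
    else:
        return w1 | w2


def check(algorithm, n):
    """Try every strong/weak assignment with at least one of each."""
    for bits in product([False, True], repeat=n):
        strong = frozenset(i for i, b in enumerate(bits) if b)
        if 0 < len(strong) < n:
            if algorithm(make_winner(strong), *range(n)) != strong:
                return False
    return True


print(check(strong_of_3, 3), check(strong_of_4, 4))  # True True
```

The same `check` helper can be pointed at an algorithm of your own for small $n$, as long as it takes the contest function as its first argument.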

  1. State the number of contests performed by `strongOf3` and by `strongOf4` in each of their worst cases.
  2. Give a **lower bound** on the number of contests required to determine who the weak Mids are for an input of size $n$, in the worst case. This bound should hold for any algorithm, including algorithms not yet invented!

    State your exact lower bound as a function of $n$, showing all your work. Then state what the asymptotic big-$\Omega$ bound is that results, simplified as much as possible.

    As an example of what I'm asking for, in class we showed that sorting requires at least $\lg (n!)$ comparisons (the exact bound), which is $\Omega(n\log n)$.

  3. Give an algorithm for any $n$ that determines who the strong Mids are. In describing your algorithms, you can call the Mids by their number like an array: `M[0]`, `M[1]`, ..., `M[n-1]`.

    Analyze the number of contests that your algorithm performs in the worst case (NOT the number of primitive operations, just the number of contests).

  4. Is your algorithm asymptotically optimal? Say why or why not. (Hint: there is an asymptotically optimal algorithm for this problem!)
  5. **Ultra bonus**: An algorithm is *exactly optimal* (not just asymptotically optimal) if the number of contests is exactly equal to the lower bound, for all values of $n$. You need to either develop an exactly optimal algorithm for the tug-of-war problem, or prove that no such algorithm could possibly exist.

3 Dr. Roche is rich and you are NIST

Dr. Roche has developed a brand new public-key cryptosystem called DAN and the government has agreed to adopt DAN as a new encryption standard to replace RSA. Your task is to make recommendations to the government on what key lengths to use with the DAN cryptosystem in various settings.

All that you know about the DAN cryptosystem is that it consists of four functions, KEYGEN, ENCRYPT, DECRYPT, and CRACK, which are the best ways the creator of DAN (me!) has come up with to do those four tasks. All of these functions have a running time which depends on the key length $k$ as follows:

  • KEYGEN runs in $O(k^4)$ time.
  • ENCRYPT runs in $O(k\log k)$ time.
  • DECRYPT runs in $O(k^2)$ time.
  • CRACK runs in $O((1.1)^k)$ time.
You also have timing information on these algorithms, running on my laptop, using only a single CPU core, and with keylength $k=100$:
  1. KEYGEN took 10 seconds with keylength $k=100$.
  2. ENCRYPT took 30 seconds with keylength $k=100$.
  3. DECRYPT took 1 second with keylength $k=100$.
  4. CRACK took 20 hours with keylength $k=100$.
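
One possible way to turn these measurements into estimates for other key lengths is sketched below. It rests on a strong assumption that is mine, not part of the problem: that each running time is exactly proportional to the growth function inside its big-O (big-O by itself only gives an upper bound), with the constant calibrated from the single $k=100$ measurement. The names `GROWTH`, `MEASURED_SECONDS`, and `estimate_seconds` are hypothetical.

```python
import math

MEASURED_K = 100

# Timings from the list above, converted to seconds.
MEASURED_SECONDS = {
    "KEYGEN": 10.0,
    "ENCRYPT": 30.0,
    "DECRYPT": 1.0,
    "CRACK": 20 * 3600.0,
}

# Growth functions taken from the stated big-O bounds (assumed tight).
GROWTH = {
    "KEYGEN": lambda k: k ** 4,
    "ENCRYPT": lambda k: k * math.log2(k),
    "DECRYPT": lambda k: k ** 2,
    "CRACK": lambda k: 1.1 ** k,
}


def estimate_seconds(task, k):
    """Scale the k=100 measurement by the ratio of growth functions."""
    g = GROWTH[task]
    return MEASURED_SECONDS[task] * g(k) / g(MEASURED_K)


print(estimate_seconds("KEYGEN", 200))   # 160.0
print(estimate_seconds("DECRYPT", 200))  # 4.0
```

Estimates like these describe the same laptop and single core as the measurements; adjusting for other hardware needs a separate, cited scaling factor.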
Based on the information above, and any other research you do, decide on key length recommendations for the following scenarios. There is of course not a single correct answer; rather I am interested in your logical reasoning and how you came up with the lengths you recommend.

You might need to do some outside research on things like the relative speed of supercomputers, cell phones, and laptops, or the projected costs of computing in the future. That's great, but be sure to cite any sources that you used.

Feel free to be somewhat speculative, but don't expect to get credit for "cute" answers such as "key length 0, because I already installed a keylogger" or "tell the government that DAN is not properly vetted and shouldn't be used yet". You have to come up with plausible, numerical recommendations!

  1. What keylength do you recommend for a smartphone app for anonymously sending pictures to a specified individual, which are deleted as soon as they are viewed? Assume that a new key is generated for each picture that is sent.
  2. What keylength do you recommend for the security certificate of a comic book dealer's website in Arkansas? The total revenue over this website is 1 million dollars per year, and the same certificate will be used for 10 years before it expires.
  3. What keylength do you recommend for nuclear weapons launch codes? Assume these are re-generated every day and old ones won't ever work again.

4 Calculating a spreadsheet

A simple spreadsheet consists of an infinite grid of cells, each indexed by a row and column position starting at (0,0). Each cell is either blank, contains data, or contains a formula. We will represent a spreadsheet by a list of non-empty cells defined either as
  • (d (i,j) val) - where the meaning is that cell (i,j) contains data val, or
  • (f (i,j) ((i1,j1),...,(ik,jk)) f(x1,x2,...,xk)) - where the meaning is that cell (i,j) is the result of a calculation evaluating function f with argument x1 coming from cell (i1,j1), x2 coming from cell (i2,j2), etc.
Example description:

    (f (0,2) ((0,0),(0,1)) average)
    (d (1,1) 23)
    (f (1,2) ((1,1),(0,2)) sum)
    (d (0,0) 14)
    (f (2,2) ((0,2),(1,2)) average)
    (d (0,1) 16)

Example rendering as spreadsheet:

          0     1     2    ...
    0    14    16    15
    1          23    38
    2                26.5

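
To make the meaning of the description concrete, here is a small Python sketch that stores the example in a plain dictionary and evaluates a cell by naive recursion. The dictionary layout and the names `cells`, `FUNCTIONS`, and `value` are hypothetical choices for illustration only. The sketch does not produce an evaluation order, and it does not handle empty argument cells or cyclic dependencies.

```python
# Hypothetical layout: cell -> ("d", value) or ("f", argument cells, function name)
cells = {
    (0, 2): ("f", [(0, 0), (0, 1)], "average"),
    (1, 1): ("d", 23),
    (1, 2): ("f", [(1, 1), (0, 2)], "sum"),
    (0, 0): ("d", 14),
    (2, 2): ("f", [(0, 2), (1, 2)], "average"),
    (0, 1): ("d", 16),
}

FUNCTIONS = {
    "sum": lambda xs: sum(xs),
    "average": lambda xs: sum(xs) / len(xs),
}


def value(cell):
    """Value of a cell, recomputing every argument each time it is needed."""
    entry = cells[cell]
    if entry[0] == "d":
        return entry[1]
    _, args, name = entry
    return FUNCTIONS[name]([value(a) for a in args])


print(value((0, 2)))  # 15.0
print(value((1, 2)))  # 38.0
```

With these definitions, cell (2,2) is the average of 15 and 38, which is 26.5.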
  1. Describe the data structure you would use to store a spreadsheet described in the above syntax. (Hint, look to the problems below to decide exactly what that data structure is!)
  2. Give an efficient algorithm for evaluating a spreadsheet described in the syntax shown above, assuming it is already read in and stored in the structure you described in the previous part. Your algorithm should produce a sequence of locations of "function" cells \[ (i_1,j_1),(i_2,j_2),\ldots,(i_r,j_r) \] which is interpreted as "first evaluate $(i_1,j_1)$, then evaluate $(i_2,j_2)$, then evaluate $(i_3,j_3)$, etc.". For the example above, $(0,2),(1,2),(2,2)$ is OK output. However, $(1,2),(0,2),(2,2)$ is not, because evaluating cell $(1,2)$ requires the value of cell $(0,2)$, so $(1,2)$ can't be evaluated until after $(0,2)$.
  3. Define a reasonable notion of "input size" for this problem, and give a worst-case analysis of your algorithm in terms of this input size. Give a separate big-O for reading in and creating the data structure, and for the algorithm itself assuming the data structure exists.
  4. Not all spreadsheets can be evaluated. For example, if we have cell (3,4) defined like this:
    (f (3,4) ((8,5),(8,6),(8,7)) sum)
    but cell (8,7) is empty, then we can't evaluate the spreadsheet completely. More perniciously, we can't have a situation where cell (i1,j1) requires cell (i2,j2)'s value to compute, but cell (i2,j2) requires (i1,j1)'s value to compute. If we tried to evaluate a spreadsheet with a cyclic dependency like this, we'd end up in an infinite loop. Give an algorithm for determining whether a spreadsheet (described as shown in the example above) has a cyclic dependency that would cause it to be un-evaluatable ... which is a word I just made up.