Due: Tues, 27 March
Your scheduled presentation time:
Grading Rubric

A: Solution meets the stated requirements and is completely correct. Presentation is clear, confident, and concise.

B: The main idea of the solution is correct, and the presentation was fairly clear. There may be a few small mistakes in the solution, or some faltering or missteps in the explanation.

C: The solution is acceptable, but there are significant flaws or differences from the stated requirements. Group members have difficulty explaining or analyzing their proposed solution.

D: Group members fail to present a solution that correctly solves the problem. However, there is clear evidence of significant work and progess towards a solution.

F: Little to no evidence of progress towards understanding the problem or producing a correct solution.

ProblemFinal assessment
1
2
3
4

Instructions: Review the course honor policy: you may not use any human sources outside your group, and must document anything you used that's not on the course webpage.

This cover sheet must be the front page of what you hand in. Use separate paper for the your written solutions outline and make sure they are neatly done and in order. Staple the entire packet together.

Comments or suggestions about this problem set:
Comments or suggestions about the course so far:
Citations (be specific about websites):

1 Popularity contest

An ice cream store recently sampled many flavors for its top customers. Each customer then voted for their favorite. Now the ice cream store wants to know which flavors were really popular and received more than one-third of the votes.

So here's your task: given a list of numbers `A` of length $n$, find all numbers `x` that occur more than $n/3$ times in $A$. These are the popular numbers.

Note that there can be 0, 1, or 2 popular numbers. (It's impossible to have three things that all occur *more* than one-third of the time.)

  1. The first, basic algorithm you think of is the following:
    Algorithm: popular_basic(A)
    Input : array A of n numbers
    Output: list of the "popular numbers"
    apop = []
    for i from 0 to n-1 do
       if count(A,A[i],0,n-1) > n/3
         if A[i] not in apop
           apop.append(A[i]);
    
    Algorithm: count(A,x,i,j)
    Input : array A of numbers, a value x and a range i..j in A
    Output: number of times x occurs in A[i],...,A[j]
    c = 0
    for each a in A[i],...A[j] do
      if a == x
        c = c+1
    return c
    	  
    What's the running time of this `popular_basic` algorithm, in terms of $n$?
  2. Here's a cleverer algorithm for the same thing:
    Algorithm: popular_better(A,i,j)
    Input : array A, and indices i <= j that define an n-element range in A
    Output: list of the "popular numbers"
    apop = []
    if i == j
      apop.append(A[i])
    else
      m = (i+j)/2, n = j - i + 1 ← note: n is the number of elements in A[i],...,A[j]
      L = popular_better(A,i,m)
      for each x in L do
        if count(A,x,i,j) > n/3 ← note: this is on the range i..j, not i..m!
           apop.append(x)
      U = popular_better(A,m+1,j)
      for each x in U do
        if x not in apop and count(A,x,i,j) > n/3  ← note: this is on the range i..j not m+1..j!
          apop.append(x)
    return apop
    	  
    Convince me that this algorithm is still correct, i.e., it always returns all the popular elements in `A`.
  3. Analyze the worst-case running time of `popular_better`, in terms of $n$, the length of `A`. I would like a big-$\Theta$ bound. [Note: if you end up with a recurrence relation that is the same as one we've analyzed elsewhere, you can just refer to that analysis rather than recreate it.]
  4. Suppose the list `A` were already sorted. Describe how that would allow you to find the popular numbers even faster, in $O(n)$ time.
  5. **CHALLENGE**: Plus 1\% to everyone in your section if you can come up with a $\Theta(n)$-time algorithm for this problem, even when `A` is not sorted. I promise that it's possible! Note: you may not use a hash table and simply assume the hash function is good so you get O(1) lookups all the time, nor may you make any assumptions about the sizes of the numbers occurring in A.

2 Tug of War

There are $n$ Mids trying out for the varsity tug-of-war team. Some of them are strong and some of them are weak. For simplicity, assume that all the strong Mids have *exactly* the same strength, as do all the weak Mids. You can also assume that there is at least one strong Mid and at least one weak Mid. (In particular, $n$ must be at least two.)

Your task is to determine who the strong Mids are, to decide who will be on the team. And the only tool you have to determine who is strong and weak is running *contests*. A contest involves pitting some Mids against some others in a tug-of-war, and the outcome can be either that one side wins, or the other side wins, or they tie. This pseudocode might help clarify:

# Calling this function represents a single contest.
def winner(group1, group2):
    if group1 wins:
        return group1
    elif group2 wins:
        return group2
    else:
        return 'tie'
      

There can be any number of Mids in any contest, but it should always be the same number on each side of the contest. The side with more strong Mids (or, equivalently, with fewer weak Mids) wins. For example, here is an algorithm for $n=3$:

def strongOf3(M0, M1, M2):
    w1 = winner({M0}, {M1})
    if w1 == 'tie':
        if winner({M1}, {M2}) == {M1}:
            return {M0, M1}
        else:
            return {M2}
    else:
        if winner(w1, {M2}) == 'tie':
            return w1 + {M2}
        else:
            return w1
      
An here's an algorithm for $n=4$:
def strongOf4(M0, M1, M2, M3):
    w1 = winner({M0}, {M1})
    w2 = winner({M2}, {M3})
    if w1 == 'tie' and w2 == 'tie':
        return winner({M0,M1}, {M2,M3})
    elif w1 == 'tie':
        # w1 is a tie, but w2 is not
        if winner({M0}, w2) == 'tie':
            return {M0, M1} + w2
        else:
            return w2
    elif w2 == 'tie':
        # the opposite here; w1 is not a tie
        if winner({M2}, w1) == 'tie':
            return w1 + {M2, M3}
        else:
            return w1
    else:
        return w1 + w2
      

  1. State the number of contests performed by `strongOf3` and by `strongOf4` in each of their worst cases.
  2. Give a **lower bound** on the number of contests required to determine who the weak Mids are for an input of size $n$, in the worst case. This bound should hold for any algorithm, including algorithms not yet invented!

    State your exact lower bound as a function of $n$, showing all your work. Then state what the asymptotic big-$\Omega$ bound is that results, simplified as much as possible.

    For example of what I'm asking for, in class we showed that sorting requires at least $\lg (n!)$ comparisons (the exact bound), which is $\Omega(n\log n)$.

  3. Give an algorithm for any $n$ that determines who the strong Mids are. In describing your algorithms, you can call the Mids by their number like an array: `M[0]`, `M[1]`, ..., `M[n-1]`.

    Analyze the number of contests that your algorithm performs in the worst case (NOT the number of primitive operations, just the number of contests).

  4. Is your algorithm asymptotically optimal? Say why or why not. (Hint: there is an asymptotically optimal algorithm for this problem!)
  5. **Ultra bonus**: An algorithm is *exactly optimal* (not just asymptotically optimal) if the number of contests is exactly equal to the lower bound, for all values of $n$. You need to either develop an exactly optimal algorithm for the tug-of-war problem, or prove that no such algorithm could possibly exist.

3 Clans

Suppose you have a large number of people under surveillance. You know the name of each person, and you also know that these people are organized into clans. You don't know the names of these clans, or how many there are, but you are able to occasionally deduce, based on your surveillance, that two people are in the same clan. Your job is to be able to answer the question "what clan does person X belong to", giving that clan the name of one person from the clan. You get to choose which person is this "representative" as you go.

To be more specific: you continually read add($n$) messages, which tell you that $n$ is the name of a new person, and same($n1$,$n2$) messages, which tell you that $n1$ and $n2$ have been discovered to be from the same clan. Additionally, you get clan?($n$) messages to which you should respond by giving the clan $n$ belongs to, named by the representative you've chosen for it.

Example:
  add(joe)
  add(sue)
  add(eve)
  add(sam)
  same(joe,sam)
  clan?(sam)     ← might say "clan joe" (these are all "might" because you can choose who is the "representative")
  clan?(eve)     ← would say "clan eve" (at this point, eve is a clan of one, as far as you know)
  same(sam,eve)
  clan?(eve)     ← might say "clan joe"
The last bit shows the interesting part. When you discover two people, $n1$ and $n2$, are in the same clan, that means that the clans you thought $n1$ belonged to is in fact the same as the clan you thought $n2$ belonged to. So you have to "merge" them into one clan so that they all share the same representative.

Here's a very inefficient data-structure (VIDS) that supports add, same and clan? operations:

  • init : create $A$, an extensible array / arraylist of length-2-string-arrays. $A[i]$ will represent a person. $A[i][0]$ is the person's name, $A[i][1]$ is the name of $A[i][0]$'s clan representative.
  • add(n) : is implemented by a "push_back" on $A$ of $[n,n]$. Note: this means that initially $n$ is a clan of one. Time: this operation runs in amortized $O(1)$, since that's the time for a push_back on an extensible array.
  • clan?(n) : is implemented by searching $A$ for entry $i$ such that $A[i][0]$ is $n$, and then returning $A[i][1]$. Time: $O(m k)$, where $m$ is the size of $A$ and $k$ is a bound on the length of the names.
  • same(n1,n2) : is implemented by first letting c1 = clan?(n1) and c2 = clan?(n2), and then simply returning if c1 == c2, and otherwise by going through $A$ and doing if (A[i][1] == c2) A[i][1] = c1; (or vice versa if you want n2's representative to be the overall representative). This ensures that everyone in n2's clan gets joined to n1's clan. Time: $O(m k)$, where $m$ is the size of $A$ and $k$ is a bound on the length of the names.

  1. Let $T_w(r,k)$ be the worst case running time of VIDS on a sequence of $r$ operations involving names whose length is at most $k$ [note: we are starting from an empty VIDS]. Clearly, $T_W(r,k) \in O(r^2k)$. Prove $T_w(r,k) \in \Omega(r^2 k)$.
  2. Describe an improved data structure supporting these operations. It must be better than $T_w(r,k) \in \Theta(r^2k)$! Clearly describe init, add, clan? and same.
  3. Analyze $T_w(r,k)$ for your improved data structure.

4 Calculating a spreadsheet

A simple spreadsheet consists of an infinite grid of cells, each indexed by a row and column position starting at (0,0). Each cell is either blank, contains data, or contains a formula. We will represent a spreadsheet by a list of non-empty cells defined either as
  • (d (i,j) val) - where the meaning is that cell (i,j) contains data val, or
  • (f (i,j) ((i1,j1),....,(ik,jk)) f(x1,x2,...,xk)) - where the meaning is that cell (i,j) is the result of a calculation evaluating function f with argument x1 coming from cell (i1,j1), x2 coming from cell (i2,j2), etc.
Example description Example rendering as spreadsheet
(f (0,2) ((0,0),(0,1)) average)
(d (1,1) 23)
(f (1,2) ((1,1),(0,2)) sum)
(d (0,0) 14)
(f (2,2) ((0,2)(1,2)) average)
(d (0,1) 16)
	    
0 1 2 ...
0 14 16 15
1 23 38
2 20.5
  1. Describe the data structure you would use to store a spreadsheet described in the above syntax. (Hint, look to the problems below to decide exactly what that data structure is!)
  2. Give an efficient algorithm for evaluating a spreadsheet described in the syntax shown above, assuming it is already read-in and stored in the structure you described in the previous part. Your algorithm should produce a sequence of locations of "function" cells \[ (i_1,j_1),(i_2,j_2),\ldots,(i_r,j_r) \] which is interpreted as "first evaluate $(i1,j1)$, then evaluate $(i2,j2)$, then evaluate $(i3,j3)$, etc.". For the example above, $(0,2),(1,2),(2,2)$ is OK output. However, $(1,2),(0,2),(2,2)$ is not, because evaluating cell $(1,2)$ requires the value of cell $(0,2)$, so $(1,2)$ can't be evaluated until after $(0,2)$.
  3. Define a reasonable notion of "input size" for this problem, and give a worst-case analysis of your algorithm in terms of this input size. Give a separate big-O for reading in and creating the data structure, and for the algorithm itself assuming the data structure exists.
  4. Not all spreadsheets can be evaluated. For example, if we have cell (3,4) defined like this:
    (f ((8,5),(8,6),(8,7)) sum)
    but cell (8,7) is empty, then we can't evaluate the spreadsheet completely. More perniciously, we can't have a situation where cell (i1,j1) requires cell (i2,j2)'s value to compute, but cell (i2,j2) requires (i1,j1)'s value to compute. If we tried to evaluate a spreadsheet with a cyclic dependency like this, we'd end up in an infinite loop. Give an algorithm for determining whether a spreadsheet (described as shown in the example above) has a cyclic dependency that would cause it to be un-evaluatable ... which is a word I just made up.