These notes closely review Unit 1, sections 10 through 13.



Analyzing a less simple iterative algorithm: findFirstMissing with binarySearch

linearSearch: $T_w(n) = \Theta(n)$, $T_b(n) = \Theta(1)$
binarySearch: $T_w(n) = \Theta(\lg n)$, $T_b(n) = \Theta(\lg n)$
gallopSearch: $T_w(n) = \Theta(\lg n)$, $T_b(n) = \Theta(1)$
Dr. Roche, in the Unit 1 notes, kindly analyzes our three search algorithms so painstakingly that he gives us exact expressions for the worst-case running time in terms of the number of statements executed. From those functions, we can deduce $O$, $\Theta$ and $\Omega$ expressions (as seen above). However, this is most decidedly not the way we normally do things! If we know the exact expressions, there's not much reason to bother with $O$, $\Theta$ and $\Omega$. The usual case is that it's too difficult to get exact expressions, and we "settle" for getting information about asymptotic growth rates. So let's try analyzing findFirstMissing the usual way: we'll get what info we can about the worst-case running time function's growth rate, in terms of $O$/$\Theta$/$\Omega$, without ever trying to get an exact expression for the function.
Algorithm: findFirstMissing
Input: $A$ and $B$, both sorted arrays of $n$ ints
Output: $j$, the least index such that $B[j]$ is not an element of $A$, or 'NOT FOUND' if no such element of $B$ exists
j = -1, k = 0     ← initialize k to anything other than NOTFOUND
while k ≠ NOTFOUND and j < n-1 do
  j = j + 1
  k = binarySearch(A,0,n-1,B[j])
if k == NOTFOUND
  return j
return NOTFOUND
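As a concrete sketch, here is one way findFirstMissing with binarySearch might look in Python. The function names and the use of `None` as the NOT FOUND sentinel are our choices, not fixed by the notes:

```python
NOT_FOUND = None  # sentinel for "not in the array"; our choice of representation

def binary_search(A, lo, hi, key):
    """Standard binary search on A[lo..hi]; returns an index or NOT_FOUND."""
    while lo <= hi:
        mid = (lo + hi) // 2
        if A[mid] == key:
            return mid
        elif A[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return NOT_FOUND

def find_first_missing(A, B):
    """Least index j such that B[j] is not an element of A, else NOT_FOUND."""
    n = len(A)
    j, k = -1, 0  # k starts as any non-sentinel value
    while k is not NOT_FOUND and j < n - 1:
        j += 1
        k = binary_search(A, 0, n - 1, B[j])
    if k is NOT_FOUND:
        return j
    return NOT_FOUND
```

Using `None` (rather than, say, -1) for NOT_FOUND avoids any clash with 0, a legitimate index that binary_search can return.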
We can't go through the outer loop more than $n$ times, and binary search's worst case is $\Theta(\lg n)$, so we have $O(\lg n)$ as an upper bound on the binarySearch call inside the loop. This gives $T_w(n) \in O(n \lg n)$.

To get a lower bound on $T_w(n)$, we need to look at a concrete input. We'll look at the case in which $A$ and $B$ are equal. In this case, we go through the outer loop $n$ times and, since binarySearch's best case is $\Theta(\lg n)$, we get an $\Omega(\lg n)$ lower bound on the binarySearch call inside the loop, giving a lower bound of $\Omega(n \lg n)$ for this case. The worst case must be at least this bad, so $T_w(n) \in \Omega(n \lg n)$.

Putting these together gives $T_w(n) \in \Theta(n\lg n)$.

What happens when we replace binarySearch with linearSearch?

Let's replace binarySearch with linearSearch in findFirstMissing. The interesting thing here is that we can't just say the call to linearSearch takes time $\Theta(n)$. We would first have to prove that it's actually possible, in the context of the findFirstMissing algorithm, for every call to linearSearch to be "worst case" for linearSearch, despite the fact that each call searches for a different $B[j]$. In fact, if we don't allow duplicate entries in array $B$, it is not possible for more than two calls to linearSearch to take $n$ steps. If you look at an intermediate iteration of findFirstMissing's while loop (i.e. not the last), the time taken by search(A,0,n-1,B[j]) is strictly greater than the time taken by search(A,0,n-1,B[t]) for any t < j. So we can't assume that search takes $\Omega(n)$ for every iteration. So what do we do?

The first thing we did was to punt, and count the call to search as taking $\Omega(1)$ time, which is correct but not very precise. That gave us $T_w(n) \in \Omega(n)$ as well as $T_w(n) \in O(n^2)$.

The second thing we did was to analyze a very specific input situation, get a lower bound for that case, and note that the worst case had to be at least that bad, so the lower bound was also a lower bound on the worst case. I suggested the input situation in which the first half of $B$ is identical to the second half of $A$, and the index-$n/2$ element of $B$ is not in $A$. In this case, we go through the while-loop $n/2$ times, and each call to linearSearch takes $\Omega(n/2)$ time, which gives us $\Omega(n^2/4) = \Omega(n^2)$ as a lower bound. Since $T_w(n)$ is $O(n^2)$ and $\Omega(n^2)$, it is $\Theta(n^2)$.
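To see this lower bound concretely, here is a small step-counting experiment in Python (the names and the comparison-counting setup are ours, and we assume $n$ even for simplicity). It builds the bad input family just described and totals the number of elements linear search examines; doubling $n$ roughly quadruples the total, as $n^2/4$ predicts:

```python
def linear_search_steps(A, key):
    """Linear search, returning the number of elements examined."""
    for steps, x in enumerate(A, start=1):
        if x == key:
            return steps
    return len(A)  # scanned everything and missed

def adversarial_steps(n):
    """Total linear-search work for the bad-input family (n even):
    the first half of B equals the second half of A, and B[n/2] is a
    value absent from A, so the outer loop stops at that index."""
    A = list(range(n))                # sorted, distinct
    B = A[n // 2:] + [n] * (n // 2)   # B[0..n/2-1] = A[n/2..n-1]; B[n/2] = n, not in A
    total = 0
    for j in range(n // 2 + 1):       # the outer loop runs n/2 + 1 times
        total += linear_search_steps(A, B[j])
    return total
```

For example, `adversarial_steps(200) / adversarial_steps(100)` comes out just under 4, matching quadratic growth.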

What happens when we replace binarySearch with gallopSearch?

What if we use gallopSearch in findFirstMissing? What can we say about $T_w(n)$? We pretty easily arrive at $T_w(n) \in O(n\lg n)$ and $T_w(n) \in \Omega(n)$, since gallopSearch's worst case gives an upper bound on the calls in the loop and gallopSearch's best case gives a lower bound on the calls in the loop. However, it's hard to figure out how to squeeze the upper and lower bounds closer to one another. We opt not to try, because we have a better version of findFirstMissing coming up anyway!
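For reference, here is a Python sketch of galloping search, assuming the standard strategy of doubling the probe index until the key is bracketed and then binary searching that range (the name and details are our reconstruction, not copied from the notes):

```python
def gallop_search(A, key):
    """Galloping (exponential) search on a sorted list A.
    Probe indices 1, 2, 4, 8, ... until A[i] >= key or we run off the
    end, then binary-search the bracketed range.  Best case Theta(1),
    worst case Theta(lg n)."""
    n = len(A)
    if n == 0:
        return None
    if A[0] == key:      # the Theta(1) best case
        return 0
    i = 1
    while i < n and A[i] < key:
        i *= 2
    lo, hi = i // 2, min(i, n - 1)
    # standard binary search on A[lo..hi]
    while lo <= hi:
        mid = (lo + hi) // 2
        if A[mid] == key:
            return mid
        elif A[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None
```

Since at most $\lg n$ doublings happen before the range is bracketed, and the range has length at most $i/2 \le n$, both phases are $O(\lg n)$.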

Using analysis to improve algorithm uniqueDest

Recall our uniqueDest(P,n,s) algorithm.
Algorithm: uniqueDest(P,n,s)
Inputs: P,n,s --- an input instance of the Unique Destination problem
Output: TRUE/FALSE a solution to the Unique Destination problem

next = s, count = 1

while count == 1 do
  curr = next
  count = 0, i = 0
  while i < n do     ← this loop counts the number of children of curr and sets next to the most recently seen child
    if P[i] == curr
      count = count + 1
      next = i
    i = i + 1

if count == 0
   return TRUE
return FALSE
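A direct Python transcription of this pseudocode might look like the following (the function name is ours; the convention that P[i] is the parent of node i, with P[i] == -1 for a root, is the one from the Unique Destination problem):

```python
def unique_dest(P, n, s):
    """Follow only-children starting from s; True iff we reach a node
    with no children.  Direct transcription of the pseudocode above."""
    next_node, count = s, 1
    while count == 1:
        curr = next_node
        count = 0
        for i in range(n):     # count curr's children; remember the last one seen
            if P[i] == curr:
                count += 1
                next_node = i
    return count == 0
```

For example, the "linked list" input `P = [-1, 0, 1, 2]` with `s = 0` (each node's only child is the next node) returns True, while giving node 2 a second child makes it False.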
Recall our analysis of the algorithm uniqueDest. We noted that an input $P$ representing a "linked list", with $s$ the head of the list, would give us $n$ times through the outer loop and, of course, we always go through the inner while loop $n$ times. So that gave us $T_w(n) \in \Omega(n^2)$. Moreover, our little improvement from that HW didn't make things better for this input.

So if we want to improve its worst-case running time, what needs to be done better? Hopefully you see that the only way to make it faster is to replace the inner loop with something that provides the same information (i.e. whether "curr" has 0, 1 or multiple children and, if only one, that child's index) but does it faster. There's really no other option. We can do this with a bit of "preprocessing", but at the cost of using extra memory. What we'll do is compute and store the number of children for every node in the forest. It sounds like more work but, in fact, one pass through the array $P$ is all that's required. As we do this, we'll also store, once again for every node in the forest, the index of one of its children.

Algorithm: uniqueDest2.0(P,n,s)
Inputs: P,n,s --- an input instance of the Unique Destination problem
Output: TRUE/FALSE a solution to the Unique Destination problem

count = new array of n elements ← count[i] is the number of children of node i
next  = new array of n elements ← next[i] is one of i's children (if i has them)

for i from 0 to n-1 do ← initialize count entries to zero
  count[i] = 0

for i from 0 to n-1 do ← give count and next their proper values
  if P[i] ≠ -1
    count[P[i]]++ ← P[i] is the parent, i is the child.  So increment P[i]'s count.
    next[P[i]] = i

curr = s
while count[curr] == 1 do
  curr = next[curr]

if count[curr] == 0
   return TRUE
return FALSE
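Here is a Python sketch of uniqueDest2.0 (function and variable names are ours). The single preprocessing pass over $P$ fills in the child counts, after which each step of the walk costs $O(1)$:

```python
def unique_dest2(P, n, s):
    """Linear-time Unique Destination: precompute, for every node, its
    child count and one child's index, then walk from s in O(1) per step."""
    count = [0] * n    # count[i]: number of children of node i
    nxt = [0] * n      # nxt[i]: one of i's children (meaningful only if count[i] > 0)
    for i in range(n):
        if P[i] != -1:
            count[P[i]] += 1   # P[i] is the parent, i the child
            nxt[P[i]] = i
    curr = s
    while count[curr] == 1:
        curr = nxt[curr]
    return count[curr] == 0
```

Note the time-space tradeoff in action: two extra length-$n$ arrays buy the drop from $\Theta(n^2)$ to $\Theta(n)$.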
Hopefully it's totally clear that this algorithm has $T_w(n) \in \Theta(n)$ (and $T_b(n) \in \Theta(n)$ too). So we have a much better worst-case running time! Big Idea: we can reduce the time required, but at the cost of increasing the memory required. In this case, it's a good trade! You'll often hear the phrase "time-space tradeoff" to refer to this kind of thing.

Note: The best-case time for the improved uniqueDestination algorithm from homework 03 was actually $\Theta(1)$, which is better than $T_b(n)$ for uniqueDest2.0. I'm not upset. We almost always worry more about worst-case than we do about best case. Moreover, the reduction from $n^2$ to $n$ is huge. It means that, now, any problem that we have the time/memory to actually read in, we can solve. With an $n^2$ algorithm, really large problems are out of reach.