Reading

These notes. Also closely review Unit 1 sections 10 through 13.

Homework

Revisiting algorithm findFirstMissing

Let's revisit the "findFirstMissing" algorithm from quiz problem Q2, which we discussed in detail in Class 4.
Important note: for this discussion we will assume that duplicates are not allowed in B!
Here's an observation that will allow us to make the algorithm faster. If we're at the while loop test condition and k ≠ NOTFOUND, then we know we found B[j] at position k in A. Therefore, there's no reason to look for B[j+1], which is bigger than B[j], in A[0],A[1],...,A[k]. This allows us to do our next search over indices $k+1$ to $n-1$, rather than over $0$ to $n-1$. That's got to be an improvement, right?
Algorithm: findFirstMissing v2.0
Input: $A$ and $B$, both sorted arrays of $n$ ints. Note: assume $B$ has no duplicates.
Output: $j$, the least index such that $B[j]$ is not an element of $A$, or 'NOT FOUND' if no such element of $B$ exists
j = -1, k = -1
while k ≠ NOTFOUND and j < n-1 do
  j = j + 1
  k = search(A,k+1,n-1,B[j])
if k = NOTFOUND
  return j
else                         
  return NOTFOUND
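
To make the pseudocode concrete, here is a minimal Java sketch (the class and method names, and the NOT_FOUND = -1 convention, are my own choices, not from the notes). It uses a plain linear search for the "search" step; a binary search over the same range could be dropped in instead. The loop is rearranged slightly (a "from" index instead of starting from k = -1) so the starting value can't collide with the sentinel, but it does exactly what the pseudocode does.

public class FindFirstMissing {
    static final int NOT_FOUND = -1;   // sentinel for "not found" (assumed convention)

    // Plain linear search for x in A[lo..hi]; returns its index, or NOT_FOUND.
    static int linearSearch(int[] A, int lo, int hi, int x) {
        for (int i = lo; i <= hi; i++) {
            if (A[i] == x) return i;
        }
        return NOT_FOUND;
    }

    // findFirstMissing v2.0: least index j such that B[j] is not in A,
    // or NOT_FOUND if every element of B occurs in A.
    // Assumes A and B are sorted, have the same length, and B has no duplicates.
    static int findFirstMissingV2(int[] A, int[] B) {
        int n = A.length;
        int from = 0;                           // next search starts here (plays the role of k+1)
        for (int j = 0; j < n; j++) {
            int k = linearSearch(A, from, n - 1, B[j]);
            if (k == NOT_FOUND) return j;       // B[j] is the first element of B missing from A
            from = k + 1;                       // never look at A[0..k] again
        }
        return NOT_FOUND;                       // all of B was found in A
    }
}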

So, first let's assume the improved findFirstMissing uses binary search. Is the algorithm's worst case running time better? This is an interesting question. First of all, it will undoubtedly run faster in the worst case (indeed, in almost all cases!) than the original algorithm. On the other hand, its worst-case running time is still $\Theta(n \lg n)$ [How could we prove such a thing!?!], so the asymptotic growth rate is the same. What does this mean?! Remember, the actual worst-case running time functions both look like a constant times $n \lg n$ plus some slower-growing terms. It's just that with our asymptotic analysis, we don't know what those constants are. This "better" version of findFirstMissing with binary search is better in the sense that the mysterious constant (which we don't know) is undoubtedly smaller than it was before. However, it's not better in terms of asymptotic growth rate. Generally, when we reduce the hidden constant, we say we've "tweaked" the algorithm, but we only say we've "improved" it when the growth rate is smaller.
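
Here is one way the $\Theta(n \lg n)$ claim could be argued (just a sketch, and the choice of input is mine, not from the notes). The upper bound is easy: there are at most $n$ iterations, each doing one binary search costing $O(\lg n)$. For a matching lower bound, run the algorithm on $B = A$ (legal, since $B$ has no duplicates): every search succeeds, so the loop runs all $n$ times, and in the $t$th iteration the binary search is over $A[t-1..n-1]$, a range of $n-t+1$ elements whose target sits at the far left end, so it takes on the order of $\lg(n-t+1)$ probes to find it. Summing over the iterations gives, for some constant $c$, \[ T_w(n) \;\ge\; c \sum_{t=1}^{n} \lg(n-t+1) \;=\; c \lg(n!) \;\in\; \Omega(n \lg n), \] and together with the $O(n \lg n)$ upper bound that pins the worst case at $\Theta(n \lg n)$.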

So now let's see whether the new findFirstMissing with linear search is an improvement. When we do the analysis we get ... a gap between our upper and lower bounds on the worst-case running time. We get $T_w(n) \in O(n^2)$ and $T_w(n) \in \Omega(n)$. So we don't really know what's going on. It might be a real "improvement", or it might simply be "tweaking", and our analysis is just not good enough to know which is the case!
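
For concreteness, here is the straightforward accounting behind those two bounds (my reconstruction, with $c$ and $c'$ standing for unknown constants): the loop runs at most $n$ times and a single linear search never looks at more than $n$ elements, while an input such as $B = A$ forces all $n$ iterations, each costing at least some constant. That gives \[ T_w(n) \;\le\; n \cdot c\,n \;\in\; O(n^2) \qquad\text{and}\qquad T_w(n) \;\ge\; n \cdot c' \;\in\; \Omega(n). \]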

Other measures of difficulty and an even better findFirstMissing

In analyzing linear search, we took the number of elements in the range we're searching, i.e. $n$, to be the measure of the problem's "size" or "difficulty". It's fair to consider other measures of size/difficulty as well, and see whether analysis with regard to them is more enlightening. Let's use $m$ to denote the number of elements from the start of the range to the first occurrence of something greater than or equal to $x$ (or to the end of the range, if there is no such element). With this measure, the worst and best case running times for linearSearch are both $\Theta(m)$.
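
Concretely, a linearSearch whose running time is $\Theta(m)$ in this sense looks something like the following (a guess at the course's version, written as a drop-in replacement for the linearSearch in the earlier Java sketch, reusing its NOT_FOUND constant). The point is simply that on a sorted range it can stop at the first element that is $\ge x$.

// Linear search in the sorted range A[lo..hi], stopping at the first element >= x,
// so the number of elements examined is exactly m (as defined above).
static int linearSearchSorted(int[] A, int lo, int hi, int x) {
    for (int i = lo; i <= hi; i++) {
        if (A[i] >= x) {
            return (A[i] == x) ? i : NOT_FOUND;  // first element >= x settles the question
        }
    }
    return NOT_FOUND;  // scanned the whole range; everything was < x
}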

With this new view of linearSearch's running time, let's revisit our analysis of the worst case running time for the new findFirstMissing. Instead of accounting for the while loop by multiplying the number of iterations by an upper (or lower) bound on the cost of executing the loop body, let's "unroll" the loop and sum up the costs of each and every iteration. The key observation is that the cost of the linearSearch in the $t$th iteration through the while-loop is $\Theta(m)$, where $m$ is as defined above. This means there is some constant $c$ such that $c m$ is an upper bound on the cost of that search. Let's use $m_t$ to denote that $m$-value for the $t$th iteration. Then an upper bound on the cost of all the loop iterations (let's say there are $r$ of them) is: \[ c m_1 + c m_2 + \cdots + c m_r = c(m_1 + m_2 + \cdots + m_r). \] But since each search starts just past the last element the previous search examined, no element of $A$ is examined more than once, so the sum $m_1 + m_2 + \cdots + m_r$ must be at most $n$, the size of the array $A$. So $c n$ is an upper bound on the cost of all the loop iterations together, which means the algorithm is $O(n)$. Since we already knew the worst case was $\Omega(n)$, we see that the worst case running time is $\Theta(n)$.

This is pretty cool: the improved findFirstMissing is better with linearSearch than with binarySearch, even though binarySearch is better than linearSearch on its own! That's why we do analysis: to discover such things!
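
To see the telescoping on a small made-up example: take $A = [2, 4, 6, 8, 10]$ and $B = [2, 6, 7, 9, 11]$. The first search scans $A[0]$ and finds $2$, so $m_1 = 1$; the second scans $A[1]$ and $A[2]$ and finds $6$, so $m_2 = 2$; the third starts at $A[3] = 8 \ge 7$ and immediately reports NOTFOUND, so $m_3 = 1$ and the algorithm returns $j = 2$. Altogether the searches examined $1 + 2 + 1 = 4 \le n = 5$ elements of $A$, since no element is ever examined twice.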