Class 40: Extensible Arrays


Reading
Section 17.4 of Introduction to Algorithms.

Homework
None.

Extensible (Dynamic) Tables
Arrays are great things, because they allow you to access any piece of information in the array (via its index) in Θ(1) time. This isn't true of, for example, linked lists or binary trees. However, one of the big hassles with using arrays is having to decide on the maximum array size when the array is first created. We're going to consider implementing arrays that can grow to fit more and more data. You've already seen such a thing in the STL vector class.

First of all, why would an extensible array be interesting? Here's a simple example: Write a program that reads in a list of numbers (terminated with a ";") from the user and prints them in sorted order. You'd like to store the numbers in an array to sort, but you won't know how big the array needs to be until the user stops entering numbers. But at that point it's too late, since all the numbers you wanted to store have already gone by. An extensible array would keep growing as we added more and more numbers.

Our extensible array will work like the STL vector. It supports two operations: element access using [], and push_back to add a new element at the end of the array. Like the STL vector, we'll provide a size() function to allow the user to determine how many elements are in the array.

Naive Extensible Table
The obvious way to implement our extensible array is to simply resize the array to fit a new element whenever "push_back" is used. Here's a simple implementation of an extensible array class for int's. Being very fashionable, I'm calling the class eArray.
class eArray
{
  int *A;   // the underlying storage
  int n;    // number of elements currently in the array
public:
  eArray()               { A = new int[1]; n = 0; }
  int& operator[](int k) { return A[k]; }
  int size()             { return n; }
  void push_back(int x)
  {
    int *Ap = A;          // hang on to the old array
    A = new int[n+1];     // allocate an array one element bigger
    for(int i = 0; i < n; i++)
      A[i] = Ap[i];       // copy everything over
    delete [] Ap;
    A[n++] = x;
  }
};
It's pretty clear that a single "push_back" could take Θ(n) time. How about a sequence of k operations on an eArray: how long could it take in the worst case?

Well, the best case is when no operation is a push_back, and the worst case is when every operation is a push_back. Suppose we start with n=0 and perform k push_back's. We'll have to allocate and copy arrays of size 1, 2, ..., k. So we'll do

Θ(1) + Θ(2) + Θ(3) + ... + Θ(k) = Θ(k^2)
steps. We'd say then that in the worst case, the average cost per operation in a sequence of k operations (the amortized cost) is Θ(k). So, for example, consider the following stupid program, which constructs an eArray of int's from 0 up to n, and then computes their sum.
int main()
{
  int n;
  cin >> n;
  eArray X;
  for(int i = 0; i <= n; i++)
    X.push_back(i);
  int sum = 0;
  for(int j = n; j >= 0; j--)
    sum += X[j];
  cout << sum << endl;
}
This algorithm takes Θ(n^2) time!

Is there a better way?
Suppose that instead of making the array just enough bigger to fit the new element, we doubled the array size. That way we could add many more elements before we had to resize again. Would this make a difference?
class eArray
{
  int n,c;   // n = number of elements, c = capacity of A
  int *A;
public:
  eArray() { n = 0; c = 1; A = new int[1]; }
  int& operator[](int i) { return A[i]; }
  int size() { return n; }
  void push_back(int x)
  {
    if (n == c)   // full: double the capacity before inserting
    {
      c = 2*c;
      int *B = new int[c];
      for(int i = 0; i < n; ++i)
        B[i] = A[i];
      delete [] A;
      A = B;
    }
    A[n] = x;
    ++n;
  }
};

Now let's try an amortized analysis again. We'll determine the worst case time requirement for a sequence of k operations, and then divide that by k to get the amortized cost per operation. Once again, suppose the array starts off empty. The worst case is still that all k operations are push_back's. The array will ultimately have to be at least k elements long, but won't be more than 2*k elements long. Let d = ⌈log₂ k⌉. The capacity starts at 1 and doubles with each resize: 2, then 4, then 8, ..., up to 2^d. Resizing to length r takes Θ(r) work, since we just copy. Thus, we end up doing Θ of

    2^0 + 2^1 + 2^2 + 2^3 + ... + 2^d = 2^(d+1) - 1 = Θ(k)

work in copying. Of course we do Θ(k) work in assigning values (with or without copying), but the total work is still Θ(k). Thus, the amortized cost is Θ(1), just like assigning to an array without worrying about resizing!

Check out Notes[3] in SGI STL vector documentation and you should recognize that the STL vector is doing just what we're talking about!

Amortized Analysis of Extensible Array
Let n be the number of elements in the Extensible Array, and let s be
the capacity of the array.  At all times, n ≤ s (of course), and our
doubling scheme ensures that n ≥ s/2 once the array is nonempty, i.e.
that we're always using at least half the allocated memory.

Operation       Actual Cost               Amortized Cost
--------------------------------------------------------
push_back       1 CU      if n < s        3 CU's in either case
                n + 1 CU's if n = s

We want to prove that the "balance" B in our computational account 
is always greater than or equal to zero, since this means our
amortized payments always suffice to cover our actual costs.  We will
do this by proving that B ≥ 2*n - s after every operation, where B is the
balance, n is the number of elt's, and s is the capacity of the
allocated memory.

1)  After the first operation we've paid 3 CUs, been charged 1 CU, B =  2, 
    n = 1, and s = 1.  Thus, B ≥ 2*n - s
   
2)  Assume B ≥ 2*n - s and we perform an operation resulting in new balance 
    B', new size n', and new capacity s'.  We'll show that regardless
    of which case we're in B' ≥ 2*n' - s', i.e. that we always have
    enough of a balance to cover our computation expenses.

    CASE 1: n < s
    -------------
            B ≥ 2*n - s
    B + 3 - 1 ≥ 2*n - s + 3 - 1
           B' ≥ 2*n - s + 2
           B' ≥ 2*n + 2 - s
           B' ≥ 2*(n + 1) - s
           B' ≥ 2*n' - s'

    CASE 2: n = s
    -------------
                  B ≥ 2*n - s
    B + 3 - (n + 1) ≥ 2*n - s + 3 - (n + 1)
                 B' ≥ 2*n - s + 2 - n
                 B' ≥ 2*n + 2 - s - n
                 B' ≥ 2*(n + 1) - s - n
                 B' ≥ 2*(n + 1) - s - s       (since n = s)
                 B' ≥ 2*n' - s'               (since n' = n + 1 and s' = 2*s)
This shows that the amortized cost of a push_back in our extensible array is constant.

Review: What does amortized analysis really mean?
What is the difference between saying that an operation runs in worst case time Θ(n) and saying that an operation runs in amortized time Θ(n)?
Truly Dynamic Arrays - add a "remove_back" operation!
In class we talked about why the ability to shrink the array with a "remove_back" operation might be desirable. But how can we implement it efficiently? "Efficiently" means that we'd like to keep the Θ(1) amortized time.

You might be tempted to say that whenever my array becomes less than half full, I should shrink the capacity by half. It isn't too hard to see that this could lead to every other operation being an expensive resizing operation, so this approach doesn't work too well.

The approach we'll take is to ensure that the capacity c and the number of elements n always satisfy c/4 < n ≤ c. That way, we'll never be using more than 4 times the required memory.

This time we'll charge an amortized cost of 4 CU's for each operation (both push_back and remove_back). We'll show that: If n is the number of elements in the array, c is the capacity of the array, and B is our "balance" (the number of CU's to our credit), then B ≥ n + 2|n - c/2| after every operation. If we can prove that always holds, it'll mean that we never have a negative bank balance, i.e. we can always pay for our computation!

The argument for push_back goes essentially as before, so it only remains to prove this for the remove_back operation.

B = balance before operation,   B' = balance after operation
n = # of elt's in array before, n' = # of elt's in array after
c = capacity of array before    c' = capacity of array after

CASE 1: No resize, c = c',  (actual cost is 1)

B' = B + 4 - 1 ≥ n + 2|n - c/2| + 3
               ≥ (n-1) + 2|(n-1) - c/2|    (since |(n-1) - c/2| ≤ |n - c/2| + 1)
               =   n'  + 2|n' - c'/2|

CASE 2: Resize required, i.e. n = c/4 + 1 & n' = c'/2
        (actual cost is n-1)

B' = B + 4 - (n-1) ≥ n + 2|n - c/2| + 5 - n
                   = 2|n - 2*(n-1)| + 5     (since c = 4*(n-1))
                   = 2*(n - 2) + 5          (since n ≥ 2)
                   = 2*n + 1
                   ≥ (n-1) + 2|0|
                   = (n-1) + 2|n' - c'/2|   (since n' = c'/2)
                   =   n'  + 2|n' - c'/2|


Christopher W Brown
Last modified: Tue Apr 27 13:03:27 EDT 2004