Abstract Data Types and Data Structures

  1. Allegedly, this course is about Data Structures. That is not exactly true- it is about Abstract Data Types and their implementation with Data Structures. Someday I'll get around to changing the title of the course.
  2. So lets define what we're talking about, starting with Abstract Data Types (ADTs). What is an ADT?
    1. Type- what is a type? Roughly speaking, anything we can declare a variable as is a type. In Java, that means int, Integer, char, float, are all types, but so are any classes that you've defined.
    2. Data- well, any variable is 'data', so strictly speaking, this word is redundant, but data usually means lots of information that is stored in a computer. ADTs are usually concerned with collections of values rather than simple base-types like char or boolean.
    3. Abstract - wow, that word is a doozy. Merriam-Webster says: 1 a: disassociated from any specific instance <an abstract entity>, 2: expressing a quality apart from an object <the word poem is concrete, poetry is abstract>. For our purposes here, this is sufficient. The point is that it's not an actual data type, but a description of one. That is, it describes how to use the data, but not how the actual storage is implemented.
    4. Sound familiar? It should- you've already learned about the separation of implementation and interface in Object Oriented programming. From that perspective, an ADT is a formally defined interface.
    5. What this class is really about is a set of commonly used ADTs that the trade-offs in implementing them various ways.
  3. A Data Structure is the concrete implementation of the ADT. At the most basic, Data Structures are variables, arrays, linked lists and other similar linked structures, plus the functions that make them behave in the manner necessary to implement the ADT.
  4. It should be noted in the interest of honesty that sometimes you can implement an ADT with another ADT, as long as that second ADT is implemented by a data structure. (theoretically, you could construct an arbitrarily long chain of ADTs implementing ADTs, as long as there is a real data structure at the bottom.
  5. The ADTs that may be mentioned in this class are:
    1. List
    2. Stack
    3. Queue
    4. Tree
    5. Priority Queue
    6. Graph
    7. Map
  6. In that set of ADTs there are really two kinds. There are containers and there are relational data types.

Lists

Stacks

  1. We'll next talk about the ADT Stack.
  2. The two primary operations on a Stack are
  3. and sometimes
  4. How do they play out? This as if you literally made a stack (as in pile) of the stuff. As you add new stuff, it goes on the top of the pile, if you take away from it, the item must come from the top.
  5. The following table shows operations on a stack of characters, with the top to the right:
    push('a');'a'
    push('b');'a','b'
    push('c');'a','b','c'
    push('d');'a','b','c','d'
    pop();'a','b','c'
    pop();'a','b'
    push('e');'a','b','e'
    push('f');'a','b','e','f'
    pop();'a','b','e'
    pop();'a','b'
    pop();'a'
    pop(); 
  6. Stack are good for a few things:
    1. Reversing stuff. We we have a bunch of things in some order and we want to reverse them. Handy for finding palindromes:
      def is_palindrome(c):
          l = len(c)
          s = []
          for i in range(l // 2):
              s.push(c[i])
          
          k = -(-l // 2)  # Ceiling division
          for j in range(k,l):
              if s.pop() != c[j]:
                  return False
          return True
      
    2. Going back the way we came. Hansel and Gretel's breadcrumbs are like a stack. When finding the way back home, you look at them in the reverse order.
      1. Maze solving. At each junction, push it on the stack. When you hit a dead end, pop back to last junction, go in new direction. If there is no new direction, pop again.
      2. Call stacks. Each time a function call is made, push it on the call stack. When you return, pop off the stack. The function you return to is now on the top of the stack.
    3. Matching things like parentheses. Really just a special case of the palindrome example. Each time you see an open paren, push something on the stack. Each time you see a close, pop the stack. If you pop an empty stack, there are too many close parens. If you get to the end of the input, there are too many open parens.
    4. Computation of stack-like languages. This is a big topic, but we can look at a 'simple' one, reverse polish notation, or RPN.
  7. RPN
    1. RPN is a way or representing arithmetic expressions.
    2. Instead of placing the operator between two sub-expressions: 3+2 or (3 + 2) * (4 - 1), the operator is place after the sub-sexpressions: 3 2 + or 3 2 + 4 1 - *.
    3. This has the propert that it is never ambiguous and does not require parentheses. To the expression 3 + 2 * 4 - 1, a mathematician might say it's undefined or ambiguous. A computer scientist might say it measn (3 + (2 * 4)) - 1. Either way, there is no way of writing (3 + 2) * (4 - 1) without parentheses.
    4. The RPN version of (3 + (2 * 4)) - 1 would be 3 2 4 * + 1 -.
    5. The algorithm for evaluating an RPN expression is, of course, stack based. It is also recursive. The basic idea is that when there is an operator on the top of the stack, the next two things on the stack are either numbers, or other expressions. This suggests the following evaluation function:
      def RPNEval(s) #s is a stack
         n = s.pop()
         if (isinstance(n,(int,float))
            return n
         else //n is operator
            right = RPNEval(s)
            left = RPNEval(s)
            return n(left,right)
      
    6. We can run through and see that this works, but not enough room in these notes for that
    7. There is also a very famous algorithm due to Edsger Dijkstra that converts infix notation to RPN. (Edsger will come up a lot in this course, so here's a picture:)
    8. The algorithm is called the Shunting-yard algorithm because it resembles a railroad shunting yard. In a Shunting yard, some cars on a train are shunted off to a short bit of dead-end track, which operates much like a stack, in order to change the order of the cars.
    9. In this algorithm, the terms of the expression are processed one at a time (like rail cars), but the operators are shunted off onto the stack to be later popped and added back into the "train".
      //for simplicity, we will assume all operators are left associative.
      Stack shuntingYard(inputExpression)
         while (there still are items in inputExpression)
            n ← remove next item from inputExpression
            if (n is number)
               output.push(n)
            else if (n is operator)
               while (precidence of n is ≤ shunt.top())
                  output.push(shunt.pop())
               shunt.push(n)
            else if (n is left paren)
               shunt.push(n)
            else //n is right paren
            o ← shunt.pop()
            while (o ≠ left paren)
               output.push(o)
               o ← stack.pop()
         while (!shunt.isEmpty)
            output.push(shunt.pop())
         return output
      
  8. So, how can we implement these things? Well, we still only know 3 data structures: variables, linked-lists, and arrays.
    1. plain variables are right out. They can't hold more than one thing elegantly.
    2. But both linked lists and arrays could work.
    3. For linked lists we could use a singly linked list, and make push() be insert at the head of the list, and pop() be remove the head of the list.
    4. For arrays, we could keep track of where the next empty spot in the array is, and add and remove there. Just as with Lists, we make have to "resize" the array, but the same amortization argument holds.
    5. BUT, we could also try an implementation by another ADT. In this case we only know of one, the List.
    6. We could use a linked-list implementation of a list, and make push(e) be add(0,e) and pop() be remove(0).
    7. Or we could use an array implementation of List, and make push(e) be add(end,e) and pop() be remove(end-1,e).
  9. What are the ramifications of these implementation choices?
    1. For a linked list, push and pop are both O(1).
    2. For an array, push and pop are both O(1). Except when the array fills. See below.
    3. For a List, if we chose the correct strategy based on the List implementation, then push and pop are both O(1).
    4. This has been our first example of implemented an ADT with another ADT, and our first example of knowing how to use an ADT properly depends on understanding how it was implemented.
    5. In conclusion, for stacks, it doesn't matter which technique we use, as long as we use it right.
  10. What if the array fills? Don't we have to copy to a larger array? Wouldn't that make array worst case time O(n)? Sort of, but the best way to think of this is through amortized analysis.
    1. Amortization is an accounting technique of spreading cost of occasional pruchases across many years. Let's say your business buys a $100,000 machine every 10 years, how would you run your books? You could make money 9 out of 10 years, and lose big the 10th year, or you could amortize the cost, saying the the machine costs you $10,000 per year. We can do the same thing with run times.
    2. take the following table, starting out with an array of size 1, and doubling each time you need more space:
      sizecost for each push
      11 2 (second push requires a copy, then a push)
      23 0 (amortize some cost earlier)
      23 0 3 (push requires 2 copies plus push)
      43 3 0 (amortize)
      43 3 0 1 (simple push)
      43 3 0 1 5 (copies + push)
      83 3 3 3 0 (amortize)
      83 3 3 3 0 1 1 1 9
      163 3 3 3 3 3 3 3 0 (amortize)
      163 3 3 3 3 3 3 3 0 1 1 1 1 1 1 1 17
      323 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0
      323 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 33
      643 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0
      As you can see, the amortized cost of each push never grows beyond 3, and so we can still say the cost of a push, even with resizing the array, is O(1).

Queues

  1. Next ADT is the Queue
  2. The two primary operations on a Queue are
  3. and sometimes
  4. The basic idea is first in, first out; each iten gets its turn in the order it arrived.
  5. Queues are fundamentally less interesting (to me!) than a stack because we are much more familiar with them. Everything we do that involves waiting in line is a queue (in fact the British word for line is queue).
  6. The following table shows operations on a stack of characters, with the top to the right: >
    enqueue('a');'a'
    enqueue('b');'a','b'
    enqueue('c');'a','b','c'
    enqueue('d');'a','b','c','d'
    dequeue();'b','c','d'
    dequeue();'c','d'
    enqueue('e');'c','d','e'
    enqueue('f');'c','d','e','f'
    dequeue();'d','e','f'
    dequeue();'e','f'
    dequeue();'f'
    dequeue(); 
  7. Queues on their own are pretty much good for one thing: Keeping track of the order of something.
    1. For instance in simulations, you might keep track of the order events are occurring.
    2. Or maybe in solving a problem, you come up with multiple sub-problems to solve. You might put each in a queue so they get solved in the order they came up.
  8. Sometimes queues can be interesting when used in conjunction with stacks.
    1. Given a queue containing 123456 and an empty stack, and allowing yourself to only dequeue and push the result on the stack, or pop from the stack and enqueue the result, can you rearrange the numbers in the queue to some order, such as 325641? Can you do this for any order? 154623?
    2. What if instead of inserting numbers back into the queue when popping, we just print them out? Can we print them out in any order?
  9. So, how can we implement these things? Well, we still only know 3 data structures: variables, linked-lists, and arrays.
    1. Again, plain variables are right out. They can`t hold more than one thing elegantly.
    2. Again, both linked lists and arrays could work.
    3. For linked lists we could use a singly linked list, and make enqueue() be insert at the tail of the list, and dequeue() be remove the head of the list.
    4. For arrays, we may have to "resize" the array, but the same amortization argument holds.
    5. We we have a new problem with arrays: as we enqueue and dequeue, the contents in the array "walk" to the right:
      enqueue('a');
      a       

      enqueue('b');
      ab      

      enqueue('c');
      abc     

      dequeue('c');
       bc     

      • What happends when the contents reach the right side? We ran out of room on the right, but have plenty of space. We could wrap it around. Keep an index for both the front and the back. If they are equal, the queue is empty. When something is added, place it at the location indicated by rear, and incrementing it, mod the array size:
        def enqueue(i):
          A[r] = i
          r = (r+1)%s
          if ((r = f)
            #expand array
        
        `
      • When we dequeue, increment front, mod the array size
        dequeue(i)
          if (f is r) raise Exception("Empty queue")
          tmp = A[f]
          f = (f+1)%s
          return tmp
        
    6. We could also try an implementation by Lists.
    7. We could use a linked-list implementation of a list, and make enqueue(e) be add(0,e) and dequeue() be remove(end).
    8. Or we could use an array implementation of List, but we would still have to deal with the walking problem, so it`s probably not worth it.
  10. What are the ramifications of these implementation choices?
    1. For a linked list, enqueue and dequeue are both O(1).
    2. For an array, enqueue and dequeue are both O(1). Except when the array fills, but still O(1), amortized.
    3. For a List, if we chose the linked list implementation, then enqueue and dequeue are both O(1).
    4. Just like for stacks, it doesn`t matter which technique we use, as long as we use it right.