Maps

  1. A Map is an ADT that holds elements with keys, like a priority queue, but here the key is an identifier: something used to find the element. Literally, something that can be mapped to something else.
  2. This is derived from the term 'mapping' in mathematics, meaning 'function'. That is, a function takes an argument and returns something: it maps the argument to the return value.
    1. For example, take a function you're familiar with, f(x) = x². This function maps (usually) real numbers to their squares. It maps 1 to 1, 2 to 4, 3.2 to 10.24.
    2. But we can do this for any sets. We could take course numbers as arguments and get available section numbers as return values, so f(IC312) = {1001, 1002, 5001, 6001}.
  3. Useful for: telephone directories, dictionaries, caches, IP address lookup, what else? For each of these, what is the key? What is the element?
  4. Operations: insert(k,e), get(k), remove(k).
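  As an aside, Python's built-in dict is exactly this ADT (implemented, as it happens, as a hash table). The course-sections example from above:
    # insert(k,e), get(k), remove(k) on a Python dict
    sections = {}
    sections["IC312"] = [1001, 1002, 5001, 6001]   # insert(k,e)
    print(sections["IC312"])                       # get(k)
    del sections["IC312"]                          # remove(k)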
  5. Basic implementations:
    1. Unsorted array: insertion easy, get and remove hard: O(1), O(n), O(n).
    2. Sorted array: insertion hard, get medium, remove hard: O(n), O(lg n), O(n).
    3. Unsorted linked list: insertion easy, get and remove hard: O(1), O(n), O(n).
    4. Sorted linked list: insertion hard, get and remove hard: O(n), O(n), O(n).
    5. These times look just like they do for priority queues, which makes us think: maybe we can come up with something clever, like a...
  6. Hash table
    1. What if the keys fell in the range 0 to N-1, where N is the size of the array? Then we could do everything in O(1), couldn't we?
    2. Problem 1: the keys aren't always ints.
    3. Problem 2: The range of keys could be too large, requiring a huge array.
    4. But what's good about it? If the key can be converted to an index in constant time, the array lookup can be done in constant time, so insert, get, and remove are all easy: O(1).
    5. Obviously, most of the time the universe of keys is so large that this can't actually work. But what if we map the keys to a smaller set with a hash function h(k), making a many-to-one mapping, where each h(k) is an array index?
    6. Then the array can be made much smaller, but now we have the problem that we might try to put multiple elements in the same array cell: a collision.
    7. Solution: separate chaining. Make each cell of the array be a list of the elements stored there, as in the sketch below.
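    A minimal sketch of such a separately-chained table in Python (the class name and starting capacity are illustrative choices; Python's built-in hash() stands in for h(k)):
      class ChainedHashMap:
        def __init__(self, capacity=11):
          self.table = [[] for _ in range(capacity)]   # one chain per cell

        def _index(self, k):
          return hash(k) % len(self.table)   # h(k), compressed to an array index

        def insert(self, k, e):
          chain = self.table[self._index(k)]
          for pair in chain:
            if pair[0] == k:                 # key already present: replace element
              pair[1] = e
              return
          chain.append([k, e])

        def get(self, k):
          for key, e in self.table[self._index(k)]:
            if key == k:
              return e
          return None

        def remove(self, k):
          chain = self.table[self._index(k)]
          for i, (key, _) in enumerate(chain):
            if key == k:
              return chain.pop(i)[1]
          return None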
  7. Hash functions
    1. How do we make a good hash function? It should ideally scatter the elements evenly, so the chains never get long.
    2. In Java, it is traditional to consider this as two steps: conversion to a number, followed by compression to the right range, usually 0 to N-1, to fit in the array.
    3. Every object in Java has a built-in hashCode() function, but the default is naive: it uses the address of the object. This has the problem that two otherwise-equal objects would have different hash codes. That's why many classes, such as String, override hashCode().
    4. A simple technique for purely numeric data is to just cast it to an int. For a char that works well, but for something like a float it doesn't do what we'd like. Instead, we can treat the bits of the float as if they were an int: Float.floatToIntBits(x).
    5. For things bigger than an int, we can sum up its parts. Take a long: it's twice the size of an int, and just dropping half the bits might cause lots of collisions. Instead we sum the two halves of the long: (int)(f >> 32) + (int) f.
    6. Summing up the characters in a word, however, means that all anagrams of the word get the same code: "NavalAcademy", "ACavemanLady", "AVenalCadYam" and "ALacyMeadVan" all have the same code. Instead, if in a string of length n we label the characters x₀...xₙ₋₁, and pick some integer a, we can make a polynomial hash: x₀aⁿ⁻¹ + x₁aⁿ⁻² + ... + xₙ₋₂a + xₙ₋₁.
    7. Polynomial hashes can be efficiently computed with Horner's rule: xₙ₋₁ + a(xₙ₋₂ + a(xₙ₋₃ + ... + a(x₂ + a(x₁ + ax₀))...)). The book claims that a = 33, 37, 39, or 41 is good for English words.
    8. There are a host of other techniques. The book mentions cyclic shift, where the running hash is bit-shifted between character sums. Because hashes are also useful in security (for determining whether data has been modified), there are a host of complex ones that hopefully have good spreading behavior: MD5, SHA-1, etc.
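    As a sketch, Horner's rule in Python (a = 33 per the book's suggestion; masking to 32 bits is an added assumption, to mimic Java's int overflow):
      def poly_hash(s, a=33):
        # computes x0*a^(n-1) + x1*a^(n-2) + ... + x(n-1) in one left-to-right pass
        h = 0
        for ch in s:
          h = (h * a + ord(ch)) & 0xFFFFFFFF   # keep it in 32 bits
        return h
    Now position matters, so the anagrams above all hash differently.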
  8. Compression functions
    1. Obvious: i mod N. Make sure N is prime. But if we get really unlucky and the keys are all of the form pN + q for different p's, they all land in the same cell and we get collisions.
    2. Works better: take a prime number p > N and two integers a and b with 0 < a < p and 0 ≤ b < p, and use: ((ai + b) mod p) mod N.
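    A sketch of that compression in Python (the prime 2**31 - 1 and the random a and b are arbitrary choices satisfying the constraints):
      import random

      def make_compressor(N):
        p = 2**31 - 1                  # a prime p > N
        a = random.randrange(1, p)     # 0 < a < p
        b = random.randrange(0, p)     # 0 <= b < p
        return lambda i: ((a * i + b) % p) % N

      compress = make_compressor(101)
      index = compress(poly_hash("NavalAcademy"))   # an index in 0..100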
  9. How long does it take? It is still O(1) to map to a cell of the array, but the worst case is that all the elements end up in the same cell, making the operations O(n).
  10. This time, however, let's look at the average case. If the table is sufficiently big, and h(k) is sufficiently uniform (the probability that a key k maps to a given h(k) is evenly spread across all values of h(k)), then how long are the chains? We expect them to be some small constant in length, so the operations are again O(1).
  11. Loading and Rehashing
    1. We talk about the load factor α = n elements/m slots. When α ≥ 1, we're in real trouble. We find in practice that for separate chaining, we need to stop and make a bigger hash table when α > 0.75 (approximately), and definitely by 0.9.
    2. What do we do when the table gets too full? Double it and rehash everything, as sketched below.
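    A sketch of that step, written as an extra method for the ChainedHashMap above, assuming it also tracks its element count in self.n:
      def _maybe_grow(self):
        # assumes the map counts its elements in self.n
        # load factor alpha = n/m; rebuild once it passes roughly 0.75
        if self.n / len(self.table) <= 0.75:
          return
        old = self.table
        self.table = [[] for _ in range(2 * len(old))]   # double the slots
        for chain in old:
          for pair in chain:
            # indexes are hash(k) mod the NEW size, so every element
            # has to be rehashed into its new chain
            self.table[hash(pair[0]) % len(self.table)].append(pair)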

Ordered Maps: Binary Search Trees

  1. Let's talk about a new ADT, or rather an extension of the old ADT, Maps. These extensions are called Ordered Maps.
    1. An ordered map is just like a Map, but it has some additional operations:
    2. first() - returns the element with the smallest key,
    3. last() - returns the element with the largest key,
    4. next(k) - returns the element with the smallest key larger than k, and
    5. prev(k) - returns the element with the largest key smaller than k.
  2. OK, let's look at the naive implementations: Unsorted Array, Sorted Array, Unsorted Linked-List:
                 Unsorted Array   Sorted Array   Unsorted Linked List
    insert(k,e)  O(1)             O(n)           O(1)
    get(k)       O(n)             O(lg n)        O(n)
    remove(k)    O(n)             O(n)           O(n)
    first()      O(n)             O(1)           O(n)
    last()       O(n)             O(1)           O(n)
    next(k)      O(n)             O(lg n)        O(n)
    prev(k)      O(n)             O(lg n)        O(n)
  3. What about hashtables? Their performance is the same as the unsorted structures above.
  4. So hash tables are good implementations of Maps, but they have three weaknesses: they have poor worst-case bounds; they may require resizes, making them poor at real-time tasks; and they're not good at the ordered Map operations.
  5. What we want is a linked structure (for easy inserts/deletes) but that has log(n) finds.
  6. Linked structure.... log(n)....we want trees! Binary Search Trees, to be precise.
  7. A BST is a binary tree where an item (and its key) is stored at each node. It has the following two properties:
    1. ∀ n,m: LeftDescendant(m,n) → key(m) < key(n)
    2. ∀ n,m: RightDescendant(m,n) → key(m) > key(n)
  8. Let's try one with the following set of numbers:
    51 19 99 96 13 15 21 88 71 54 60 55 83 29 72
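    Worked by hand, inserting in that order gives this tree (drawn as an indented outline; an only child is labeled left or right):
      51
      ├── 19
      │   ├── 13
      │   │   └── 15 (right)
      │   └── 21
      │       └── 29 (right)
      └── 99
          └── 96 (left)
              └── 88 (left)
                  └── 71 (left)
                      ├── 54
                      │   └── 60 (right)
                      │       └── 55 (left)
                      └── 83
                          └── 72 (left)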
    def get(k):
      node = search(k, root())
      if node is None:
        return None          # k is not in the tree
      return node.data

    def search(k, v):
      if v is None:
        return v
      if key(v) == k:
        return v
      if k < key(v):
        return search(k, left(v))
      else:
        return search(k, right(v))
    
  9. How long does that take (in terms of the height of the tree)? O(h). More about this later.
  10. OK, what about insert? We find the leaf where it is supposed to go, and then we put it there:
    def insert(key, value):
      root = insert_rec(root, key, value)

    def insert_rec(node, key, value):
      if node is None:
        return Node(key, value)     # found the empty spot: make a new leaf
      elif key < node.key:
        node.left = insert_rec(node.left, key, value)
      elif key > node.key:
        node.right = insert_rec(node.right, key, value)
      else:                         # key == node.key: replace the value
        node.value = value
      return node
    
  11. This also takes O(h).
  12. Deleting is trickier. Our cases are:
    1. We're deleting something with no children. That's easy: we just remove it.
    2. We're deleting something with only one child, such as 88. We just delete it and replace it with the child.
    3. We're deleting something with two children, such as 51. We want to alter the tree as little as possible, right? Could we find a node to swap with 51 that would require no tree restructuring? (E.g., swapping with 71 would not work.) Either 29 or 54.
    4. What if we were deleting 71? Either 60 or 72.
    5. What do these nodes have in common? They are the next highest and next lowest keys in the tree. It doesn't matter which we pick, so let's go with the next highest.
    6. Why does that work? Well, all right descendants are greater than the deleted key, but the left-most of them has nothing less than it in its part of the tree. It is the next key up. Thus swapping with it shouldn't change the rest of the tree.
  13. So what do we do?
      def remove(key):
        root = remove_rec(root, key)

      def remove_rec(node, key):
        # base case: we've walked off the end of the tree, the key is not here
        if node is None:
          return node
        # go left or right as appropriate in the search for the key
        if key < node.key:
          node.left = remove_rec(node.left, key)
        # if the key is larger than this node's key, it's in the right subtree
        elif key > node.key:
          node.right = remove_rec(node.right, key)
        # this is the node to be deleted
        else:
          # if the node has at most one child, splice it out:
          if node.left is None:
            return node.right
          elif node.right is None:
            return node.left
          # two children: swap with next_node...
          else:
            succ = next_node(node)
            node.key = succ.key
            node.value = succ.value
            # ...then go delete the node whose data we just moved up here
            # (notice we changed the key we're looking for)
            node.right = remove_rec(node.right, node.key)
        return node   # hand the (possibly modified) subtree back to the parent
    
  14. whew.
  15. Oh, wait, what is next_node()?
    def next_node(v):
      # "right, then left-most": the smallest key in v's right subtree
      return leftMost(right(v))

    def leftMost(v):
      if left(v) is None:
        return v
      return leftMost(left(v))
    
  16. OK, how long does this take? O(h)!
  17. Why does it find the next largest key after v?
    1. Everything to the right of v is larger than v.
    2. If v is a left child of parent(v), then key(parent(v)) > key(v). But if v has a right child, then key(right(v)) < key(parent(v)), so the next largest must be in that subtree.
    3. The smallest node in a subtree is the leftmost node.
    4. And by the way, the leftmost node can have no left child, so it can be deleted easily.
  18. Given what we know now, finding next(k) shouldn't be too bad:
    1. First we need to search for k.
    2. If that node has a right child, then next is just next_node(n): right, then left-most.
    3. But what if there is no right child?
    4. Then the next highest is the first ancestor that is to the right:
      def next(k):
        n = search(k, root())
        if n is None:
          return None
        if n.right is None:
          return greater_ancestor(n)
        else:
          return next_node(n)

      where...
        def greater_ancestor(n):
          return greater_a(root(), n, None)

        def greater_a(current, node, successor):
          # walk down from the root toward node, remembering the last
          # ancestor we turned left at (the closest key above node's)
          if current is None:
            return successor
          if node.key < current.key:
            # current is a potential successor; keep looking in its left subtree
            return greater_a(current.left, node, current)
          elif node.key > current.key:
            # the successor can't be here; keep looking in the right subtree
            return greater_a(current.right, node, successor)
          else:
            # node is found; the last left turn is the successor
            return successor
      
  19. The only thing left is prev(k), which is just the mirror image of next(k), sketched below.
  20. And thus next, and prev, take O(h).
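  For completeness, a sketch of prev(k): it mirrors next(k), with left/right and the comparisons swapped (lesser_ancestor here is the hypothetical mirror of greater_ancestor above):
      def prev(k):
        n = search(k, root())
        if n is None:
          return None
        if n.left is None:
          return lesser_ancestor(n)
        return rightMost(n.left)       # largest key in n's left subtree

      def rightMost(v):
        if v.right is None:
          return v
        return rightMost(v.right)

      def lesser_ancestor(n):
        return lesser_a(root(), n, None)

      def lesser_a(current, node, predecessor):
        # remember the last ancestor we turned right at
        if current is None:
          return predecessor
        if node.key > current.key:
          return lesser_a(current.right, node, current)
        elif node.key < current.key:
          return lesser_a(current.left, node, predecessor)
        else:
          return predecessor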