What is a Data Structure?
In your rooms in Bancroft, you have a large amount of stuff that has to fit into a small amount of space. While the organization of some of that stuff might have been dictated to you, at some point, somebody had to decide on the placement of things in your room. When they and you made these decisions, they were balancing some tradeoffs. Does it look neat? Does it conserve space? Are the things you have to access the most often easy and quick to reach?
Organizing data in a computer is no different; we have to be able to make decisions on how to store data based on understanding these tradeoffs. How much space does this organization take? Is the operation I want to perform fast? Can another programmer understand what the heck is going on here?
As a result, we have to start with intelligent ways of discussing these tradeoffs. Which means, we have to start with...
Algorithm analysis
In order to measure how well we store our data, we have to measure the effect that data storage has on our algorithms. So, how do we know our algorithm is any good? First off, because it works - it computes what we want it to compute, for all (or nearly all) cases. At some level, getting working algorithms is called programming, and at another it's an abstract formal topic called program correctness. But what we will mostly talk about here is: if we have two algorithms, which is better?
So, why should we prefer one algorithm over another? As we talked about, this could be speed, memory usage, or ease of implementation. All three are important, but we're going to focus on the first two in this course, because they are the easiest to measure. We'll start off talking about time because it is easier to explain; once you get that, the same ideas apply to space. Note that sometimes (often) space is the bigger concern.
We want the algorithm that takes less time. As soon as we talk about "more" or "less", we need to count something. What should we count? Seconds (or minutes) seems an obvious choice, but that's problematic: real time varies with conditions (like the computer, the specific implementation, or what else the computer is working on).
For example, here are two algorithms for computing \(f(n,x) = \sum_{k=1}^{n}\frac{x^8}{k!}\)
```java
public double f_alg1(double x, int n) {
    // Compute x^8
    double x8 = 1.0;
    for (int i = 0; i < 8; i++)
        x8 = x8 * x;
    double s = 0.0;
    for (int k = 1; k <= n; k++) {
        // Compute k!
        double f = 1.0;
        for (int i = 1; i <= k; i++)
            f = f * i;
        s = s + x8 / f;   // Add next term
    }
    return s;
}
```
```java
public double f_alg2(double x, int n) {
    // Compute x^8
    double x2 = x * x;
    double x4 = x2 * x2;
    double x8 = x4 * x4;
    double s = 0.0;
    double f = 1.0;
    for (int k = 1; k <= n; k++) {
        f = f * k;        // Compute k! from the previous factorial
        s = s + x8 / f;   // Add next term
    }
    return s;
}
```
We can use Java's System.currentTimeMillis(), which returns the current clock time in milliseconds, to time these two functions. Here are the running times (in milliseconds) for various values of n:
| n | f_alg1 (ms) | f_alg2 (ms) |
| --- | --- | --- |
| 0 | 0 | 0 |
| 10000 | 92 | 3 |
| 20000 | 353 | 6 |
| 30000 | 788 | 7 |
| 40000 | 1394 | 10 |
| 50000 | 2172 | 12 |
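These numbers came from a harness along the following lines (a minimal sketch; the class name, the choice of x, and making the method static are my choices, and the exact timings will differ from machine to machine):

```java
public class TimingDemo {
    // f_alg1 as defined above
    static double f_alg1(double x, int n) {
        double x8 = 1.0;
        for (int i = 0; i < 8; i++) x8 = x8 * x;
        double s = 0.0;
        for (int k = 1; k <= n; k++) {
            double f = 1.0;
            for (int i = 1; i <= k; i++) f = f * i;
            s = s + x8 / f;
        }
        return s;
    }

    public static void main(String[] args) {
        double x = 1.5;                              // any x works; 1.5 is arbitrary
        for (int n = 0; n <= 50000; n += 10000) {
            long start = System.currentTimeMillis();
            f_alg1(x, n);                            // swap in f_alg2 to time the other version
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(n + "\t" + elapsed + " ms");
        }
    }
}
```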
Both get slower as n increases, but algorithm 1 gets worse faster. Why does algorithm 1 suck? Because it recomputes k! from scratch on every pass through the loop, while algorithm 2 builds each factorial from the previous one with a single multiplication.
What happens if we run one of them on a different computer? Will we get the same times? Probably not-- processor speed, operating system, memory, and a host of other things impact how much time it takes to execute a task.
What we need, then, is a machine- and scenario-independent way to describe how much work it takes to run an algorithm. One thing that doesn't change from computer to computer is the number of basic steps in the algorithm (this is a big handwave, but generally true), so let's measure steps.
We are going to vaguely describe a "step" as a primitive operation, such as:
- assignment
- method calls
- arithmetic operation
- comparison
- array index
- dereferencing a pointer
- returning
It is important to note that this has a great degree of flexibility: one person's basic step is another person's multiple steps. To a hardware person, taking the NOT of a boolean is one step because it involves one bit, while adding two 32-bit numbers is 32 (or more!) steps. We won't be quite so anal.
So, when we determine how long an algorithm takes, we count the number of steps. More steps == more work == slower algorithm, regardless of the machine.
What does the following code do, and how many steps does it take?
| Code | Steps |
| --- | --- |
| max = a[0]; | 2: array index and assignment |
| i = 1; | 1: assignment |
| while (i < n) | n comparisons, plus (n-1) times through the loop |
| if (a[i] > max) | 2: array index and comparison |
| max = a[i]; | 2, but it doesn't always happen! |
| i++; | 2: i = i + 1 |
| return max; | 1: we'll count calls and returns as 1 |
This gives us: 2 + 1 + n + (n - 1)*(4 or 6) + 1.
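Assembled into a complete method, the code we just counted looks like this (a sketch; the method name findMax and passing the array in as a parameter are my choices, not part of the count above):

```java
// Returns the largest value in a; assumes a has at least one element.
public static int findMax(int[] a) {
    int n = a.length;
    int max = a[0];
    int i = 1;
    while (i < n) {
        if (a[i] > max)
            max = a[i];
        i++;
    }
    return max;
}
```

Called as findMax(new int[]{3, 1, 4, 1, 5}), it returns 5.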
Now, we're not quite done. In the code above, there are circumstances under which it will run faster than others. For example, if the array is in descending order, the code within the if statement is never run at all. If it is in ascending order, it will be run every time through the loop. We call these "best case" and "worst case" run times. We can also consider and talk about the "average" run time.
The best case is usually way too optimistic (most algorithms have best case times that are far away from average). Average case is often hard, because we need probabilistic analysis. In many cases, a worst case bound is a good estimate. That way we can always say, "no matter what, this program will take no more than this many steps."
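One way to see best and worst case concretely is to instrument the loop and count how often the body of the if actually runs (a quick sketch; the helper name countMaxUpdates is hypothetical):

```java
// Counts how many times the body of the if (max = a[i]) executes for a given input.
public static int countMaxUpdates(int[] a) {
    int updates = 0;
    int max = a[0];
    for (int i = 1; i < a.length; i++) {
        if (a[i] > max) {
            max = a[i];
            updates++;
        }
    }
    return updates;
}
```

For a descending array {5, 4, 3, 2, 1} this returns 0 (the best case); for an ascending array {1, 2, 3, 4, 5} it returns 4, once per loop iteration (the worst case).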
Comparing algorithms with bounds
In the program above, the number of steps it takes depends on the size of the array, n. What we really want is a function that takes the size of the array as input and returns the number of steps as output, which is exactly what we developed above: 2 + 1 + n + (n - 1)*(4 or 6) + 1.
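For example, in the worst case the body of the if runs every time, so each pass through the loop costs 6 steps and the total is \(2 + 1 + n + 6(n-1) + 1 = 7n - 2\) steps; in the best case each pass costs 4 steps and the total is \(2 + 1 + n + 4(n-1) + 1 = 5n\) steps.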
So when we're comparing algorithms, we don't compare numbers, but functions. But how do we compare functions? Which function below is "bigger"?
Well, sometimes one, and sometimes the other. So what we'll do is talk about one function dominating the other. We'll say that one function dominates another if it is at least as big as the other for every x value (for example, \(f(x) = x^2\) dominates \(g(x)=-x^2\)). Neither of the functions in the graph above dominates the other.
One function can dominate another over a particular range as well: \(f(x)=1\) dominates \(g(x) = x^2\) on the range \([0,1]\).
Now, when input sizes are small, everything is fast. Differences in run time are most relevant when there is a lot of data to process. So, we colloquially ask "When n is large, does one dominate the other?" Mathematically, we ask if there exists an \(n_0\geq 1\) such that for all \(n>n_0\), \(f(n) > g(n)\) (or \(g(n) > f(n)\)).

In this graph, \(f(x)=3x^2-8\) dominates \(g(x)=x^2\) everywhere to the right of \(x=2\).
How can we tell if one will dominate the other for large n?
It turns out it's really easy. It is all about growth rate. Take a look at this graph of \(f(n)= n^2\) vs. \(g(n) = 3n+4\).
What happens if we change the constants in \(g(n)\)? The line moves, but it doesn't change shape, and there's always some point at which f(n) passes it. Here for example is \(f(n)= n^2\) vs. \(g(n) = 30n+40\).
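You can watch this happen numerically. The sketch below searches for the first n at which \(n^2\) passes a line \(an + b\); the class name Crossover is just a placeholder, and the constants are the ones from the graphs above:

```java
public class Crossover {
    // Returns the smallest n >= 1 at which n^2 > a*n + b.
    static long crossover(long a, long b) {
        long n = 1;
        while (n * n <= a * n + b)
            n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(crossover(3, 4));    // prints 5:  25 > 3*5 + 4 = 19
        System.out.println(crossover(30, 40));  // prints 32: 1024 > 30*32 + 40 = 1000
    }
}
```

Changing the constants from 3 and 4 to 30 and 40 only pushes the crossover point from n = 5 out to n = 32; it never makes it disappear.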
As soon as we start caring about really big n, it doesn't matter what the constant coefficients are: \(f(n)\) will always find a point past which it dominates \(g(n)\) everywhere to the right. In each row of the following table, \(f(n)\) will dominate \(g(n)\) for some large enough \(n\):
| \(f(n)\) | \(g(n)\) |
| --- | --- |
| \(n^2\) | \(n\) |
| \(n^2-n\) | \(5n\) |
| \(\frac{1}{10^{10}}n^2-10^{10} n - 10^{30}\) | \(10^{10} n + 10^{300}\) |
Big-O Notation
We talk about the runtime of algorithms by categorizing them into sets of functions, based entirely on their largest growth rate.
The mathematical definition: if two functions \(f(n)\) and \(f'(n)\) are both eventually dominated by the same function \(c \cdot g(n)\), where \(c\) is any positive constant, then \(f(n)\) and \(f'(n)\) are both in the same set, \(O(g(n))\) (pronounced "Oh of g(n)"): \(f(n) \in O(g(n))\) and \(f'(n) \in O(g(n))\). Since the constant coefficients don't matter, we say that \(g(n)\) eventually dominates \(f(n)\) if there is ANY constant we can multiply \(g(n)\) by so that it eventually dominates \(f(n)\). \(f(n)=5n+3\) is obviously \(O(n^2)\). It is also \(O(n)\), since we could multiply \(n\) by 10, and then it would eventually dominate \(f(n)\).
The formal definition: for a function \(f(n)\), if there exist \(c > 0\) and \(n_0 \geq 1\) such that for all \(n > n_0\), \(f(n) \leq c \cdot g(n)\), then \(f(n) \in O(g(n))\).
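To make this concrete with the example from above, apply the definition to \(f(n) = 5n + 3\) and \(g(n) = n\): choose \(c = 10\) and \(n_0 = 1\). For every \(n > 1\) we have \(5n + 3 \leq 5n + 5n = 10n\) (since \(3 \leq 5n\) whenever \(n \geq 1\)), so \(f(n) \leq 10 \cdot g(n)\) and therefore \(f(n) \in O(n)\).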
While \(f(n)=5n+10\) is in \(O(n^2)\), it is also in \(O(5n)\). \(O(5n)\) is preferred over \(O(n^2)\) because it is closer to \(f(n)\). \(f(n)\) is ALSO in \(O(n)\), and \(O(n)\) is preferred over \(O(5n)\) because the constant coefficients don't matter, so we leave them off.