IC210: Sorting and Searching I

Reading

Section 7.3 of Problem Solving with C++.

Indices rather than elements

Often with arrays it's more powerful to return the index of an element with a certain property, rather than the value of the element. For example, if string *NAME is an array of names of contest participants, and int *SCORE is an array of scores, such that participant NAME[i] has score SCORE[i], then knowing the value of the largest element in SCORE won't tell me who the winner is, but knowing the index of the largest element in SCORE will.

So, with this in mind, let's consider writing a function indexOfMax that will return the index of the element with the maximum value in an array A of N objects of type int.

int indexOfMax(int *A, int N)
{
  int imax = 0;
  for(int i = 1; i < N; i++)
    if (A[imax] < A[i])
      imax = i;
  return imax;
}

Very simple function! Now, using our above example, the winner of our contest is NAME[indexOfMax(SCORE,N)].

Sorting using Selection Sort

Consider the following function that uses our indexOfMax function:

void mystery(int *A, int N)
{
  for(int length = N; length > 1; length--)
  {
    int k = indexOfMax(A,length);
    int temp = A[length-1];
    A[length-1] = A[k];
    A[k] = temp;
  }
}

What does the mystery function do? Well, it starts by swapping the largest element in the array with the last element of the array, and then pretends the last array slot isn't there any more. Hopefully, you see that this puts the largest at the back of the array, the next largest in the second to last spot, etc. util the array is in sorted order! This sorting algorithm is known as Selection Sort, because it selects the largest of the remaining elements and puts it in its proper spot in the sorted array. If you define the swap function (as in Class 18), you can write a particularly succinct version of this function:

void selectionsort(int *A, int N)
{
  for(int length = N; length > 1; length--)
    swap(A[indexOfMax(A,length)],A[length-1]);
}

This is actually a subtle and important function for the way it uses pass-by-reference on array elements. Make sure you understand how it works!

Comparison Functions and Sorting

It is important to note that there is no one God-given "sorted order". We get different orders depending on how we compare values. To make this a litle more explicit, let's imagine our indexOfMax and selectionsort functions this way:

int indexOfMax(int *A, int N)
{
  int imax = 0, i;
  for(i = 1; i < N; i++)
    if (before(A[imax],A[i]))
      imax = i;
  return imax;
}

void selectionsort(int *A, int N)
{
  for(int length = N; length > 1; length--)
    swap(A[indexOfMax(A,length)],A[length-1]);
}

... where the comparison is now done by a function before rather than A[imax] < A[i]. Now, if before is defined like this:

bool before(int a, int b)
{
  return a < b;
}

it tells you that you want a to come before b if a is smaller, i.e. we've got our same old smallest-to-largest sorted order. If, on the other hand, before is defined like this:

bool before(int a, int b)
{
  return a > b;
}

the we'll get largest-to-smallest sorted order. Think of before as a comes before predicate. I use before to tell selection sort how to determine if one value should come before another element. Perhaps I want to sort numbers by increasing absolute value, with the negative number coming first if we have opposite pairs in the array. I define before like this:

bool before(int a, int b)
{
  if (a != -b)
    return abs(a) < abs(b);
  else if (a <= 0)
    return true;
  else
    return false;
}

Now we get an ordering by smallest-to-largest absolute values, with ties (i.e elements with equal absolute value and opposite signs) putting the positives to the back. For example,

45  32  -12  -32 0 -18 6

would be "sorted" using this before function as:

0  6  -12  -18 -32 32 45

This is an important concept. The sorted order produced by a sorting algorithm like Selection Sort depends on the definition of your before function - the same algorithm works for any comparison function! In fact, when we get more advanced, we'll even pass the before function as a parameter to selection sort! Of course, we don't know how to do that yet!

Note: Most places you see selection sort they don't break indexOfMax and swap out into their own functions. Instead, they incorporate the whole thing inside the loop, like this:

void selectionsort(int *A, int N)
{
  for(int length = N; length > 1; length--)
  {
    // Find imax, the index of the largest
    int imax = 0, i;
    for(i = 1; i < length; i++)
      if (before(A[imax],A[i]))
	imax = i;
    // Swap A[imax] & the last element
    int temp = A[imax];
    A[imax] = A[length - 1];
    A[length - 1] = temp;
  }
}

The Same Sorting Algorithm Works on Arrays of Any Type!

We might decide that we want to sort arrays of objects of type string rather than int. What has to change in our selection sort algorithm? Not much! As you can see from either this program (separate functions) or this program (all in one), nothing really changes except the names of the types we're working with!

Now, by way of excercise, consider this problem: When I input strings with capital letters, capitals are counted as coming before lower-case. For example, here's a run of my program:

Enter number of strings: 4
Enter strings: Xena adroit lefty xylaphone
Xena
adroit
lefty
xylaphone

Quite possibly, this is not what I want. How can I change the program so that capitalization is not taken into account when sorting, although when I print the strings they should have their original capitalization? To make it simple, let's assume that the only possible capitals would be the first letter. Hopefully you see by now that all this requires is changeing the fuunction before.

old version of before

new version of before

bool before(string a, string b)
{
  return a < b;
}

bool before(string a, string b)
{
  // Remember we pass by value!
  if ('A' <= a[0] && a[0] <= 'Z')
    a[0] = a[0] + ('a' - 'A');
  if ('A' <= b[0] && b[0] <= 'Z')
    b[0] = b[0] + ('a' - 'A');
  return a < b;
}

What we did here was to simply make the 1st letter lower-case if it was capital for strings a and b. Note that this is okay because a and b are passed by value. If they were passed by reference, we wouldn't be able to print out with capitalization. Using this version of the program, we get:

Enter number of strings: 4
Enter strings: Xena adroit lefty xylaphone
adroit
lefty
Xena
xylaphone

Sorting Arrays of Arrays, and what "swap" does in this case!

Since we can sort arrays of any type of object, how about sorting arrays of arrays? For example, let's say that we have a data file like data.txt, where each row consists of three numbers defining a 3-dimensional vector. I want to read in the values in a such a data file and print out the vectors in increasing order of magnitude (i.e. sqrt(x^2 + y^2 + z^2)). Hopefully you see that this doesn't require too many changes to the above program. One key thing to see however, is what swap(a,b) does when a and b are pointers. Hopefully this picture will give you an idea:

The pointers change, but the blocks of memory that really constitute the arrays remain unchanged!

Search

Searching for values in arrays is another fundamental operation. The basic format is search(A,N,x), where we search for value x in the array A of N elements, and return the index of an element of A that matches the value x. If no such element is found, an index of N may be returned and, since it is not a valid index, the caller of the function can determine that no match was found. For example, to search in an array of objects of type string, I'd define the following function:

int search(string *A, int N, string x)
{
  int i = 0;
  while(i < N && A[i] != x)
    i++;
  return i;
}

As with before in sorting, we may imagine different ways in which elements are considered to match. The above search, for instance, does not consider two strings equal if they differ if one is in capitals and the other in lower case. We can address in the same way as before, by using a function match instead of == to determine whether two objects match:

int search(string *A, int N, string x)
{
  int i = 0;
  while(i < N && !match(A[i],x))
    i++;
  return i;
}

... and then changing the match function to correspond to different ideas of what it means for two objects to match. Meanwhile, the search function is always the same! [Note how short-circuit evaluation is an integral part of search. Without it, the function would not work!]

Tip: It may not seem natural or obvious to return an int equal to the size of the array when a search fails. Why not -1? Why not size-of-the-array + 1? Certainly both of those work, as do many other options. However, experience has shown that returning the array size on fail is a nice way to do it and this approach has become pretty standard. The fact that it is pretty standard is really the most compelling argument of all. In the absence of compelling reasons to do otherwise, shouldn't we write our code to look like what people expect and are used to seeing?

Problems

Write a program that reads in a list of 10 names (first name followed by last name) from the file names.txt and prints them out in the usual order - i.e. alphabetically by last name, using first names to break ties. Note: You'll want to represent a person's name as an array of length 2.
Here's my solution
Write a program that reads in a list of 10 names (first name followed by last name) from the file names.txt, then repeatedly queries the user for a first name and prints out all names with first-name matching the query value. A run of the program might look like this:
```
~/$ ./prog
Name (or quit): John
John Adams
John Tyler
Name (or quit): James
James Madison
James Monroe
Name (or quit): Steve
Name (or quit): George
George Washington
Name (or quit): quit
```
Here's my solution