Project1: Sorting Algorithms
Due Feb 11 (Two weeks)
We are now going to take everything you know in Java and write a program that solves a useful problem.
Sorting algorithms are an important part of Computer Science that you will hear about repeatedly. There are two reasons for this:
- They do something useful. We use sorting algorithms all the time whether we realize it or not. Any time you see a database alphabetize a list, or a spreadsheet reorder data by one of its columns, a sorting algorithm has been called.
- They are computationally interesting. It seems like a simple concept to alphabetize a list of words. Tedious, but simple. The hard part of sorting is finding ways to sort data efficiently. This is a huge field of research in Computer Science. CS majors will revisit this topic in Algorithms class.
How do people sort a list of data?
Humans and computers think differently. Humans are good at pattern matching and slow at math. Given a series of 5-digit numbers, most humans can probably add 1-5 pairs per minute. But if I ask you to find the next number in the following series (pattern matching) most people will figure it out immediately:
1111
2111
3111
4111
?
Computers can add millions of 5-digit numbers per second, but have no innate ability to find the next number in the above pattern. You could write an AI program to try to find the pattern, but it might takes hours for the program to come up with a solution.
When you are writing something like a sorting algorithm, it is important to remember that computers think differently than you. You need to come up with algorithms that computers can run quickly and accurately, even if it is not the way a human would solve the same problem.
You can sort a short list in your head. Look at the five words below. Mentally alphabetize them.
- Zucchini
- Apple
- Pear
- Grapefruit
- Banana
Can you tell what algorithm your brain used to sort this list? Did you employ a strategy? Did you quickly filter the A and Z words to the ends? How many words did you hold in your head at a time? Do you think you could translate this algorithm to a computer program?
How do computers sort a list?
Computers can compare any two items extremely quickly, but cannot look at more than two items at the same time. A CPU can compare Apple to Zucchini, but cannot simultaneouly think about Pear. Computer Scientists and Mathematicians have worked for decades to turn this simple A-to-B comparison into fast sorting algorithms. This Wikipedia page explains about 30 of them. You do not have to read the whole page, but go ahead and skim over the tables of algorithms just to get an idea of how many different methods scientists have found. They are listed in a table that compares how much time and memory they use relative to each other. Some algorithms are more computationally-intensive than others. The goal is generally to find a method for sorting your data in the fewest clock cycles.
For Project 1, you are going to implement two different sorting algorithms: Bubble Sort and Merge Sort. You are going to run them across text files to alphabetize the words in them. And you are going to use a timer to see how long it takes to sort the same list with each algorithm.
Bubble Sort
Read this Wikipedia article on Bubble Sort. Make sure that you understand the gist of it. Watch through the animation and try to understand what the algorithm is doing. You have no chance of writing code to replicate it unless you can understand what the algorithm is doing, and why this method of sorting works.
Bubble Sort works by making multiple passes through your list of data, and reordering each pair of words. In each pass, it will make (N-1) comparison, where N is the number of items in the list. The Bubble Sort keeps making passes through the list until it finally makes a pass where nothing is changed. That means the list is completely sorted.
I recommend working throught the Bubble Sort yourself until you understand it. Write the list of five words above on five post-it notes and stick them to your desk in any order. Now start making passes through them, moving the notes as required. Swap 1&2, 2&3, 3&4, 4&5, and keep running through the list until you make one pass without having to swap any words. Once you can do this reliably, you are ready to start writing code.
Note that the Bubble Sort can sort your data in place. The only thing you do with the list is to swap two data items. This means that the algorithm can run without you having to allocate more memory every time you change the order. This is true even with the post-it notes - you can swap the position of two notes without having to move them to a new space on your desk. This greatly simplifies the data structure that you need to pull this off.
It is probably best to implement a Bubble Sort on an Array, since Arrays let you access any member of the data. e.g. you can compare array[3] and array[4] without having to pop() them and store them in different variables. You can also swap the values in array[3] and array[4] easily. This would be much harder to implement with a Queue.
The purpose of this lab is to implement the sorting algorithms your self, so you cannot use any built-in Array.sort() or similar methods.
Merge Sort
Read this Wikipedia article on Merge Sort. Once again, make sure that you understand the gist of it. Watch through the animation and try to understand what the algorithm is doing. You should not even try to start writing this code until you understand what the algorithm is doing, and why this method of sorting works.
Merge Sort works by breaking a list of data into many small lists, and then merging those lists in a sorted order. It merges two lists by comparing the first item in each list, and choosing the smallest (or largest) of the two.
Merge sort begins by breaking a list of N items of data into N separate lists of one item each.
In pass 1, it merges two lists (of length 1) into one list (of length 2). It cannot perform this sort 'in place' - it needs to allocate memory for a new list of length 2. It compares the first item in each of the small lists(using peek()), and pops the smaller of the two. It pushes this popped value to the new list. It pops the remaining item and adds it to the new list. The two old lists are now of size zero, and are deallocated. The program continues to the next pair of lists (of length 1) until they have all been processed into new lists (of length 2).
In pass 2, it merges two lists (of length 2) into one list (of length 4). It needs to allocate memory for a new list of length 4. It compares the first item in each of the small lists, and pops the smaller of the two. It then compares the first item in the small lists again, and pops the smallest of the two. It continues until both small lists are empty. The two old lists are now of size zero, and are deallocated.
Each intermediate list is sorted as it is created.
Eventually, Merge Sort will have only one list remaining. That list has been completely sorted, and is the final answer from the sorting algorithm.
Note that Merge Sort does not sort in place like Bubble Sort. Lists are continually allocated and deallocated. The original list is destroyed during the sort.
It is easiest to think of Merge Sort working on a list of data items with length is a power of two. It will actually work on data of any length. You will need to make sure that your implementation does not require the length to be a power of two. That is not hard - if there are an odd number of lists for any one pass, you can just ignore the last one. You will work it back in on the following pass.
I recommend working throught the Merge Sort yourself until you understand it. Use the same post-its notes as above. Notice that merging two lists is easier if you physically move the notes to a separate part of your desk. This is exactly what the computer does when it allocates new memory. Once you can do this reliably, you are ready to start writing code.
Queues work very well for doing Merge Sort. You use the push() and pop() functions to move data from one queue to another, and peek() to compare the first two items in your pair of queues.
We have already implemented our own version of a Queue for Lab 2, back before we knew about member functions and constructors. It should be the work of a couple minutes to make push(), peek(), and pop() into member functions of your Queue, so it's a bit cleaner to use.
Here are links to a complete Lab2 solution, in case you do not trust your own Queue implementation. You can either modify this one or use your own. You may not use the Java built-in Queue or LinkedList data structure.
The Assignment and Requirements
Write a program that reads a text file and breaks the words into an array of strings. Your program will then sort the list alphabetically with either the Bubble Sort or Merge Sort algorithm, based on what the user requests from the command-line. Your program will output the sorted list to a file, and print out a report indicating the algorithm used, the number of strings, and the time required to complete the sorting.
Sorting should be done independent of capitalization, so Apple and apple would be treated as the same word.
You will need to implement the following classes. Each must be saved in its own file:
- Project1 This is the main class that runs your program. It interacts with the WordList class. It is not allowed to interact directly
with the Queue. This class must do the following:
- Parse the command-line arguments, as described below
- Read text from a data file into an array of Strings
- Passes the array of Strings to the WordList constructor
- Request the WordList to sort the data alphabetically using either sorting algorithm
- Record the time required to run the sort
- Save the sorted data to a file
- Print a report indicating:
- the name of the file in,
- the name of the output file,
- the number of words,
- the sorting algorithm used, and
- the time required to complete the sort
- WordList This class reads an array of strings in its constructor
and stores the array as a class variable. It must have the following
components:
- A constructor that accepts an array of strings
- A local variable that stores the array of strings
- A bubbleSort() function that sorts the local array in place. This function does not accept any variables or return anything. The order of the strings in the member array is modified after it runs.
- A mergeSort() function that sorts the strings in the local array and returns a separate array. This function does not modify the order of the strings in the member array. This function does not accept any variables. It returns a new string array that has all of the original strings sorted. (NOTE - The purpose of making these two functions handle the class's array member differently is to show you that there are different ways of solving this problem. In a real application, you would expect all sorting algorithms to handle data in the same way.)
You may need to implement additional classes, such as Queue. Any classes that you implement must be stored in their own files. Storage classes such as Queue should be manipulated directly by the WordList object, not by the Project1 object.
Command-line input
You program must read three data items from the command-line. Input must be from the command-line only; you are not allowed to use a Scanner object. The following specs show how the command-line must function:
java Project1 <alg-to-use> <input-filename> <output-filename>
- <alg-to-use> must be either '-b' or '-m'. This indicates whether to use the Bubble Sort or Merge Sort algorithm.
- <input-filename> is the path the to input file. You can pass it any readable text file. Sample text files are included below.
- <output-filename> is the name of the output file to write. The file location must be writable. If the file already exists, it will be overwritten.
examples:
java Project1 -b alice.txt mysortedwords1.txt
java Project1 -m declaration.txt mysortedwords2.txt
Error-handling
You program must handle the following errors. It should print out an appropriate message following and error and exit gracefully.
- Incorrect number of command-line variables. Expecting three.
- First command-line variable must be either '-b' or '-m'.
- Second command-line variable must be a readable file. Print an error if unable to find a readable file of that name.
- Third variable must be a writable file. Print an error if unable to write to a file of that name.
Sample Data Files
Here are the data files that we will be testing with:
Sample Output
mid@zee:~/ic211/$ java Project1 -b fox.txt sortedfox.txt ========================================== Project1 Sorting Report: File in: fox.txt File out: sortedfox.txt Word count: 9 Algorithm: Bubble Sort Time: 1 ms ==========================================
Code Samples
For testing how words compare alphabetically to each other, use this command: String.compareToIgnoreCase()
The following code shows you how to read files, write files, and run a timer. We will discuss the try/catch block in a few weeks. Basically, it catches input errors such as detecting files that do not exist, or files in directories that you do not have permissions to read. This block lets the program exit gracefully if it cannot read a file. All other commands that you need have been covered in class.
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;
public class FileIO {
public static void main(String[] args) {
try {
// Read file to string array
String array[] = readFile(args[0]);
// Start timer
Long time0 = System.currentTimeMillis();
// End timer
Long time1 = System.currentTimeMillis();
System.out.printf("Timer ran for %d ms.\n", time1-time0);
// Write string array to file
writeFile(args[1], array);
}
catch (IOException ioe) {
System.out.println("ERROR: " + ioe.getMessage());
System.exit(1);
}
}
public static String[] readFile(String filename) throws IOException {
Scanner in = new Scanner(new File(filename));
Queue q = new Queue();
while (in.hasNext()){
q.push(in.next().toLowerCase());
}
return q.toArray();
}
public static void writeFile(String filename, String[] array) throws IOException {
PrintWriter pw = new PrintWriter(new File(filename));
for (int i=0; i<array.length; i++) {
pw.println(array[i]);
}
pw.close();
}
}
Q & A
If anybody asks questions that the rest of the group should hear, the answers will go here.
Q: Regarding the timer, where do you want it placed? Do you want to time the whole program or just a part of it?
A: You should time just the sorting algorithm. The purpose of the timer is to see how Merge Sort and Bubble Sort differ. In pseudo-code, the Project1 main() function should look something like this:
...
WordList w = new WordList(array);
startTimer();
w.sort();
endTimer();
generateReport()
...
Clarification
It says above in 'Command-line input' that you cannot use a Scanner object, yet the sample code uses a Scanner to read the file. To clarify - you cannot use a Scanner object to receive input from the user at a prompt. All user-input must come from the command-line. You may use a Scanner to read a file if you choose to do so.
As discussed in Lab5, you must use javadoc to document your files.