SI 204 Spring 2017 / Labs


Lab 03: Books

This lab will exercise your skill with loops and reading files. Some useful references will be the Unit 3 notes and, as always, the C functions reference for this class.

Submit in the normal way, being sure to put your files in a directory called lab03.

In this lab, you’re going to do some basic file analysis on some famous books. Download this tarball to a lab03 directory, and untar it with the command tar xzf books.tgz. That will create a directory called books that has five text files in it, all freely available books from Project Gutenberg.

1 Word Count

Each of the files in the books tarball you downloaded is in a similar format. Open up one of the text files such as alice.txt in a text editor and check it out; the format is roughly like this:

(... header information ...)

*** START OF THIS PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES IN WONDERLAND ***


ALICE’S ADVENTURES IN WONDERLAND

Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0


CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the

(..the rest of the book...)

*** END OF THIS PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES IN WONDERLAND ***

(... footer information ...)

Of course, we mostly care about the actual book! The key is that three asterisks *** in a row appear twice in the line right before the start of the book, and twice in the line right after the end of the book, and nowhere in between. So, for any of these books, you want to examine the words that appear between the second and third occurrence of the string "***".
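One way to picture the marker rule is a counter that tracks how many "***" words have gone by so far. The sketch below is only an illustration under some assumptions: it uses standard C (fscanf and strstr) rather than the class library, the function name count_book_words and the 128-character buffer are made up for this example, and it treats any whitespace-separated word containing "***" as a marker.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the class library): counts the
 * whitespace-separated words that lie between the second and third
 * word containing "***". Reads from an already-open stream. */
int count_book_words(FILE *in) {
  char word[128];   /* assumed upper bound on word length */
  int markers = 0;  /* how many "***" words have been seen so far */
  int count = 0;    /* words counted inside the book body */

  while (fscanf(in, "%127s", word) == 1) {
    if (strstr(word, "***") != NULL) {
      markers++;
      if (markers == 3) {
        break;  /* third marker: the book body has ended */
      }
    } else if (markers == 2) {
      count++;  /* this word sits between the second and third marker */
    }
  }
  return count;
}
```

The marker words themselves are never counted, and everything before the second marker or after the third is skipped. A word longer than 127 characters would be split in two by the width limit, which this sketch assumes does not happen in these books.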

Write a program called count.c that reads in a filename for a book, and reports the number of words in the book, ignoring the header and footer information.

Example:

roche@ubuntu$ ./count
Enter a filename: books/alice.txt
Word count: 26460

2 Average Word Length

Now write a program to report the average number of letters (characters) per word. Copy your program from the previous part to a new file called wlen.c.

In C, you can get the length of a single word by using the strlen function. For example, to read in a word from the user and print out its length, you would do:

cstring s;
readstring(s, stdin);
int length_of_s = strlen(s);
writenum(length_of_s, stdout);
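Computing the average itself has one common pitfall: dividing one integer by another in C throws away the fractional part. The following is a minimal sketch in standard C, assuming the words are already sitting in an array; the name average_word_length is hypothetical and not part of the class library.

```c
#include <string.h>

/* Hypothetical helper: average number of characters per word for an
 * array of n C strings. Returns 0.0 when n is 0 to avoid dividing by zero. */
double average_word_length(const char *words[], int n) {
  long total_chars = 0;
  for (int i = 0; i < n; i++) {
    total_chars += (long)strlen(words[i]);
  }
  if (n == 0) {
    return 0.0;
  }
  /* Cast before dividing: long / int would truncate to a whole number. */
  return (double)total_chars / n;
}
```

For the three words "Alice", "was", and "beginning" this gives 17 / 3, or about 5.67, where plain integer division would give 5. In your program the words come from a file one at a time, so you would keep a running total and a running count instead of an array.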

Here’s how your program should work for this part:

roche@ubuntu$ ./wlen
Enter a filename: books/sherlock.txt
Average word length: 4.35476
roche@ubuntu$ ./wlen
Enter a filename: books/narrative.txt
Average word length: 4.4941

Show your instructor your progress at this point and don’t forget to submit to save your work.

3 Sentence Length

The rest of this lab is optional for your enrichment and enjoyment. Be sure to work on this only after getting everything else perfect. You should still submit this work, but make sure it doesn't break any of the parts you already had working correctly!

Next, write a program to report the average number of words per sentence. Copy your program from the previous part to a new file called slen.c.

We’ll assume sentences always end with one of the following three characters: . ! ? and that only ends of sentences contain those characters (this is not entirely accurate, but it is good enough for our purposes). So you are looking for words that contain one of those three characters.

In C, the function strchr can be used to search for a particular character in a given string. It returns a null pointer, which counts as 0, if the character is not in the string, and otherwise it returns a pointer to the first occurrence of the character, which is nonzero. Remembering that in C 0 is treated as false and anything nonzero is treated as true, we can use strchr like this:

cstring s;
readstring(s, stdin);
if (strchr(s, '!')) {
  fputs("You seem excited...\n", stdout);
}
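Checking for all three sentence-ending characters can be done with three strchr calls joined by ||, or with the standard library function strpbrk, which searches for any character from a set. The sketch below takes the second route; the name ends_sentence is made up for this example.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the word contains '.', '!' or '?',
 * and 0 otherwise. strpbrk returns a null pointer when none of the
 * listed characters appear in the string. */
int ends_sentence(const char *word) {
  return strpbrk(word, ".!?") != NULL;
}
```

With this, ends_sentence("done.") is 1 and ends_sentence("Alice") is 0.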

Important: The last sentence in the book might not end with a word that has a ., !, or ?. Your program should account for this possibility and still compute the correct average sentence length!
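One possible way to handle that last sentence is to remember how many words have been seen since the most recent sentence ending, and count one extra sentence if any are left over at the end. This is a sketch under assumptions, not the required approach: it works on an array of words rather than a file, and the name average_sentence_length is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: average words per sentence for an array of n
 * words, where a sentence ends at any word containing '.', '!' or '?'. */
double average_sentence_length(const char *words[], int n) {
  int sentences = 0;
  int pending = 0;  /* words seen since the last sentence ended */

  for (int i = 0; i < n; i++) {
    const char *w = words[i];
    pending++;
    if (strchr(w, '.') || strchr(w, '!') || strchr(w, '?')) {
      sentences++;
      pending = 0;
    }
  }
  if (pending > 0) {
    sentences++;  /* an unterminated final sentence still counts */
  }
  if (sentences == 0) {
    return 0.0;   /* no words at all: avoid dividing by zero */
  }
  return (double)n / sentences;
}
```

For the six words "Hello", "world.", "How", "are", "you?", "Fine" there are two terminated sentences plus one leftover, so the result is 6 / 3 = 2.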

Here are two example runs:

roche@ubuntu$ ./slen
Enter a filename: books/metamorphosis.txt
Average sentence length: 26.5156
roche@ubuntu$ ./slen
Enter a filename: books/huckfinn.txt
Average sentence length: 17.9725

Don’t forget to submit and show off your program to your professor if you finish this part during the lab time.

4 Multiple Files

Now let’s put it all together and write a program to produce all three statistics, looping over multiple files. Copy your program from the previous part to a new file called multiple.c.

This program will report the word count, average word length, and average sentence length for a series of files. Your program should keep asking for file names until the user enters quit.

(Note that the counts are for that one book only; they are not cumulative over all the books entered so far.)
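The outer loop here is a sentinel loop: keep reading names until the special word quit shows up. Below is a minimal sketch in standard C. The names prompt_loop and report_book are made up, report_book is only a stand-in for the real analysis, and the 256-character filename buffer is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the per-book analysis. A real version would
 * open the file and compute all three statistics starting from zero. */
static void report_book(const char *filename, FILE *out) {
  fprintf(out, "(statistics for %s would go here)\n\n", filename);
}

/* Prompts on `out` and reads filenames from `in` until "quit" or the
 * end of input. Returns how many filenames were processed. */
int prompt_loop(FILE *in, FILE *out) {
  char name[256];  /* assumed upper bound on filename length */
  int processed = 0;

  for (;;) {
    fputs("Enter a filename: ", out);
    if (fscanf(in, "%255s", name) != 1) {
      break;  /* end of input */
    }
    if (strcmp(name, "quit") == 0) {
      break;  /* sentinel word: stop asking */
    }
    report_book(name, out);
    processed++;
  }
  return processed;
}
```

Calling prompt_loop(stdin, stdout) would run it interactively. Keeping the counters as local variables inside the per-book function (or declaring them inside the loop body) means they start over for each book, which gives the non-cumulative behavior described above.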

Here’s an example run:

roche@ubuntu$ ./multiple
Enter a filename: books/sherlock.txt
Word count: 104519
Average word length: 4.35476
Average sentence length: 14.3511

Enter a filename: books/metamorphosis.txt
Word count: 22114
Average word length: 4.3652
Average sentence length: 26.5156

Enter a filename: books/narrative.txt
Word count: 40773
Average word length: 4.4941
Average sentence length: 18.4077

Enter a filename: quit