Reading

Required: These notes!
Recommended: Java in a Nutshell, TO APPEAR

Overview

We've done a few things with I/O in Java, but we've never looked at it systematically. So ... let's do that now! Not only is it something you should know as a Java programmer, but it gives us a nice case study in object-oriented design. So we're going to first try to understand the problem: what requirements do we really have of an I/O system? Then we'll see how the object-oriented design of the Java IO package meets those goals.

Reading ... What's the problem?

I/O (input and output) is kind of what programs are all about. If we couldn't instruct the program as to our intentions or we couldn't somehow perceive the results produced by the program, what would be the point in running it? In this lesson we look at byte and character I/O — I/O concerning files, buffers in memory, network connections, things like that. GUI I/O will be covered later.

Designing a system for handling I/O is a daunting problem for language/library/API developers. These operations are so ubiquitous in programs that getting it wrong means making pretty much every program anyone ever writes more difficult, and getting it right means making pretty much every program anyone ever writes easier. So, let's consider some desirable properties, some design goals, regarding reading, i.e. the "input" half of I/O. In particular, we'll look at stream-oriented input and output.

  1. Bytes or chars or tokens? Fundamentally, all data in the modern computing world is byte-oriented, and oftentimes we want to read data that way. On the other hand, often we want to read textual data, i.e. data that is character-oriented. In C/C++, where chars and bytes are synonymous, we don't need to distinguish between the two. In Java, however, where characters are Unicode and not generally single-byte values, we very much have to distinguish between the two. Sometimes the programmer will want to read bytes, sometimes characters, and sometimes tokens, like the textual representations of doubles or booleans.
  2. Treat multiple sources of input in a uniform way. Input can come from many different sources: files, network connections, standard in, strings, arrays of bytes, arrays of characters, pipes, etc. We saw in IC210 that writing functions that took an istream argument, which could equally take an input file stream (ifstream) or cin, was a powerful idea. This is just an example of the power of treating multiple sources of input in a uniform way, i.e. with one programming construct that applies to many different actual sources of input.
  3. Flexibility to add new operations, improve efficiency, or modify input streams on the fly. With something as universally used as stream-oriented I/O, there's no way to design a system that will meet everyone's needs all of the time. Therefore, the system needs to provide programmers the flexibility to change how things work without abandoning or breaking the whole system.

Bytes, Chars or Tokens?

The Java I/O design addresses goal 1 from the list above in a typically OOP way, by having three separate classes of input objects.

class InputStream: for byte-based input.
    int read(); // should be byte, but uses int
                // to return -1 to signal EOF
    int read(byte[] b, int off, int len); // returns the number of bytes read, or -1 at EOF
    void close();

class Reader: for character-based input.
    int read(); // should be char, but uses int
                // to return -1 to signal EOF
    int read(char[] b, int off, int len); // returns the number of chars read, or -1 at EOF
    void close();

class Scanner: for token-based input.
    String next();
    int nextInt();
    double nextDouble();
    ...

As a programmer, you have to figure out how you want to read data in. Would you like to read a byte or chunk of bytes at a time? Then you want an InputStream object. Would you like to read a character or chunk of characters at a time? Then you want a Reader object. Would you like to read a token at a time, e.g. a double at a time or String at a time or boolean at a time? Then you want a Scanner object.
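To make the three views concrete, here is a minimal sketch that looks at the same short text as bytes, as chars, and as tokens. The class name ThreeViews and its helper methods are made up for this illustration; only InputStream, Reader, Scanner and the concrete source classes come from the Java API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ThreeViews {
    // Byte view: call read() until it returns -1 (end of input).
    static int countBytes(InputStream in) throws IOException {
        int n = 0;
        while (in.read() != -1) n++;
        return n;
    }

    // Character view: the same loop shape, but each read() yields one char.
    static int countChars(Reader in) throws IOException {
        int n = 0;
        while (in.read() != -1) n++;
        return n;
    }

    // Token view: by default Scanner splits the input on whitespace.
    static int countTokens(Scanner sc) {
        int n = 0;
        while (sc.hasNext()) {
            sc.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        String text = "42 3.5 true";
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println("bytes:  " + countBytes(new ByteArrayInputStream(bytes)));
        System.out.println("chars:  " + countChars(new StringReader(text)));
        System.out.println("tokens: " + countTokens(new Scanner(text)));
    }
}
```

For plain ASCII text like this, the byte count and the char count agree; the token count is what differs.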

Treat multiple sources of input in a uniform way

Here's how the Java IO system uses OOP to allow multiple sources of input to be treated in a uniform way. First of all, the InputStream and Reader classes are abstract. They are the roots of class hierarchies. Specific sources of bytes of data give rise to classes that extend InputStream. For example, in the API we have:
            InputStream
               /  \
              /    \
             /      \
FileInputStream   ByteArrayInputStream
So if, for example, you want to write code to search for the bytes 0x7F 0x45 0x4C 0x46 (0x7F followed by the ASCII codes for "ELF"), which indicate the beginning of a Unix executable, you would write your method to take an InputStream argument. That way the same method works for both files and byte arrays.
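Here is one way such a method might look. This is only a sketch: it reports whether the four bytes occur anywhere in the input, and the names ElfFinder, findELF and containsELF are chosen for this illustration rather than taken from the API.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ElfFinder {
    // The four bytes we are looking for: 0x7F 'E' 'L' 'F'.
    static final int[] MAGIC = {0x7F, 0x45, 0x4C, 0x46};

    // Works for ANY InputStream: a file, a byte array, a network connection ...
    static boolean findELF(InputStream in) throws IOException {
        int matched = 0; // how many bytes of MAGIC we have matched so far
        int b;
        while ((b = in.read()) != -1) {
            if (b == MAGIC[matched]) {
                matched++;
                if (matched == MAGIC.length) return true;
            } else {
                // Restart. This simple rule is enough here because 0x7F
                // appears only at the start of MAGIC.
                matched = (b == MAGIC[0]) ? 1 : 0;
            }
        }
        return false;
    }

    // Convenience wrapper for in-memory data.
    static boolean containsELF(byte[] data) {
        try {
            return findELF(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {0x00, 0x7F, 0x45, 0x4C, 0x46, 0x02};
        System.out.println("byte array: " + findELF(new ByteArrayInputStream(sample)));
        // Any file names given on the command line go through the same method.
        for (String name : args) {
            try (InputStream in = new FileInputStream(name)) {
                System.out.println(name + ": " + findELF(in));
            }
        }
    }
}
```

Notice that findELF never mentions files or arrays. The caller decides where the bytes come from.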

Similarly, specific sources of characters give rise to classes that extend Reader. For example, in the API we have:

              Reader_____
             /   \       \____
            /     \           \
           /       \           \
StringReader  CharArrayReader  InputStreamReader

So, if you wanted to write code to count the number of non-alphabetical characters in text, you would write that method to take a Reader argument. That way the same method works for Strings, arrays of chars and (and this is interesting!) any InputStream — because we can make an InputStream the source of characters for a Reader via the InputStreamReader class! If you look at the API documentation for InputStreamReader, the InputStreamReader constructors take an InputStream as a parameter.
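A sketch of such a method, called on three different sources of characters, might look like the following. The class name NonAlphaCounter and the helper countNonAlphaIn are invented for this example, and "non-alphabetical" is taken here to mean "not a letter according to Character.isLetter", which is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.CharArrayReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class NonAlphaCounter {
    // Works for ANY Reader, whatever the ultimate source of its chars is.
    static int countNonAlpha(Reader in) throws IOException {
        int count = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (!Character.isLetter(c)) count++;
        }
        return count;
    }

    // Convenience wrapper for Strings.
    static int countNonAlphaIn(String text) {
        try {
            return countNonAlpha(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "Hi, there!";

        // Source 1: a String
        System.out.println(countNonAlpha(new StringReader(text)));

        // Source 2: an array of chars
        System.out.println(countNonAlpha(new CharArrayReader(text.toCharArray())));

        // Source 3: an InputStream, turned into a Reader by InputStreamReader
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(countNonAlpha(
            new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)));
    }
}
```

The third call is the interesting one: the bytes could just as well have come from a FileInputStream.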

Finally, we have our good friend the Scanner.
             Scanner
Class Scanner has constructors that take InputStreams or Readers as arguments. So if you wanted to write code to do something like add all the integers in some text, you would write that method to take a Scanner as an argument. That way it would work with files, byte arrays, char arrays or Strings.

Two important points here:
  1. Technically there is no Scanner constructor that takes a Reader as a parameter. Instead, it takes an object that implements the Readable interface. However, Reader implements Readable, so this constructor works with Readers, but is in fact a bit more general than that.
  2. The Scanner constructor that takes an InputStream as an argument is actually just a convenience thing. You only really need the constructor that takes a Reader as an argument. Why is that enough?

Putting this together, if you have a file whose name is "data.txt" and you want to read in tokens from it (e.g. ints and doubles and booleans and strings), you would create a scanner for it like this:
Scanner sc = new Scanner(new InputStreamReader(new FileInputStream("data.txt")));
                                               \_____________________________/
                                                 an InputStream whose bytes
                                                 come from file data.txt
                         \____________________________________________________/
                          a Reader whose chars come from the bytes in data.txt
             \_________________________________________________________________/   
              a Scanner whose tokens are made up of chars whose bytes come from data.txt
There are some shortcuts to all of this: so-called "convenience" classes and constructors that make, for example, a Reader directly from a file name (the FileReader class does this).
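The sketch below builds a Scanner on a file the long way and then with two shortcuts, FileReader and the Scanner constructor that takes a File. The class name ScannerSources, the method sumInts as written here, and the contents of the temporary file are all made up for the illustration; the sketch assumes non-integer tokens should simply be skipped.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

public class ScannerSources {
    // Adds up every integer token, skipping tokens that are not integers.
    static int sumInts(Scanner sc) {
        int sum = 0;
        while (sc.hasNext()) {
            if (sc.hasNextInt()) sum += sc.nextInt();
            else sc.next();
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        // Create a small temporary file so the sketch runs on its own.
        Path path = Files.createTempFile("data", ".txt");
        Files.write(path, "10 apples and 32 pears\n".getBytes(StandardCharsets.US_ASCII));
        String name = path.toString();

        // The long way: FileInputStream -> InputStreamReader -> Scanner
        try (Scanner sc = new Scanner(new InputStreamReader(new FileInputStream(name)))) {
            System.out.println(sumInts(sc));
        }

        // Shortcut 1: FileReader is a Reader made directly from a file name
        try (Scanner sc = new Scanner(new FileReader(name))) {
            System.out.println(sumInts(sc));
        }

        // Shortcut 2: Scanner has a constructor that takes a File
        try (Scanner sc = new Scanner(new File(name))) {
            System.out.println(sumInts(sc));
        }

        // The same method also works when the source is just a String
        System.out.println(sumInts(new Scanner("1 2 3")));

        Files.delete(path);
    }
}
```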

The following program illustrates what this flexibility buys you. It defines three methods, findELF, countNonAlpha and sumInts, that process inputs as streams of bytes, chars, and tokens, respectively. What the program shows is how flexibly each method can be called on a variety of different input sources. For example, sumInts can be called with the ultimate source of data being a file, a byte array, a string or a character array ... and, of course, we could have called it on stdin as well! The countNonAlpha method provides an interesting example. To highlight the difference between bytes and chars in Java, try the input file

in1 ← save this, don't view in the browser

This file is interesting because it contains a non-ASCII Unicode character (a heart). The result is that it is a seven-byte file that contains only four characters.
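You can see the same effect without a file. The sketch below uses a made-up three-character string containing a heart (U+2665), not the contents of in1, and counts it once as bytes and once as chars, assuming the bytes are UTF-8. The class and method names are hypothetical. It also uses the bulk read methods, whose return value says how many bytes or chars were actually delivered.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BytesVsChars {
    // Count bytes using the bulk read; it returns how many bytes it
    // delivered, or -1 at the end of the input.
    static int countBytes(InputStream in) throws IOException {
        byte[] buf = new byte[64];
        int total = 0;
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) total += n;
        return total;
    }

    // Same idea for chars.
    static int countChars(Reader in) throws IOException {
        char[] buf = new char[64];
        int total = 0;
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) total += n;
        return total;
    }

    // Returns {number of bytes, number of chars} for the UTF-8 form of text.
    static int[] utf8Counts(String text) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        try {
            int bytes = countBytes(new ByteArrayInputStream(data));
            int chars = countChars(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
            return new int[] {bytes, chars};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = utf8Counts("I\u2665U"); // \u2665 is a heart
        System.out.println(counts[0] + " bytes, " + counts[1] + " chars");
        // prints "5 bytes, 3 chars": the heart takes three bytes in UTF-8
    }
}
```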

Flexibility to add new operations, improve efficiency, or modify input streams on the fly

Finally we get to the third and last of our design goals: the flexibility to add new operations, improve efficiency, or modify input streams on the fly. When we want to modify or extend functionality in OOP, what do we always do? We use inheritance. I'll give you two examples of where this is done in the Java API, one to modify behavior and one to add functionality.

The first is the class BufferedReader. The issue BufferedReader addresses is this: when a call to read() is made for a Reader that has, for example, a file as its ultimate source for data, that call results at some lower level in a system call to fetch that byte. At this low level, however, fetching a byte at a time is tremendously inefficient. It typically takes as much time to fetch something like 1024 or 2048 bytes as it does a single byte. Therefore, it would be nice to have a variant of Reader that would fetch, say, 1024 bytes into a buffer the first time read() is called, then dole those out one at a time for each read() call until the buffer is emptied. Only then would it go back to fetch more bytes from the lower level, i.e. another chunk of 1024. That's what the class BufferedReader does. What's kind of funny is that it does it as a wrapper around another Reader. In other words, BufferedReader is a Reader that takes a Reader and wraps it in this buffering scheme. So for example, if you had a file "data.txt" to read tokens (e.g. integers) from, and you were worried about performance, you might create your Scanner like this:

Scanner sc1 = new Scanner(new BufferedReader(new InputStreamReader(new FileInputStream("data.txt"))));
The BufferedReader will make calls like read(buff,0,1024) to its underlying InputStreamReader, which will make a call like read(buff,0,1024) to its underlying FileInputStream, which will result in a lower-level system call to fetch the next 1024 bytes from the file. The object oriented design of Java's I/O package makes this possible. By deriving BufferedReader from Reader, the Java authors provide modified functionality that can be used anywhere a regular Reader can be used.
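To watch the buffering happen, we can put a small counting wrapper between the BufferedReader and its source. CountingReader below is a hypothetical helper written just for this sketch (it is not part of the Java API), and an in-memory CharArrayReader stands in for the "slow" lower-level source. The sketch reads the same data one char at a time, with and without a BufferedReader, and reports how often the underlying source was asked for data.

```java
import java.io.BufferedReader;
import java.io.CharArrayReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class BufferingDemo {
    // Hypothetical helper: a Reader that counts how often it is asked for data.
    static class CountingReader extends Reader {
        private final Reader source;
        int calls = 0;

        CountingReader(Reader source) { this.source = source; }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            calls++;
            return source.read(buf, off, len);
        }

        @Override
        public void close() throws IOException { source.close(); }
    }

    // Reads nChars chars one at a time and returns the number of requests
    // that reached the underlying (counted) source.
    static int callsNeeded(boolean buffered, int nChars) {
        char[] data = new char[nChars];
        Arrays.fill(data, 'x');
        CountingReader counter = new CountingReader(new CharArrayReader(data));
        // A BufferedReader is itself a Reader, so it can stand in for the plain one.
        Reader r = buffered ? new BufferedReader(counter) : counter;
        try {
            while (r.read() != -1) {
                // consume one char at a time
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.calls;
    }

    public static void main(String[] args) {
        System.out.println("unbuffered: " + callsNeeded(false, 5000) + " calls to the source");
        System.out.println("buffered:   " + callsNeeded(true, 5000) + " calls to the source");
    }
}
```

The loop that consumes the data is identical in both cases. Only the construction of the Reader changes, which is exactly the point of doing this with inheritance.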

The second example to look at is the class LineNumberReader, which is much easier to explain. Sometimes you want to be able to ask what line you're on as you read input. That's an extra piece of functionality you might wish that a Reader had. The class LineNumberReader extends BufferedReader to provide just that one extra piece of functionality. So now we could redo our Scanner definition like this:

LineNumberReader r;
Scanner sc2 = new Scanner(r = new LineNumberReader(new InputStreamReader(new FileInputStream("data.txt"))));
... and whenever you want to know what line number you're on you can call r.getLineNumber(). Once again, the object oriented design of Java's I/O package makes this possible. By deriving LineNumberReader from BufferedReader which is derived from Reader, the Java authors provide new functionality that can be used anywhere a regular Reader can be used.
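Here is a small sketch of getLineNumber() in action. It reads lines directly from the LineNumberReader with readLine() instead of going through a Scanner, because a Scanner may read ahead of the token it has just handed you, which makes the reported line number harder to interpret. The class name LineNumberDemo and the method lineOf are made up for this example.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class LineNumberDemo {
    // Returns the (1-based) number of the first line containing wanted,
    // or -1 if no line contains it.
    static int lineOf(String text, String wanted) {
        try (LineNumberReader r = new LineNumberReader(new StringReader(text))) {
            String line;
            while ((line = r.readLine()) != null) {
                // The count starts at 0 and goes up by one for each line read,
                // so after reading the first line getLineNumber() returns 1.
                if (line.contains(wanted)) return r.getLineNumber();
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "alpha\nbeta\ngamma\n";
        System.out.println(lineOf(text, "gamma")); // prints 3
        System.out.println(lineOf(text, "delta")); // prints -1
    }
}
```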

And then ...

And then we should probably look at the Errors and Exceptions that all these methods from all these classes throw ... but we won't. You can do that on your own!

A short note on output

Output is fundamentally a bit easier than input. Why? Because with output your code knows what it wants to write, so it controls the outgoing bytes. With input, your code doesn't know what's coming. It must react and adapt to the incoming bytes. So we're not going to describe output in much detail.

Similar to the input case, we have two separate hierarchies for output: the hierarchy rooted at OutputStream, which is for byte-oriented output, and the hierarchy rooted at Writer, which is for character-oriented output. The distinction is a bit blurrier than for the input case, because the class PrintStream, which is derived from OutputStream, provides methods for writing ints, doubles, Strings, etc., as does PrintWriter, which is derived from Writer. The distinction has to do with how characters are encoded as bytes: by default, PrintStream uses the JVM's default encoding, while PrintWriter, which sits on top of a Writer, lets the programmer specify that encoding independently. These are distinctions we won't go into here. Note, however, that System.out and System.err are both PrintStream objects.
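As a small illustration, the sketch below prints through System.out (a PrintStream) and also builds a PrintWriter on top of an OutputStreamWriter with an explicitly chosen encoding, then counts the bytes that actually come out. The class name OutputDemo and the method bytesWritten are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintStream;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OutputDemo {
    // Writes text through a PrintWriter using the given encoding and
    // returns how many bytes ended up in the underlying byte stream.
    static int bytesWritten(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, charset));
        out.print(text);
        out.flush(); // push any buffered chars through the encoder into sink
        return sink.size();
    }

    public static void main(String[] args) {
        // System.out is a PrintStream: bytes underneath, print methods on top.
        PrintStream ps = System.out;
        ps.println(42);
        ps.println(3.5);
        ps.println("abc");

        // Three ASCII chars become three bytes in UTF-8 ...
        ps.println(bytesWritten("abc", StandardCharsets.UTF_8));
        // ... and so does a single heart character.
        ps.println(bytesWritten("\u2665", StandardCharsets.UTF_8));
    }
}
```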