Text I/O vs. Binary I/O ... we're sticking to text!

In life, file I/O can be character-based or byte-based. In C this distinction doesn't really exist, because chars are bytes. But in Python, characters and bytes are different things. Like this:
>>> s = "y÷z" # this defines s as a string of characters
>>> len(s)
3
>>> t = s.encode("utf-8") # now we convert to a string of bytes
>>> t
b'y\xc3\xb7z'
>>> len(t)
4
So you need to make a decision when you open a file for reading or for writing: do you want to deal with it as text (which means character-based) or as binary (which means byte-based)? When you open a file with the open function, the second argument is the "mode", which is "r" or "w" for reading or writing as text, and "rb" or "wb" for reading or writing as binary:
fh = open("foo","r")  # open text stream for reading 
fh = open("foo","w")  # open text stream for writing
fh = open("foo","rb") # open binary stream for reading 
fh = open("foo","wb") # open binary stream for writing
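To see the two modes side by side, here is a small sketch that writes the string from the earlier example to a scratch file and reads it back both ways. The temporary path and the explicit encoding="utf-8" argument are additions for this illustration (without the encoding argument, Python falls back to a platform-dependent default).

```python
import os
import tempfile

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write three characters as text; encoding them as UTF-8 produces four bytes.
fh = open(path, "w", encoding="utf-8")
fh.write("y÷z")
fh.close()

# Text mode decodes the bytes back into characters.
fh = open(path, "r", encoding="utf-8")
as_text = fh.read()
fh.close()

# Binary mode hands back the raw bytes, undecoded.
fh = open(path, "rb")
as_bytes = fh.read()
fh.close()

print(len(as_text), len(as_bytes))  # 3 4
print(as_bytes)                     # b'y\xc3\xb7z'
```

Same file, same contents: the only difference is whether Python decodes the bytes into characters for you.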

Note: we're sticking with text-based I/O for this lesson!

The most basic pattern for reading through a file a line at a time

A very common task is reading through a file a line at a time. Text streams are iterable (meaning you can do a for-loop on them, or a list comprehension on them). When you iterate over them, each iteration gives you the next line, including the trailing \n. If s is a string, you can call s.strip() to strip the whitespace from either end of s (producing a new string, since strings are immutable!).
Note: for the rest of this section, we will be dealing with these two files:

For our first examples, we'll just look at the basic Python pattern for reading through a file a line at a time, and we'll apply it to print the contents of foo.txt, but indented by five spaces. Exciting, right?
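A minimal sketch of that pattern follows. Since the real contents of foo.txt are not shown here, the script first creates a stand-in file with made-up contents in a temporary directory; the list named indented is only there so the result is easy to inspect.

```python
import os
import tempfile

# Stand-in for foo.txt: the contents here are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")
fh = open(path, "w")
fh.write("first line\nsecond line\nthird line\n")
fh.close()

indented = []                      # collected only to make the result inspectable
fh = open(path, "r")
for line in fh:                    # each iteration yields the next line, "\n" included
    out = "     " + line.strip()   # strip() drops the newline and any edge whitespace
    indented.append(out)
    print(out)
fh.close()
```

Note that strip() also removes leading whitespace from each line; line.rstrip("\n") would remove only the trailing newline.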

Finally, since a text stream is iterable, we can do a list comprehension over it, right? So let's take the entire contents of the file and store it in a list of strings, where each list element is a line of the file (with newlines stripped out). Putting the lines into a list means we can do fun things, like print the lines out in reverse order.
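Here is one way that could look, again using a made-up stand-in file rather than the real foo.txt.

```python
import os
import tempfile

# Stand-in input file with made-up contents.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")
fh = open(path, "w")
fh.write("one\ntwo\nthree\n")
fh.close()

# The list comprehension iterates over the stream, one line per element.
fh = open(path, "r")
lines = [line.strip() for line in fh]
fh.close()

# With everything in a list, reverse order is easy.
for line in reversed(lines):
    print(line)
```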

ACTIVITY
  1. Write a Python script that takes two files as command-line arguments (you may assume they have the same number of lines, though it'd be nice if, ultimately, you could remove that requirement), and prints them out side-by-side.

Text output streams: the write method

If there are text input streams, then surely there must be text output streams, right? Of course! You can use open with mode "w" to get a text output stream attached to a file, as the example below shows. The main method for text output streams is write, which takes a string and writes it to the stream. The return value is the number of characters written.
>>> import io
>>> hout = open("tmp.txt","w")
>>> hout.write("This is\na test!\n")
16
>>> hout.close()
>>> hin = open("tmp.txt","r")
>>> hin.readline()
'This is\n'
>>> hin.readline()
'a test!\n'

Not all streams are files: the stdin and stdout streams

If you import the sys module, sys.stdin is an input text stream tied to standard in, and sys.stdout is an output text stream tied to standard out.
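Because these are ordinary text streams, a function written against "some input stream" and "some output stream" works with them unchanged. The sketch below uses a hypothetical helper, number_lines, and (so that it runs without waiting for keyboard input) an in-memory io.StringIO stream standing in for sys.stdin.

```python
import io
import sys

def number_lines(instream, outstream):
    """Copy instream to outstream, prefixing each line with its line number."""
    count = 0
    for line in instream:              # iterating works on any text input stream
        count += 1
        outstream.write(str(count) + ": " + line)
    return count

# In a real script the call would be number_lines(sys.stdin, sys.stdout).
# Here an in-memory text stream stands in for stdin so the example runs unattended.
fake_stdin = io.StringIO("alpha\nbeta\n")
number_lines(fake_stdin, sys.stdout)
```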

Text input streams: readline and read

In the examples from the previous section, we are using iterators (even if only behind the scenes) to read from the input stream. Input streams have methods that let us read directly from them: readline and read.

If fh is a text input stream, fh.readline() ... well, it reads the next line! That's up to and including the \n, by the way. In the script below, we use readline to skip the first line of foo.txt.
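A sketch of skipping the first line with readline, using a made-up stand-in for foo.txt:

```python
import os
import tempfile

# Stand-in for foo.txt with made-up contents.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")
fh = open(path, "w")
fh.write("HEADER\nbody one\nbody two\n")
fh.close()

fh = open(path, "r")
skipped = fh.readline()        # consume (and ignore) the first line
rest = []
for line in fh:                # iteration picks up where readline left off
    rest.append(line.strip())
    print(line.strip())
fh.close()
```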

If fh is a text input stream, fh.read(n) reads the next n characters of input (or as many as remain, if fewer than n characters are left) and returns them as a string. Important! if you are already at the end of file, an empty string is returned. The script below uses fh.read(1) to go through the input stream character-by-character.
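A sketch of the character-by-character loop, relying on the empty string that read returns at end of file. The input file and its contents are made up for the example.

```python
import os
import tempfile

# Stand-in input file with made-up contents.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")
fh = open(path, "w")
fh.write("hi\nyo\n")
fh.close()

chars = []
fh = open(path, "r")
c = fh.read(1)
while c != "":                 # the empty string signals end of file
    chars.append(c)
    print(repr(c))             # repr makes the newline characters visible
    c = fh.read(1)
fh.close()
print(len(chars), "characters read")
```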

Redacterator!

To the right is a cute little program that does "redacting", and is based on input/output text streams. The rules are:
  1. regular text is output unchanged
  2. text wrapped in < ... > is "redacted", meaning that each character inside is replaced with a box that covers up the actual text, excluding the <s and >s themselves
  3. the \ character acts as an escape, so whatever comes after it is written out unaltered, even if it is a < or >
  4. inside redacted text, a <, or an escaped <, > or \ character, is simply treated as a character to be redacted, not as a special marker

The code, by the way, is simply an implementation of the following transducer:
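For readers who want to experiment, a transducer of this kind can be sketched as a loop over fh.read(1) with two flags for state. This is one possible reading of the four rules, not the redactor.py used in this lesson: the function name redact, the choice of box character, and the decision to drop the <, > and \ markers from the output are all assumptions.

```python
import io

BOX = "\u2588"  # assumed box character; the lesson does not say which one is used

def redact(instream, outstream):
    """Copy instream to outstream, boxing out text between < and >."""
    redacting = False   # True while inside a < ... > block
    escaped = False     # True when the previous character was an escaping \
    c = instream.read(1)
    while c != "":
        if escaped:
            # Rule 3 outside a block, rule 4 inside: the character is ordinary.
            outstream.write(BOX if redacting else c)
            escaped = False
        elif c == "\\":
            escaped = True        # assumption: the \ itself produces no output
        elif not redacting and c == "<":
            redacting = True      # assumption: the opening < is dropped
        elif redacting and c == ">":
            redacting = False     # assumption: the closing > is dropped
        else:
            # Rule 1 outside a block; rule 2 inside (an unescaped < is boxed too).
            outstream.write(BOX if redacting else c)
        c = instream.read(1)

out = io.StringIO()
redact(io.StringIO("call <Bob> at \\<noon\\>"), out)
result = out.getvalue()   # "call ", then three boxes, then " at <noon>"
```

Because redact only needs an input stream and an output stream, it could just as well be handed open files or sys.stdin and sys.stdout.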

ACTIVITY
Use redactor.py without modification (this means your script will be a new file that just does an import to get access to my redactor code) to write a program that takes two command-line arguments: the first is the input file name, and the second is the output file name. If - is used for the first argument, use stdin. If - is used for the second argument, use stdout.

Going further: add a -s option that shows the redacted text, but without the <...>s that define redacted blocks or the escape characters. This will require making some changes to redactor.py.


Christopher W Brown