Homework

Your homework is to finish the Problems below. AND, you must look at this comic ... and this one too.

Reading from stdin a line at a time

For our exercises today we will want to read from stdin a line at a time. This way we can read from the terminal or, using "<" for bash input redirection, we can read from a file. Here's a short program that illustrates how we can read through stdin a line at a time:

Run of program (ex0.py on input0.txt):
$ ./ex0.py < input0.txt 
There were 2 blank lines!
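
A minimal sketch of a program along these lines, assuming (from the run above) that it counts blank lines. The helper name count_blank_lines is made up for illustration, and treating whitespace-only lines as blank is an assumption too.

```python
#!/usr/bin/env python3
import sys


def count_blank_lines(lines):
    """Return how many of the given lines are empty apart from whitespace."""
    count = 0
    for line in lines:
        # Each line still ends with its newline, so strip before testing.
        if line.strip() == "":
            count += 1
    return count


if __name__ == "__main__":
    # Iterating over sys.stdin yields one line at a time until end of input.
    print("There were", count_blank_lines(sys.stdin), "blank lines!")
```

Because the loop only needs something that yields lines, the same function works whether stdin is the terminal or a file redirected with "<".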

TODO!
  1. Copy this program and try running it!

A simple Python program using regular expressions

Python's regular expression facilities are in the re module, which you need to "import" before using. The re module has a number of useful functions. We will start with re.search(regex,str), which searches the input string str for a substring matching the regular expression regex. What follows is a simple Python program that uses regular expressions to determine whether a line contains either b0b or b1b.

Run of program (ex1.py):
$ ./ex1.py 
Blah,b0bfa 
MATCH!
Foob10b0x 
NO MATCH!
how about b 1 b ?    
NO MATCH!
howaboutb1b? 
MATCH!

Note1: A Python string prefaced with r is a "raw string", meaning that there are no "escapes" or anything. So 'x\ny' is an x followed by a newline followed by y, but r'x\ny' is an x followed by a backslash followed by n followed by y. It's easiest to write your regular expressions with raw strings for reasons that will become clear later.
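
The difference is easy to check for yourself. This small sketch just compares the two strings from Note1; the variable names are only for illustration.

```python
plain = 'x\ny'   # \n is an escape here: x, newline, y
raw = r'x\ny'    # raw string: x, backslash, n, y

print(len(plain))        # 3
print(len(raw))          # 4
print(raw == 'x\\ny')    # True: the raw string holds a literal backslash
```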

Note2: If re.search( ) successfully finds a match, it returns a "match object". We'll talk about that later. If it doesn't find a match, re.search( ) returns None, which is Python's version of "null". And if you use a reference in a boolean context, like an if-condition, None is false and anything non-None is true. That's why that if-statement in the program works!
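
Here is a minimal sketch of a program along these lines, assuming the regular expression b(0|1)b and the variable name mres used later in these notes. The function name classify is made up for illustration.

```python
#!/usr/bin/env python3
import re
import sys


def classify(line):
    """Return "MATCH!" if line contains b0b or b1b, else "NO MATCH!"."""
    mres = re.search(r'b(0|1)b', line)
    # mres is either a match object (true) or None (false).
    if mres:
        return "MATCH!"
    return "NO MATCH!"


if __name__ == "__main__":
    for line in sys.stdin:
        print(classify(line))
```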

TODO!
  1. Copy this program and run it a few times to see how it works.
  2. Is the line "axorq8,e3k0ka1b0bkca4bb2b" a match? Why/Why not?
  3. Modify this program so that instead of a 0 or 1 in between the b's you can have any digit from 0 up to 9.

Finding out what matched

Suppose that in the above example we wanted not only to know whether there was a match with b(0|1)b, but also to know what matched the (0|1) part -- i.e. did we get a zero or a one in there? The "match object", i.e. mres in our program, is like an array storing the string that matched a parenthesized sub-part of a regular expression at index i, where the left parenthesis around the subpart is the ith left parenthesis of the whole regular expression. In the above, there is only one open paren, so mres[1] is the string that matched the (0|1).

Run of program (ex2.py):
$ ./ex2.py 
theb0bdog 
MATCH! Bit was 0
likes dog b1scuits 
NO MATCH!
before b1bbles 
MATCH! Bit was 1
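
A minimal sketch of how a program could produce output like this, assuming the same b(0|1)b expression as before. The function name describe is hypothetical; indexing a match object with mres[1] needs Python 3.6 or later (mres.group(1) is the older spelling).

```python
#!/usr/bin/env python3
import re
import sys


def describe(line):
    """Report whether line contains b0b or b1b, and which bit it was."""
    mres = re.search(r'b(0|1)b', line)
    if mres:
        # Index 1 is whatever matched the first (and only) parenthesized part.
        return "MATCH! Bit was " + mres[1]
    return "NO MATCH!"


if __name__ == "__main__":
    for line in sys.stdin:
        print(describe(line))
```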

TODO!
  1. Copy this program and run it a few times to see how it works.
  2. What do you get for "and another thing b3b is that", and what do you get for "a0abcd3c1cb1bxac"?

Finding out what matched: Part II

When you have a complicated regular expression, with lots of things in ( )s you have to count opening parentheses to determine what index i to use in mres[i]. In the following example, we have two parenthesized subexpressions, each matching a different part of the string.

Run of program (ex3.py):
$ ./ex3.py 
we match <26A> of course 
MATCH! Bit was (26,A)
but not <25AB>, naturally 
NO MATCH!   
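
A minimal sketch with two parenthesized subexpressions that behaves like the run above. The regular expression <(\d+)([A-Z])> is a guess made from that run, not necessarily the one ex3.py uses, and the function name describe_pair is made up for illustration.

```python
#!/usr/bin/env python3
import re
import sys


def describe_pair(line):
    """Look for <digits followed by one capital letter> and report both parts."""
    # Assumed pattern: the first ( is group 1 (digits), the second is group 2.
    mres = re.search(r'<(\d+)([A-Z])>', line)
    if mres:
        return "MATCH! Bit was (" + mres[1] + "," + mres[2] + ")"
    return "NO MATCH!"


if __name__ == "__main__":
    for line in sys.stdin:
        print(describe_pair(line))
```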

Copy this program and try it on input like this:

The secret key <456 B> is
hidden <B678> in <007B> this 
document somewhere 56A but you'll never
find <22B it.

Sometimes parens are nested and things get complicated. For example, consider the expression E = x(letter(a|b)+|number(0|1)+)y. This is matched by strings like xletteraaby and xnumber1010y. In the first case, mres[1] = "letteraab", mres[2] = "b", and mres[3] = None. In the second case, mres[1] = "number1010", mres[2] = None, and mres[3] = "0". A group inside a repetition keeps only what its last repetition matched, and a group that took no part in the match gives None. Do you see why?

Note: It's okay to put redundant parentheses around parts of your regular expression just so you can use a mres[i] to refer back to whatever matched them.
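
When the counting gets confusing, the quickest check is to print the groups. This sketch uses the expression E from above, plus a variant (called E2 here, a name made up for illustration) with redundant parentheses around each repetition so that the whole run of letters or bits can be referred to.

```python
import re

E = r'x(letter(a|b)+|number(0|1)+)y'

for s in ["xletteraaby", "xnumber1010y"]:
    mres = re.search(E, s)
    print(s, "->", mres[1], mres[2], mres[3])
# xletteraaby -> letteraab b None
# xnumber1010y -> number1010 None 0

# Redundant parentheses: group 2 now wraps the whole (a|b)+ repetition.
E2 = r'x(letter((a|b)+)|number((0|1)+))y'
mres = re.search(E2, "xletteraaby")
print(mres[2])   # aab
```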

Python Regular Expression Extras

Problem 1

Write a program that will print out all the "SI" computer science class numbers appearing in the file (assume no more than one per line). In other words, we want all class numbers that are "SI" followed by three digits (note: a \d matches a single digit), possibly followed by one capital letter. For example, SI204, SI486A and SI311 are all "SI" computer science courses. On the other hand, IS221, SI32, and SI496r are not. If you give your program the input file
This semester SI433 is running for
the last time take it while you can.  
EM300 is running as always, but don't
try taking it out of sequence.  We
got two OSI2 people, one teaching
SI204 (or is it IC210?), the other 
teaching IT310 as well as SI283.  I 
don't think there is a SI496A running 
this Spring. The SI 2008 course 
offerings look like they're nailed 
down, though.
You should get output:
433
204
283
496A

Problem 2

Write a program that will print out whatever part of a line falls between "<b>" and "</b>". For example, given the input
There <b>is</b> a time to
play and there <b>is not</b>
a reason to panic!
your program should produce the output
is
is not
	

Problem 3

Try running your Problem 2 solution on the input:
Welcome to <b>Joes</b> home of <b>FREE REFILLS</b>!!!
What you'll probably get is:
Joes</b> home of <b>FREE REFILLS
What happened? Well, after finding the first <b> it decided to match with the last </b>. By default, when you use a * or a + Python will try to make the match as long as possible. Here we want it to make the match as short as possible. If you use *? or +? in place of * and + you get shortest possible matches instead. Modify your Problem 2 solution so that you get
Joes
on the above input. In other words, you get the first occurrence of something delimited by <b> and </b>.
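
To see the difference between longest and shortest matches on something other than the problem input, here is a small sketch using double quotes as delimiters. The sample line and variable names are made up for illustration.

```python
import re

line = 'say "hi" and then "bye"'

greedy = re.search(r'"(.*)"', line)    # * grabs as much as it can
lazy = re.search(r'"(.*?)"', line)     # *? grabs as little as it can

print(greedy[1])   # hi" and then "bye
print(lazy[1])     # hi
```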

Problem 4

A lot of times you can afford to be sloppy by writing a regular expression that accepts more than you really want, as long as the extra stuff probably won't show up in the inputs you give your program, and especially if there's an easy check that a human can ultimately give. Consider my favorite: Meeting times of courses. Meeting times look like a string of at most 5 days of the week (M,T,W,R,F) followed by a string of at most 2 class periods (1,2,3,4,5,6,8,9,10) -- e.g. "MWF3" or "R34". While it's true that there shouldn't be duplicates of periods or days, what're the odds it'll come up in your input? Be lazy! Download the file ugly like this:
curl https://www.usna.edu/Users/cs/wcbrown/courses/F11SI340/classes/L17/input1 > ugly
It is HTML source code that has a boatload of meeting times in it. Write a program that will print out the meeting times it contains. If the script prints out a couple of extras ... so be it. If your program is lab.py, run it like this:
./lab.py < ugly
That'll save you copying and pasting this big ugly file! Hint: If you want between m and n occurrences of something in a regular expression, {m,n} does the trick. It's like a limited * or +. So, for example, if you wanted a number between 10 and 9999 you might use the regular expression [1-9]\d{1,3}.
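
A quick sketch to try out the {m,n} example from the hint. It uses re.fullmatch, which insists that the whole string match; re.search would instead be happy to find "1234" inside "12345". The test strings are made up for illustration.

```python
import re

pattern = r'[1-9]\d{1,3}'   # one nonzero digit, then 1 to 3 more digits

for s in ["7", "10", "042", "9999", "12345"]:
    print(s, bool(re.fullmatch(pattern, s)))
# 7 False
# 10 True
# 042 False
# 9999 True
# 12345 False
```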

A Cool Regex Program

Check out this very cool program. It scans lines to determine whether a phone number (with area code) is there. It finds the number and prints out its three components. It's able to deal with all sorts of formats for writing phone numbers!

      (1|1\-|1\s)?               # Possible 1, 1- or 1 followed by a space
      ((\d\d\d)|\((\d\d\d)\))    # Area code (possibly in parens)
      (\-|\s)?                   # Space or dash or nothing
      (\d\d\d)                   # Block one
      (\-|\s)?                   # Space or dash or nothing
      (\d\d\d\d)                 # Block two
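
A regular expression laid out with spacing and # comments like this can be used directly in Python with the re.VERBOSE flag. What follows is a minimal sketch of how such a program might use it, not the program itself: the names PHONE and phone_parts are made up for illustration, and the optional prefix is written as 1\s (a 1 followed by whitespace), which is an assumption about the intent.

```python
import re

# re.VERBOSE lets the pattern contain whitespace and # comments.
PHONE = re.compile(r'''
    (1|1\-|1\s)?               # Possible 1, 1- or 1 plus whitespace (assumed)
    ((\d\d\d)|\((\d\d\d)\))    # Area code (possibly in parens)
    (\-|\s)?                   # Space or dash or nothing
    (\d\d\d)                   # Block one
    (\-|\s)?                   # Space or dash or nothing
    (\d\d\d\d)                 # Block two
''', re.VERBOSE)


def phone_parts(line):
    """Return (area code, block one, block two), or None if no number is found."""
    m = PHONE.search(line)
    if m is None:
        return None
    # Counting left parens: group 3 is a bare area code, group 4 one in parens.
    # Exactly one of the two took part in the match; the other is None.
    area = m[3] or m[4]
    return (area, m[6], m[8])


print(phone_parts("call 1-(410) 293-1234 today"))
print(phone_parts("no number here"))
```

Counting opening parentheses matters here: the separators are groups 5 and 7, so the two blocks end up in groups 6 and 8.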

Try downloading and running this program with inputs like: ... or anything else you can come up with.

Christopher W Brown