SI342: Regular Expressions in Python

Reading from stdin a line at a time

For our exercises today we will want to read from stdin a line at a time. This way we can read from the terminal or, using "<" for bash input redirection, we can read from a file. Here's a short program that illustrates how we can read through stdin a line at a time:

ex0.py	input0.txt	run of program
		$ ./ex0.py < input0.txt
		There were 2 blank lines!
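
A minimal sketch of a program in this style, assuming (from the sample run above) that it counts blank lines. The function name count_blank_lines is an illustrative choice, not necessarily what ex0.py uses.

```python
#!/usr/bin/env python3
import sys

def count_blank_lines(lines):
    """Count the lines that are empty once surrounding whitespace is stripped."""
    count = 0
    for line in lines:          # iterating over a file object yields one line at a time
        if line.strip() == "":  # each line still carries its trailing newline
            count += 1
    return count

if __name__ == "__main__":
    # sys.stdin behaves like an open file, so "<" redirection works unchanged
    print("There were %d blank lines!" % count_blank_lines(sys.stdin))
```

Because the loop accepts any iterable of lines, the same function works on sys.stdin, on an open file, or on a plain list of strings.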

TODO!

Copy this program and try running it!

A simple Python program using regular expressions

Python's regular expression facilities are in the re module, which you need to "import" before using. The re module has a number of useful functions. We will start with re.search(regex,str), which searches the input string str for a substring matching the regular expression regex. What follows is a simple Python program that uses regular expressions to determine whether a line contains either b0b or b1b.

ex1.py	run of program
	$ ./ex1.py
	Blah,b0bfa
	MATCH!
	Foob10b0x
	NO MATCH!
	how about b 1 b ?
	NO MATCH!
	howaboutb1b?
	MATCH!
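
A sketch of what a program like ex1.py might look like, assuming the regular expression b(0|1)b and the variable name mres. The helper classify and the list of sample lines are illustrative; the sketch runs on fixed samples rather than on stdin.

```python
import re

def classify(line):
    """Return "MATCH!" if line contains b0b or b1b, else "NO MATCH!"."""
    mres = re.search(r'b(0|1)b', line)  # a match object, or None
    if mres:                            # None is false, a match object is true
        return "MATCH!"
    return "NO MATCH!"

# The sample lines from the run shown above
for sample in ["Blah,b0bfa", "Foob10b0x", "how about b 1 b ?", "howaboutb1b?"]:
    print(sample)
    print(classify(sample))
```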

Note1: A Python string prefaced with r is a "raw string", meaning that a backslash is kept as an ordinary character instead of starting an "escape". So 'x\ny' is an x followed by a newline followed by y, but r'x\ny' is an x followed by a backslash followed by n followed by y. It's easiest to write your regular expressions with raw strings for reasons that will become clear later.
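
A quick way to see the difference for yourself; the variable names plain and raw are just for illustration.

```python
plain = 'x\ny'   # \n is an escape: x, newline, y
raw = r'x\ny'    # raw string: x, backslash, n, y

print(len(plain))  # 3
print(len(raw))    # 4
print(list(raw))   # ['x', '\\', 'n', 'y']
```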

Note2: If re.search( ) successfully finds a match, it returns a "match object". We'll talk about that later. If it doesn't find a match, re.search( ) returns None, which is Python's version of "null". And if you use a reference in a boolean context, like an if-condition, None is false and a match object is true. That's why the if-statement in the program works!

TODO!

Copy this program and run it a few times to see how it works.
Is the line "axorq8,e3k0ka1b0bkca4bb2b" a match? Why/Why not?
Modify this program so that instead of a 0 or 1 in between the b's you can have any digit from 0 up to 9.

Finding out what matched

Suppose that in the above example we wanted not only to know whether there was a match with b(0|1)b, but also to know what matched the (0|1) part -- i.e. did we get a zero or a one in there? The "match object", i.e. mres in our program, is like an array storing the string that matched a parenthesized sub-part of a regular expression at index i, where the left parenthesis around the subpart is the ith left parenthesis of the whole regular expression. In the above, there is only one open paren, so mres[1] is the string that matched the (0|1).

ex2.py	run of program
	$ ./ex2.py
	theb0bdog
	MATCH! Bit was 0
	likes dog b1scuits
	NO MATCH!
	before b1bbles
	MATCH! Bit was 1
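
A sketch of a program in the spirit of ex2.py, assuming the same regular expression b(0|1)b as before. The helper name report is illustrative. Indexing a match object as mres[1] needs Python 3.6 or later; mres.group(1) is the equivalent older spelling.

```python
import re

def report(line):
    """Say whether line contains b0b or b1b and, if so, which bit was matched."""
    mres = re.search(r'b(0|1)b', line)
    if mres:
        # mres[1] is whatever matched the first (and only) parenthesized part
        return "MATCH! Bit was " + mres[1]
    return "NO MATCH!"

for sample in ["theb0bdog", "likes dog b1scuits", "before b1bbles"]:
    print(sample)
    print(report(sample))
```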

TODO!

Copy this program and run it a few times to see how it works.
What do you get for "and another thing b3b is that", and what do you get for "a0abcd3c1cb1bxac"?

Finding out what matched: Part II

When you have a complicated regular expression with lots of things in ( )s, you have to count opening parentheses to determine which index i to use in mres[i]. In the following example, we have two parenthesized subexpressions, each matching a different part of the string.

ex3.py	run of program
	$ ./ex3.py
	we match <26A> of course
	MATCH! Bit was (26,A)
	but not <25AB>, naturally
	NO MATCH!
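
A sketch of a two-group program consistent with the run above. The regular expression <(\d+)([A-Z])> is an assumption chosen to reproduce that output (digits, then exactly one capital letter, inside angle brackets); the real ex3.py may use a different one. The helper name describe is illustrative.

```python
import re

def describe(line):
    """Report the number and the letter found inside <...>, if any."""
    # Group 1 starts at the first "(", group 2 at the second "("
    mres = re.search(r'<(\d+)([A-Z])>', line)
    if mres:
        return "MATCH! Bit was (" + mres[1] + "," + mres[2] + ")"
    return "NO MATCH!"

for sample in ["we match <26A> of course", "but not <25AB>, naturally"]:
    print(sample)
    print(describe(sample))
```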

Copy this program and try it on input like this:

The secret key <456 B> is
hidden <B678> in <007B> this 
document somewhere 56A but you'll never
find <22B it.

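Sometimes parens are nested and things get complicated. For example, consider the expression E = x(letter(a|b)+|number(0|1)+)y. This is matched by strings like xletteraaby and xnumber1010y. Groups are still numbered by their opening parentheses, but two details matter here: a group that is repeated by + keeps only its last repetition, and a group that took no part in the match is None. So in the first case, mres[1] = "letteraab", mres[2] = "b", and mres[3] = None. In the second case, mres[1] = "number1010", mres[2] = None, and mres[3] = "0". Do you see why?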
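
You can check how Python fills in nested groups for E directly. In this small experiment, note that a group repeated by + holds only its last repetition and that a group which did not take part in the match is None. The helper name groups_of is illustrative.

```python
import re

E = r'x(letter(a|b)+|number(0|1)+)y'

def groups_of(s):
    """Return (mres[1], mres[2], mres[3]) for a string that matches E."""
    mres = re.search(E, s)
    return (mres[1], mres[2], mres[3])

print(groups_of("xletteraaby"))   # ('letteraab', 'b', None)
print(groups_of("xnumber1010y"))  # ('number1010', None, '0')
```

If you want everything the repetition matched, wrap the repetition itself in redundant parentheses, as in ((a|b)+); the outer group then captures the whole run.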

Note: It's okay to put redundant parentheses around parts of your regular expression just so you can use mres[i] to refer back to whatever matched them.

Python Regular Expression Extras

\d this matches a digit, so the regular expression r'(a|b)\d' matches a0,b0,a1,b1,a2,b2,...,a9,b9.
? this matches 0 or 1 occurrence of whatever preceded it. So, for example, the regular expression r'1?\d' matches all the numbers from zero up to 19.
[x-y] Ranges (as for egrep). For example, the regular expression r'[A-D]|F' matches the valid letter grades, A, B, C, D and F.
. (dot) this matches any character. So, for example, if you want to match whatever comes between < and >, you'd use the regular expression r'<(.*)>' and mres[1] would give you whatever came between the < >s.
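
The extras above can be tried out with a few one-liners. This sketch uses re.fullmatch, which succeeds only if the whole string matches; the sample strings are made up for illustration.

```python
import re

# \d : a single digit
print(bool(re.fullmatch(r'(a|b)\d', 'b7')))    # True
print(bool(re.fullmatch(r'(a|b)\d', 'c7')))    # False

# ? : zero or one occurrence of what precedes it
print(bool(re.fullmatch(r'1?\d', '19')))       # True
print(bool(re.fullmatch(r'1?\d', '20')))       # False

# [x-y] : a range of characters
print(bool(re.fullmatch(r'[A-D]|F', 'C')))     # True
print(bool(re.fullmatch(r'[A-D]|F', 'E')))     # False

# . : any character; the group captures what sits between < and >
mres = re.search(r'<(.*)>', 'see <this part> here')
print(mres[1])                                 # this part
```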

Problem 1

Write a program that will print out all the "SI" computer science class numbers appearing in the file (assume no more than one per line). In other words, we want all class numbers that are "SI" followed by three digits (note: a \d matches a single digit), possibly followed by one capital letter. For example, SI204, SI486A and SI311 are all "SI" computer science courses. On the other hand, IS221, SI32, and SI496r are not. If you give your program the input file

This semester SI433 is running for
the last time take it while you can.  
EM300 is running as always, but don't
try taking it out of sequence.  We
got two OSI2 people, one teaching
SI204 (or is it IC210?), the other 
teaching IT310 as well as SI283.  I 
don't think there is a SI496A running 
this Spring. The SI 2008 course 
offerings look like they're nailed 
down, though.

You should get output:

Problem 3

Try running your Problem 2 solution on the input:

Welcome to <b>Joes</b> home of <b>FREE REFILLS</b>!!!

What you'll probably get is:

Joes</b> home of <b>FREE REFILLS

What happened? Well, after finding the first <b> it decided to match with the last </b>. By default, when you use a * or a +, Python will try to make the match as long as possible. Here we want it to make the match as short as possible. If you use *? or +? in place of * and + you get shortest possible matches instead. Modify your Problem 2 solution so that you get

Joes

on the above input. In other words, you get the first occurrence of something delimited by <b> and </b>.
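
To see longest versus shortest matching on something other than the problem itself, here is a small experiment with text between double quotes. The sample string and the variable names are made up for illustration.

```python
import re

s = 'say "hi" and "bye"'

greedy = re.search(r'"(.*)"', s)    # * grabs as much as possible
lazy = re.search(r'"(.*?)"', s)     # *? grabs as little as possible

print(greedy[1])   # hi" and "bye
print(lazy[1])     # hi
```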

Problem 4

A lot of times you can afford to be sloppy by writing a regular expression that accepts more than you really want, as long as the extra stuff probably won't show up in the inputs you give your program, and especially if there's an easy check that a human can ultimately give. Consider my favorite: meeting times of courses. Meeting times look like a string of at most 5 days of the week (M,T,W,R,F) followed by a string of at most 2 class periods (1,2,3,4,5,6,8,9,10) -- e.g. "MWF3" or "R34". While it's true that there shouldn't be duplicates of periods or days, what're the odds it'll come up in your input? Be lazy! Download the file ugly like this:

curl https://www.usna.edu/Users/cs/wcbrown/courses/F11SI340/classes/L17/input1 > ugly

It is HTML source code that has a boatload of meeting times in it. Write a program that will print out the meeting times it contains. If the script prints out a couple of extras ... so be it. If your program is lab.py, run it like this:

./lab.py < ugly

That'll save you copying and pasting this big ugly file! Hint: If you want between m and n occurrences of something in a regular expression, {m,n} does the trick. It's like a limited * or +. So, for example, if you wanted a number between 10 and 9999 you might use the regular expression [1-9]\d{1,3}.
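
A quick check of the {m,n} example from the hint, using re.fullmatch so that the whole string has to match. The helper name in_range is illustrative.

```python
import re

def in_range(s):
    """True if s is a number from 10 to 9999 written without a leading zero."""
    return re.fullmatch(r'[1-9]\d{1,3}', s) is not None

for s in ["9", "10", "427", "9999", "10000", "0123"]:
    print(s, in_range(s))
```

Only 10, 427 and 9999 pass: 9 has too few digits, 10000 has too many, and 0123 starts with a zero.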

A Cool Regex Program

Check out this very cool program. It scans lines to determine whether a phone number (with area code) is there. It finds the number and prints out its three components. It's able to deal with all sorts of formats for writing phone numbers!

      (1|1\-|1\w)?               # Possible 1 or 1-
      ((\d\d\d)|\((\d\d\d)\))    # Area code (possibly in parens)
      (\-|\s)?                   # Space or dash or nothing
      (\d\d\d)                   # Block one
      (\-|\s)?                   # Space or dash or nothing
      (\d\d\d\d)                 # Block two
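
A sketch of how a commented pattern like this can be used from Python, assuming it is compiled with the re.VERBOSE flag (which makes the re module ignore whitespace and # comments inside the pattern). The pattern text is copied as shown above; the names PHONE and phone_parts are illustrative, not necessarily those of the actual program.

```python
import re

PHONE = re.compile(r'''
      (1|1\-|1\w)?               # Possible 1 or 1-
      ((\d\d\d)|\((\d\d\d)\))    # Area code (possibly in parens)
      (\-|\s)?                   # Space or dash or nothing
      (\d\d\d)                   # Block one
      (\-|\s)?                   # Space or dash or nothing
      (\d\d\d\d)                 # Block two
''', re.VERBOSE)

def phone_parts(line):
    """Return (area code, block one, block two), or None if no number is found."""
    mres = PHONE.search(line)
    if mres is None:
        return None
    # Counting opening parens: group 3 is a bare area code, group 4 one in parens.
    # Only one of the two takes part in any match; the other is None.
    area = mres[3] or mres[4]
    return (area, mres[6], mres[8])

for line in ["my number is +1(410)677-8220.",
             "don't call (410) 677 8220 please",
             "no number here"]:
    print(phone_parts(line))
```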

Try downloading and running this program with inputs like:

my number is +1(410)677-8220.
the number 4106778220 doesn't exist.
try dialing 410 677-8220 and see
don't call (410) 677 8220 please

... or anything else you can come up with.

Homework
