Class 19: Regular Expressions in Perl


Reading
None.
Homework
Your homework is to finish the Problems below. Turn in printouts of your 4 Perl scripts.

What is Perl?

Perl is an odd but very useful programming language. We would call it a "scripting" language because, instead of writing programs that get compiled into binaries, your source code is a "script" that gets sent to a Perl interpreter to run. It's great for processing text, which makes it very useful to system administrators and web developers. Part of its power stems from its built-in regular expression utilities, which is why we're taking a look at it.

Okay, here's about as simple a Perl program as you're going to see. It looks both similar to and different from C++:
Ex0.pl
#!/usr/bin/perl

$count = 0; # Creates variable $count and initializes it to zero

while(<>)   # Loop over each line of input
{
  if ($_ eq "\n")  # The $_ variable contains the current line
  {
    $count++;
  }
}

print "There were $count blank lines!\n";

input0
here's the first line

here's another
and still more

but no more after ...
this!

Run of program
valiant> Ex0.pl < input0
There were 2 blank lines!
Note: to treat the text file Ex0.pl as a program, you must first make it executable with the command chmod +x Ex0.pl.

Anything following a # is a comment in Perl. The file starts out with #!/usr/bin/perl, which Perl itself treats as a comment, but it's how the operating system knows to use the Perl interpreter to "run" the script, so every Perl program should start with it. Variables in Perl start with a "$". You don't need to declare them, you can just go ahead and use them. Strings can be compared with the eq operator, and you need to surround the body of an "if" or "while" with { }s. Other than that, simply accept that "while (<>)" loops over each line of input, and that the variable "$_" inside the while-loop is a string containing the line just read.

If you want to learn Perl, you can get the O'Reilly book "Perl in a Nutshell" from Safari, a collection of tech books that we can access online for free, because the library subscribes. This includes a lot of the O'Reilly books! Follow the above link and search for "Perl". Be sure to click the "My Books" button. What gets returned is a list of all the O'Reilly books about Perl whose full text we get online through our library's subscription. The first one is "Perl in a Nutshell", and you can find out as much about Perl as you could wish to know there.

A simple Perl program using regular expressions
The Perl syntax for matching a string against a regular expression looks like this:
string =~ /regular expression/x
The way this works is that Perl checks whether string contains a match for regular expression. The whole expression is true if there's a match and false otherwise. Regular expressions in Perl follow the same basic rules as in egrep, except that, because I followed the whole thing with an x, whitespace is ignored in the expression (e.g. /a(bb)*c/x is equivalent to /a (bb)* c/x). You need to use \s to match a whitespace character. What follows is a simple Perl program that uses regular expressions to determine whether a line contains either b0b or b1b.
Ex1.pl
#!/usr/bin/perl
#######################################################################
# PROGRAM: This program reads input and prints "MATCH!" for each line #
# containing "b0b" or "b1b", "NO MATCH!" for each line that doesn't.  #
#######################################################################

# Loop over each line of input
while(<>)
{
  # Parse line (the "$_" variable is the current line)
  $match_query = $_ =~ /b(0|1)b/x;

  # Print "MATCH!" or "NO MATCH!"
  if ($match_query) 
  {
    print "MATCH!\n";
  }
  else
  {
    print "NO MATCH!\n";
  }
}

Copy this program and run it a few times to see how it works. Is the line "axorq8,e3k0ka1b0bkca4bb2b" a match? Can you modify this program so that instead of a 0 or 1 in between the b's you can have any digit from 0 up to 9? Hint: Do it!
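Here is a minimal sketch of the whitespace rule described above, using the a(bb)*c pattern from the text rather than the b0b/b1b one. The function names spaced_match and whitespace_match are made up for this illustration, and sub, my, use strict and use warnings are ordinary Perl features that these notes don't otherwise cover.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the pattern is written with extra spaces, and the
# trailing x makes Perl ignore them, so this is the same as /a(bb)*c/.
sub spaced_match {
  my ($text) = @_;
  if ($text =~ /a (bb)* c/x) {
    return "MATCH!";
  }
  return "NO MATCH!";
}

# Hypothetical helper: \s demands a real whitespace character between
# the a and the c.
sub whitespace_match {
  my ($text) = @_;
  if ($text =~ /a \s c/x) {
    return "MATCH!";
  }
  return "NO MATCH!";
}

print spaced_match("xxabbbbcxx"), "\n";   # a, then bb twice, then c
print spaced_match("xxa bb cxx"), "\n";   # spaces in the input are not skipped
print whitespace_match("xxa cxx"), "\n";  # \s matches the space
print whitespace_match("xxacxx"), "\n";   # nothing between a and c
```

Only the spaces inside the pattern are ignored. Spaces in the line being tested still count as characters, which is why the second call reports NO MATCH!.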

Finding out what matched
Suppose that in the above example we wanted not only to know whether there was a match with b(0|1)b, but also to know what matched the (0|1) part -- i.e. did we get a zero or a one in there? In Perl we can refer to the string that matched a parenthesized part of a regular expression by $n, where the left parenthesis of that part is the nth opening parenthesis of the expression. In the above, there is only one open paren, so $1 refers to the string that matched the (0|1). This is just a variable that we can print, or test, or do whatever with in Perl.
Ex2.pl
#!/usr/bin/perl
#######################################################################
# PROGRAM: This program reads input and prints "MATCH!" for each line #
# containing "b0b" or "b1b", and "NO MATCH!" for each line that       #
# doesn't.  When a match is found, the program also tells you what    #
# the bit in between the b's was.                                     #
#######################################################################

# Loop over each line of input
while(<>)
{
  # Parse line (the "$_" variable contains the current line)
  $match_query = $_ =~ /b(0|1)b/x;

  # Print "MATCH!" or "NO MATCH!"
  if ($match_query)
  {
    print "MATCH! Bit was: $1\n";
  }
  else
  {
    print "NO MATCH!\n";
  }
}

Notice now that in the print statement $1 appears in the string to print. The value of the variable $1 gets substituted in when we go to print. Copy this program and run it on some test input. What do you get for "and another thing b3b is that", and what do you get for "a0abcd3c1cb1bxac"?

Finding out what matched: Part II
When you have a complicated regular expression, with lots of things in ( )s, you have to count opening parentheses to determine what number n to use in $n. In the following example, we have two parenthesized subexpressions, each matching a different part of the string.
Ex3.pl
#!/usr/bin/perl
#######################################################################
# PROGRAM: This program prints "MATCH!" or "NO MATCH!" depending on
# whether or not a line contains a string like "<67B>", i.e. number
# followed by A or B, all in <>'s and without spaces.  When a match is
# found, the numeric and letter values are printed out.
#######################################################################

# Loop over each line of input
while(<>)
{
  # Parse line (the "$_" variable contains the current line)
  $match_query = $_ =~ /<([0-9]*)(A|B)>/x;

  # Print "MATCH!" or "NO MATCH!"
  if ($match_query)
  {
    print "MATCH! Value was: ($1,$2)\n";
  }
  else
  {
    print "NO MATCH!\n";
  }
}

Copy this program and try it on input like this:

The secret key <456 B> is
hidden <B678> in <007B> this 
document somewhere 56A but you'll never
find <22B it.
	
Sometimes parens are nested and things get complicated. For example, consider the expression E = x( letter([ab]+) | number([01]+) )y. This is matched by strings like xletteraaby and xnumber1010y. In the first case, $1 = "letteraab", $2 = "aab", and $3 = "". In the second case, $1 = "number1010", $2 = "", and $3 = "1010". Do you see why? (The + sits inside the inner parentheses on purpose: a group written as (a|b)+ is refilled on every repetition, so it ends up holding only the last letter.)
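A small sketch for checking the group numbering yourself. It writes the inner groups as ([ab]+) and ([01]+) so that each one captures a whole run of characters, and adds a second check showing what a group placed directly under a + ends up holding. The function names nested_groups and last_repetition are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns ($1, $2, $3) on a match, or an empty list.
# A group that took no part in the match is undefined, so it is reported
# here as an empty string.
sub nested_groups {
  my ($text) = @_;
  if ($text =~ /x( letter([ab]+) | number([01]+) )y/x) {
    return map { defined $_ ? $_ : "" } ($1, $2, $3);
  }
  return ();
}

# Hypothetical helper: here the + is outside group 2, so the group is
# overwritten on each repetition and keeps only the last letter matched.
sub last_repetition {
  my ($text) = @_;
  if ($text =~ /x( letter(a|b)+ | number(0|1)+ )y/x) {
    return defined $2 ? $2 : "";
  }
  return "";
}

for my $text ("xletteraaby", "xnumber1010y") {
  my @groups = nested_groups($text);
  print "$text: (", join(",", @groups), ")\n";
}
print "last repetition: ", last_repetition("xletteraaby"), "\n";
```

Counting opening parentheses from the left gives the numbering: the outer group is 1, the letters are group 2, and the digits are group 3, whichever branch of the | actually matched.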

Note: It's okay to put redundant parentheses around parts of your regular expression just so you can use a $n to refer back to whatever matched them.

Perl Regular Expression Extras
Problem 1
Write a Perl program that will print out all the computer science class numbers appearing in its input file (assume no more than one per line). All computer science class numbers are "SI" followed by three digits (a \d matches a single digit in Perl), possibly followed by one capital letter. For example, SI204, SI486A and SI311 are all courses. On the other hand, IT221, SI32, and SI496r are not. If you give your program the input file
This semester SI433 is running for
the last time take it while you can.  
EM300 is running as always, but don't
try taking it out of sequence.  We
got two OSI2 people, one teaching
SI204, the other teaching IT310 as
well as SI283.  I don't think there
is a SI496A running this Spring.
The SI 2004 course offerings look
like they're nailed down, though.
You should get output:
433
204
283
496A
Problem 2
Write a Perl program that will print out whatever part of a line falls between "<b>" and "</b>". Now, since "/" marks the beginning and end of a regular expression in Perl, you need to backslash it when you want a literal slash. If you really want to match "hi/bye", for example, you'd use the regular expression "/hi\/bye/x". For the input
There <b>is</b> a time to
play and there <b>is not</b>
a reason to panic!
your program should produce output
is
is not
	
Problem 3
Try running your Problem 2 solution on the input:
Welcome to <b>Joes</b> home of <b>FREE REFILLS</b>!!!
What you'll probably get is:
Joes</b> home of <b>FREE REFILLS
What happened? Well, after finding the first <b> it decided to match with the last </b>. By default, when you use a * or a + Perl will try to make the match as long as possible. Here we want it to make the match as short as possible. If you use *? or +? in place of * and + you get shortest possible matches instead. Modify your Problem 2 solution so that you get
Joes
on the above input. In other words, you get the first occurrence of something delimited by <b> and </b>.
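To see the difference between * and *? on its own, here is a minimal sketch that uses double quotes as delimiters instead of the tags from the problem. The function names longest_quoted and shortest_quoted are made up for this illustration, and the pattern relies on "." matching any character, as in egrep.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: .* grabs as much as it can, so the match runs
# from the first double quote to the LAST one on the line.
sub longest_quoted {
  my ($text) = @_;
  if ($text =~ /"(.*)"/x) {
    return $1;
  }
  return "";
}

# Hypothetical helper: .*? grabs as little as it can, so the match stops
# at the first double quote that follows the opening one.
sub shortest_quoted {
  my ($text) = @_;
  if ($text =~ /"(.*?)"/x) {
    return $1;
  }
  return "";
}

my $line = 'say "hi" and "bye" now';
print "longest:  ", longest_quoted($line), "\n";   # hi" and "bye
print "shortest: ", shortest_quoted($line), "\n";  # hi
```

Both patterns start matching at the same place, the first double quote. The only thing the ? changes is how far the match extends from there.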
Problem 4
A lot of times you can afford to be sloppy by writing a regular expression that accepts more than you really want, as long as the extra stuff probably won't show up in the inputs you give your program, and especially if there's an easy check that a human can ultimately give. Consider my favorite: meeting times of courses. Meeting times look like a string of at most 5 days of the week (M,T,W,R,F) followed by a string of at most 2 class periods (1,2,3,4,5,6,8,9,10). While it's true that there shouldn't be duplicates of periods or days, what're the odds it'll come up in your input? Be lazy! The file input1 is html source code. It has a boatload of meeting times in it. Write a Perl script that will print out the meeting times it contains. If the script prints out a couple of extras ... so be it. If your program is lab4.pl, run it like this:
lab4.pl < ~wcbrown/courses/SI472/classes/L19/input1
	
That'll save you copying and pasting this big ugly file! Hint: If you want between m and n occurrences of something in a regular expression, {m,n} does the trick. It's like a limited * or +. So, for example, if you wanted a number between 10 and 9999 you might use the regular expression [1-9]\d{1,3}.
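The {m,n} hint can be tried out with a short sketch like the one below. It wraps [1-9]\d{1,3} in ^ and $, which tie the pattern to the start and end of the string; that anchoring is an addition made here so that a longer number such as 10000 is not accepted because part of it fits. The function name in_range is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true when the whole string is a number from 10 to
# 9999, i.e. one nonzero digit followed by one to three more digits.
sub in_range {
  my ($text) = @_;
  if ($text =~ /^ [1-9]\d{1,3} $/x) {
    return 1;
  }
  return 0;
}

for my $text ("9", "10", "472", "9999", "10000", "0123") {
  print "$text: ", (in_range($text) ? "yes" : "no"), "\n";
}
```

Without the ^ and $, the pattern would also succeed on 10000 by matching just the 1000 inside it, which is the kind of sloppiness the problem says you can often live with.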
A Cool Perl Program
Check out this very cool Perl program. It scans lines to determine whether a phone number (with area code) is there. It finds the number and prints out its three components. It's able to deal with all sorts of formats for writing phone numbers! script1.0.html


Christopher W Brown
Last modified: Tue Nov 25 09:30:55 EST 2003