SI340: Regular Expressions in Perl

Reading: None.
Homework: Your homework is to finish the Problems below. AND, you must look at this comic ... and this one too.
Turn In: a filled in printout of this form.

What is a Perl

Perl is an odd but very useful programming language. We would call it a "scripting" language because, instead of writing programs that get compiled into binaries, your source code is a "script" that gets sent to a Perl interpreter to run. It's great for processing text, which makes it very useful to system administrators and web developers. Part of its power stems from its built-in regular expression utilities, which is why we're taking a look at it.

Okay, here's about as simple a Perl program as you're going to see. It looks both similar to and different from C++:

Ex0.pl

#!/usr/bin/perl

$count = 0; # Create variable $count and initialize it to zero

while(<>)   # Loop over each line of input
{
  if ($_ eq "\n")  # The $_ variable contains the current line
  {
    $count++;
  }
}

print "There were $count blank lines!\n";

input0

here's the first line

here's another
and still more

but no more after ...
this!

Run of program

valiant> Ex0.pl < input0
There were 2 blank lines!

Note: to treat the text file Ex0.pl as a program, you must first make it executable with the command chmod +x Ex0.pl.

Anything following a # is a comment in Perl, which is why the file can start out with #!/usr/bin/perl. Every Perl program should start with this. It's how the operating system will know to use the Perl interpreter to "run" the script. Scalar variables in Perl (the only kind used here) start with a "$". You don't need to declare them, you can just go ahead and use them. Strings can be compared with the eq operator, and you need to surround the body of an "if" or "while" with { }s even if it's only one statement. Other than that, simply accept that "while (<>)" loops over each line of input, and that the variable "$_" inside the while-loop is a string containing the line just read.
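To see why the eq operator matters for strings, here is a minimal sketch comparing it with ==, the numeric comparison operator (== is not used elsewhere in this lab, and the values are made up for illustration):

```perl
#!/usr/bin/perl
# Sketch: eq compares strings character by character,
# while == converts both sides to numbers first.

$x = "1.0";
$y = "1";

# Note the { }s around each one-statement body.
if ($x eq $y) { $as_strings = "same"; } else { $as_strings = "different"; }
if ($x == $y) { $as_numbers = "same"; } else { $as_numbers = "different"; }

print "As strings: $as_strings\n";
print "As numbers: $as_numbers\n";
```

The two strings differ as text but are equal as numbers, which is why Ex0.pl uses eq to test for a line that is exactly "\n".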

If you want to learn Perl, you can get the O'Reilly book "Perl in a Nutshell". I'll look into whether that's available digitally from any of the library's resources.

A simple Perl program using regular expressions

The Perl syntax for testing a string against a regular expression looks like this:

string =~ /regular expression/x

Perl checks whether string contains a match for regular expression. The whole expression is true if there's a match and false otherwise. Regular expressions in Perl follow the same basic rules as in egrep, except that, because I followed the whole thing with an x, whitespace is ignored in the expression (e.g. /a(bb)*c/x is equivalent to /a (bb)* c/x). To match an actual whitespace character you need to use \s. What follows is a simple Perl program that uses regular expressions to determine whether a line contains either b0b or b1b.

Ex1.pl

#!/usr/bin/perl
#######################################################################
# PROGRAM: This program reads input and prints "MATCH!" for each line #
# containing "b0b" or "b1b", "NO MATCH!" for each line that doesn't.  #
#######################################################################

# Loop over each line of input
while(<>)
{
  # Parse line (the "$_" variable is the current line)
  $match_query = $_ =~ /b(0|1)b/x;

  # Print "MATCH!" or "NO MATCH!"
  if ($match_query) 
  {
    print "MATCH!\n";
  }
  else
  {
    print "NO MATCH!\n";
  }
}

Copy this program and run it a few times to see how it works. Is the line "axorq8,e3k0ka1b0bkca4bb2b" a match? Can you modify this program so that instead of a 0 or 1 in between the b's you can have any digit from 0 up to 9? Hint: Do it!
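The claim that whitespace is ignored under the x modifier can be checked directly. The following is a small sketch, with test strings made up for illustration, that runs /a(bb)*c/x, /a (bb)* c/x and /a\sc/x against the same inputs:

```perl
#!/usr/bin/perl
# Sketch: under x, spaces inside the pattern are ignored, so the
# "compact" and "spaced" patterns are the same regular expression.
# Only \s matches a real whitespace character in the input.

$report = "";
foreach $s ("abbc", "ac", "a c", "abc")
{
  $compact = ($s =~ /a(bb)*c/x)   ? "yes" : "no";
  $spaced  = ($s =~ /a (bb)* c/x) ? "yes" : "no";
  $with_s  = ($s =~ /a\sc/x)      ? "yes" : "no";
  $report .= "<$s> compact=$compact spaced=$spaced with_s=$with_s\n";
}
print $report;
```

The first two columns always agree, and only the input "a c" is matched by the \s version.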

Finding out what matched

Suppose that in the above example we wanted not only to know whether there was a match with b(0|1)b, but also to know what matched the (0|1) part -- i.e. did we get a zero or a one in there? In Perl we can refer to the string that matched a parenthesized part of a regular expression by $n, where the left parenthesis of that part is the nth opening parenthesis of the regular expression. In the above, there is only one open paren, so $1 refers to the string that matched the (0|1). This is just a variable that we can print, or test, or do whatever with in Perl.

Ex2.pl

#!/usr/bin/perl
#######################################################################
# PROGRAM: This program reads input and prints "MATCH!" for each line #
# containing "b0b" or "b1b", and "NO MATCH!" for each line that       #
# doesn't.  When a match is found, the program also tells you what    #
# the bit in between the b's was.                                     #
#######################################################################

# Loop over each line of input
while(<>)
{
  # Parse line (the "$_" variable contains the current line)
  $match_query = $_ =~ /b(0|1)b/x;

  # Print "MATCH!" or "NO MATCH!"
  if ($match_query)
  {
    print "MATCH! Bit was: $1\n";
  }
  else
  {
    print "NO MATCH!\n";
  }
}

Notice now that in the print statement $1 appears in the string to print. The value of the variable $1 will get substituted in when we go to print. Copy this program and run it on some test input. What do you get for "and another thing b3b is that", and what do you get for "a0abcd3c1cb1bxac"?

Finding out what matched: Part II

When you have a complicated regular expression, with lots of things in ( )s, you have to count opening parentheses to determine what number n to use in $n. In the following example, we have two parenthesized subexpressions, each matching a different part of the string.

Ex3.pl

#!/usr/bin/perl
#######################################################################
# PROGRAM: This program prints "MATCH!" or "NO MATCH!" depending on
# whether or not a line contains a string like "<67B>", i.e. number
# followed by A or B, all in <>'s and without spaces.  When a match is
# found, the numeric and letter values are printed out.
#######################################################################

# Loop over each line of input
while(<>)
{
  # Parse line (the "$_" variable contains the current line)
  $match_query = $_ =~ /<([0-9]*)(A|B)>/x;

  # Print "MATCH!" or "NO MATCH!"
  if ($match_query)
  {
    print "MATCH! Value was: ($1,$2)\n";
  }
  else
  {
    print "NO MATCH!\n";
  }
}

Copy this program and try it on input like this:

The secret key <456 B> is
hidden <B678> in <007B> this 
document somewhere 56A but you'll never
find <22B it.

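Sometimes parens are nested and things get complicated. For example, consider the expression E = x( letter(a|b)+ | number(0|1)+ )y. This is matched by strings like xletteraaby and xnumber1010y. In the first case, $1 = "letteraab", $2 = "b", and $3 is undefined (it prints as ""). In the second case, $1 = "number1010", $2 is undefined, and $3 = "0". Do you see why? Two things are going on: a group inside the alternative that wasn't used never gets a value, and a group that is repeated by + or * keeps only what its last repetition matched. If you want the whole run "aab" or "1010", put another pair of parentheses around the repeated part, e.g. ((a|b)+), and count opening parens again.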
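Running the expression is the surest way to check what lands in $1, $2 and $3. The sketch below does that for the two example strings; the helper show is a made-up name used here only to make undefined groups visible. Keep in mind that Perl stores only the last repetition of a group repeated by + or *, so for (a|b)+ the script reports the final letter rather than the whole run.

```perl
#!/usr/bin/perl
# Sketch: print the numbered groups for the nested expression E.

# Hypothetical helper: quote a defined value, flag an undefined one.
sub show { return defined($_[0]) ? "\"$_[0]\"" : "(none)"; }

$report = "";
foreach $s ("xletteraaby", "xnumber1010y")
{
  if ($s =~ /x( letter(a|b)+ | number(0|1)+ )y/x)
  {
    # $1 = outer group, $2 = (a|b), $3 = (0|1), counted by opening paren
    $report .= "$s: \$1=" . show($1) . " \$2=" . show($2) . " \$3=" . show($3) . "\n";
  }
}
print $report;
```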

Note: It's okay to put redundant parentheses around parts of your regular expression just so you can use a $n to refer back to whatever matched them.

Perl Regular Expression Extras

\d this matches a digit, so the regular expression /(a|b)\d/x matches a0,b0,a1,b1,a2,b2,...,a9,b9.
? this matches 0 or 1 occurrence of whatever preceded it. So, for example, the regular expression /1?\d/x matches all the numbers from zero up to 19.
[x-y] Ranges (as for egrep). For example, the regular expression /[A-D]|F/x matches the valid letter grades, A, B, C, D and F.
. this matches any character (except a newline). So, for example, if you want to match whatever comes between < and >, you'd use the regular expression /<(.*)>/x and $1 would give you whatever came between the < >s.
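Here is a small sketch that tries each of these extras on a made-up string. The extra pair of parentheses around each pattern is only there so that $1 holds whatever matched.

```perl
#!/usr/bin/perl
# Sketch: one test per extra; all input strings are invented.

$out = "";
if ("cab9"           =~ /((a|b)\d)/x)   { $out .= "letter+digit: $1\n"; }  # \d
if ("take 17 steps"  =~ /(1?\d)/x)      { $out .= "number: $1\n"; }        # ?
if ("grade: B"       =~ /([A-D]|F)/x)   { $out .= "grade: $1\n"; }         # [x-y]
if ("see <here> now" =~ /<(.*)>/x)      { $out .= "between: $1\n"; }       # .
print $out;
```

Note that these patterns find a match anywhere in the string, so /1?\d/x would also report a match inside a longer number.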

Problem 1

Write a Perl program that will print out all the "SI" computer science class numbers appearing in its input file (assume no more than one per line). In other words, we want all class numbers that are "SI" followed by three digits (note: a \d matches a single digit in Perl), possibly followed by one capital letter. For example, SI204, SI486A and SI311 are all "SI" computer science courses. On the other hand, IT221, SI32, and SI496r are not. If you give your program the input file

This semester SI433 is running for
the last time take it while you can.  
EM300 is running as always, but don't
try taking it out of sequence.  We
got two OSI2 people, one teaching
SI204 (or is it IC210?), the other 
teaching IT310 as well as SI283.  I 
don't think there is a SI496A running 
this Spring. The SI 2008 course 
offerings look like they're nailed 
down, though.

You should get output:

SI433
SI204
SI283
SI496A

Problem 2

Write a Perl program that will print out whatever part of a line falls between "<b>" and "</b>". Now, since "/" means something in a regular expression, you need to backslash it. If you really want to match "hi/bye" for example, you'd use the regular expression /hi\/bye/x. For input

There <b>is</b> a time to
play and there <b>is not</b>
a reason to panic!

your program should produce output

is
is not
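The backslashed "/" from the hi/bye example can be tried on its own. The sketch below also shows one alternative that is not used elsewhere in this lab: writing the match as m{...}, which lets a plain "/" appear in the pattern. The test strings are made up.

```perl
#!/usr/bin/perl
# Sketch: two ways to match a literal "/" in a regular expression.

$report = "";
foreach $s ("say hi/bye now", "say hi bye now")
{
  $escaped = ($s =~ /hi\/bye/x) ? "match" : "no match";  # backslashed slash
  $braces  = ($s =~ m{hi/bye}x) ? "match" : "no match";  # different delimiters
  $report .= "<$s> escaped=$escaped braces=$braces\n";
}
print $report;
```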

Problem 3

Try running your Problem 2 solution on the input:

Welcome to <b>Joes</b> home of <b>FREE REFILLS</b>!!!

What you'll probably get is:

Joes</b> home of <b>FREE REFILLS

What happened? Well, after finding the first <b> it decided to match with the last </b>. By default, when you use a * or a + Perl will try to make the match as long as possible. Here we want it to make the match as short as possible. If you use *? or +? in place of * and + you get shortest possible matches instead. Modify your Problem 2 solution so that you get

Joes

on the above input. In other words, you get the first occurrence of something delimited by <b> and </b>.
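The difference between * and *? is easy to see on a smaller case. This sketch uses double quotes as the delimiters rather than the tags from the problem, and the input string is made up:

```perl
#!/usr/bin/perl
# Sketch: longest versus shortest match between two delimiters.

$s = 'say "yes" or "no" now';

if ($s =~ /"(.*)"/x)  { $longest  = $1; }  # runs to the LAST closing quote
if ($s =~ /"(.*?)"/x) { $shortest = $1; }  # stops at the FIRST closing quote

print "longest:  $longest\n";
print "shortest: $shortest\n";
```

Both matches start at the first quote; only where they stop changes.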

Problem 4

A lot of times you can afford to be sloppy by writing a regular expression that accepts more than you really want, as long as the extra stuff probably won't show up in the inputs you give your program, and especially if there's an easy check that a human can ultimately give. Consider my favorite: meeting times of courses. Meeting times look like a string of at most 5 days of the week (M,T,W,R,F) followed by a string of at most 2 class periods (1,2,3,4,5,6,8,9,10) -- e.g. "MWF3" or "R34". While it's true that there shouldn't be duplicates of periods or days, what're the odds it'll come up in your input? Be lazy! The file input1 is html source code. It has a boatload of meeting times in it. Write a Perl script that will print out the meeting times it contains. If the script prints out a couple of extras ... so be it. If your program is lab4.pl, run it like this:

lab4.pl < ~wcbrown/courses/F11SI340/classes/L17/input1

That'll save you copying and pasting this big ugly file! Hint: If you want between m and n occurrences of something in a regular expression, {m,n} does the trick. It's like a limited * or +. So, for example, if you wanted a number between 10 and 9999 you might use the regular expression [1-9]\d{1,3}.
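The {m,n} hint can be checked on a few made-up numbers. This sketch adds the anchors ^ and $ (beginning and end of the string, as in egrep), which the hint itself does not use, so that the whole string has to be the number:

```perl
#!/usr/bin/perl
# Sketch: [1-9]\d{1,3} is one nonzero digit followed by 1 to 3 more digits.

$report = "";
foreach $s ("9", "10", "407", "9999", "10000", "0123")
{
  $verdict = ($s =~ /^[1-9]\d{1,3}$/x) ? "yes" : "no";
  $report .= "$s $verdict\n";
}
print $report;
```

Without the anchors, "10000" would also count as a match, because the pattern is satisfied by its first four digits.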

A Cool Perl Program

Check out this very cool Perl program. It scans lines to determine whether a phone number (with area code) is there. It finds the number and prints out its three components. It's able to deal with all sorts of formats for writing phone numbers! view script1.0.pl If you prefer, view script2.0.pl and you'll actually get comments, nice formatting, and all that other sissy stuff we tried to get you to do in SI204/IC210.

Christopher W Brown