Class 16: Regular Expressions


Reading
Section 2.4 of Theory of Computing, A Gentle Introduction
Homework
Finish the RegeXeX exercises at:
/home/wcbrown/bin/regexexbrown &       ← Dr. Brown's sections
/home/wcbrown/bin/regexexortiz &       ← Capt Ortiz's sections
You can ssh into one of the Unix machines and run regexex of it using Cygwin on your PC in Bancroft ... or you can just do it in the lab.

What we know
Summing up pretty much everything we've done this far, here's what we know: Let Σ be an alphabet:

  1. The language { } can be accepted by a finite automaton.
  2. The language {λ} can be accepted by a finite automaton.
  3. For a ∈ Σ, the language {a} can be accepted by a finite automaton.
  4. If L1 and L2 are languages accepted by finite automata, then L1∪L2 (union), L1∩L2 (intersection), and L1L2 (concatenation) are accepted by finite automata.
  5. If L is a language accepted by finite a automaton, then L* (Kleene Star) is accepted by a finite automaton, and L (complement) is accepted by a finite automaton.

Notice that we don't need to specify whether the finite automata are deterministic or nondeterministic, because we proved that you can convert from one to the other and back. That was one of our biggest results! We actually have a whole host of other smaller theorems, but the above are our main results.

Regular Expressions
Finite automata are great tools for recognizing languages, i.e. determining whether strings are in a given language, but are really not too good for specifying languages. It'd be nice to have a short "programming language" for specifying a language like "the set of all strings over {a,b,c} that start with a and end with c. Regular Expressions are exactly that. The regular expression for this language is:
a(a|b|c)*c
The | means "union" (though you'll usually read it as "or"), and the * means "Kleene Star". So, this regular expression specifies a language of strings consisting of three parts:
  1. the character a, followed by
  2. zero or more occurences of a or b or c, followed by
  3. the character c.
Parenteheses are used for grouping, just like with arithmetic. We will agree that precedence of "or", "concatenation" and "Kleene Star" are like plus, times and exponentiation respectively. Thus, aa|bb defines the language {aa,bb}, i.e. aa|bb = (aa)|(bb). What does a(a|b)b define? It defines the language {aab,abb}. Similarly, ab*c = a(b*)c and defines the language {ac,abc,abbc,abbbc,...}. Don't worry, I'm not going to try to trick you with wierd precendence problems!

Applications of regular expressions
Regular expressions are primarily used for specifications and for pattern matching. Let's consider specification first. The HTTP 1.1 specification (HTTP is the protocol underlying the web) all sorts of things. One of them is a specification for how the HTTP version that a system is using is to be communicated to another system. Here's what an HTTP version must look like according to the spec, using our regular expression syntax. [Note: we'll let D stand for a digit.]
HTTP/DD*.DD*
This says, for instance that "HTTP/0.12" is valid, but "HTTP/.12" is not. Do you see why? Now, there are many different dialects of regular expressions. The dialect the HTTP spec uses is a bit different. It uses "1*" to mean "one or more occurences", and Kleene-Star-like operators come before their arguments instead of after. Character literals like the "H" in our regular expression must appear in quotes, and DIGITS stands for ... well, for digits, of course. Given that, you should be able to read the following, which is excerted from the HTTP 1.1 specification:
 The version of an HTTP message is indicated by an HTTP-Version field in the first line of the message.

       HTTP-Version   = "HTTP" "/" 1*DIGIT "." 1*DIGIT
	
Regular expressions for pattern matching is a huge part of modern programming. Next class we'll do a lab about it. It's also part of many important system tools, like egrep, our next topic.

Regular Expressions and egrep
The program egrep searches files for matches of regular expressions. Specifically, it looks at a line of text and checks whether the line contains an occurence of the given regular expression. If it does, the line gets printed, otherwise it doesn't. For example, suppose you had a Java class Foo and you couldn't remember what other files in your project used Foo you could simply run
egrep Foo *.java
... and every line of every .java file that contained the string "Foo" would be printed out, prefaced with the name of the file it came from. Here's a more interesting example. Suppose I want to find all my lectures that use the word "Nondeterminstic", or any variations, like "Non-Determinsm", and so forth. I cd to my "classes" directory and do:
> egrep '(N|n)on(-*)(D|d)eterm' L*/Class.html
L05/Class.html: look at <em>non-deterministic</em> finite automata later.)
L10/Class.html:    <h1>Class 10: Finishing Proof, Starting Nondeterministic Finite Automata</h1>
L10/Class.html: <em>non-deterministic</em>, i.e. we can't tell exactly what
L10/Class.html:   A non-deterministic finite automaton accepts a string if
L10/Class.html: makes sense for any non-deterministic machine.
L10/Class.html:      <dt><b>Summing up Non-Deterministic Finite Automata</b></dt>
--- AND LOTS MORE -----
Take a look at where and how we got these matches. Now, this lists lots of lines (73 lines, as "egrep '(N|n)on(-*)(D|d)eterm' L*/Class.html | wc -l" tells me), and I get the whole line, not just the file name. Without explaining why, I can extract just the file names with the following:
> egrep '(N|n)on(-*)(D|d)eterm' L*/Class.html | sed 's/:.*//' | sort -u
L05/Class.html
L10/Class.html
L11/Class.html
L12/Class.html
L13.0/Class.html
L13/Class.html
L14/Class.html
L15/Class.html
L17/Class.html

You can run egrep without an input file, as well, in which case it simply echos or doesn't echo each line you print depending on whether the given regular expression is found.
Example: Check whether a line contains an occurence of aabb or bbaa. The regular expression (aabb)|(bbaa) is what we want, since egrep echos lines that contain a matching string ... they don't have to match exactly. In the below, what the user types is in red, what the computer prints is in black.
valiant[1] [~/]> egrep '(aabb)|(bbaa)'
babbaaab
babbaaab
abababb
bbbbabaabbbaa
bbbbabaabbbaa
	  
See what happened? When egrep finds a substring of the line that matches our regular expression, it echos the line. We can force the entire line to match the regular expression, and not just some substring, by putting a ^ at the beginning of the expression and a $ at the end.
Example: Check whether a line is aabb or bbaa, not merely whether it contains either of these strings.
valiant[2] [~/]> egrep '^(aabb)|(bbaa)$'
abbaa
aabb
aabb
bbbaa
bbaa
bbaa
More egrep and Unix regular expression information is available under the course resources link, or here.

Formal definition of a regular expression
Now that we are (hopefully) familiar with using regular expressions in an informal way, it's time to give a formal defintions of a regular expression
Definition: A regular expression over the alphabet Σ is an expression derived from the following rules:
  1. λ, ∅, and a, where a ∈ Σ, are all regular expressions
  2. if E is a regular expression then (E)* is a regular expression
  3. if E1 and E2 are regular expressions then (E1)(E2) and (E1)|(E2) are regular expressions
  4. nothing else is a regular expression!
A regular expression defines a language in what is hopefully at this point the obvious way. (The book gives a technical definition of the language defined by a regular expression.)

Note on egrep: The egrep program is an extremely useful tool and a good way to play around with regular expressions. However, bear in mind that it is a practical application of regular expressions, so it has all sorts of fun features added on that are useful, but which confuse the theoretical model without increasing its theoretical power. So our theoretical model is the stripped down Ford Festiva next to egrep's $100,000 Mercedes. Also, egrep leaves out λ because allowing it makes the implementation less efficient, and in practice one can always find a way to write the regular expressions you need without using λ.

RegeXeX: Into the lab to learn how to write regular expressions
The program RegeXeX tutors you in the writing of regular expressions. LT Hardisty and I wrote it to help in teaching theory of computing. What's interesting is that you should be able to figure out for yourselves how it works before this semester is over!


Christopher W Brown
Last modified: Tue Oct 6 11:11:50 EDT 2009