SI342: Regular Expressions

Reading

Section 2.4 of Theory of Computing, A Gentle Introduction

Homework

Your homework is to work throught the regular expressions exercises delivered by "Regexex". By class on Wednesday, you should be able to show that you have completed at least five of the exercises. By Monday you should try to have them all completed. Here's how to use Regexex:

Install regexexT
In your VM, give the command

scp mALPHA@ssh.cs.usna.edu:/courses/wcbrown/regexexT ~/bin/	      
chmod +rx ~/bin/regexexT

Run regexexT
Run regexexT with the command (Note: be patient, it may be really slow to load ... I'm trying to track down why.)
```
regexexT https://www.usna.edu/Users/cs/wcbrown/courses/F24SI342/ps01
```
Once it's running, type 'help' to get a list of commands, and get started answering questions.
Using regexexT
If you exit and go back later, your progress will be remembered. You can print out a "certificate" with the 'cert' command, which validates that you have solved a certain number of questions.

On Monday, 30 September, bring a printout of the "certificate" to show how many you've solved. On Wednesday, you will be bringing your laptops to class. Be prepared to show that you have at least five solved by that point.

What we know

Summing up pretty much everything we've done this far, here's what we know: Let Σ be an alphabet:

The language { } can be accepted by a finite automaton.
The language {λ} can be accepted by a finite automaton.
For a ∈ Σ, the language {a} can be accepted by a finite automaton.
If L₁ and L₂ are languages accepted by finite automata, then L₁∪L₂ (union), L₁∩L₂ (intersection), and L₁L₂ (concatenation) are accepted by finite automata.
If L is a language accepted by finite a automaton, then L* (Kleene Star) is accepted by a finite automaton, and L (complement) is accepted by a finite automaton.

Notice that we don't need to specify whether the finite automata are deterministic or nondeterministic, because we proved that you can convert from one to the other and back. That was one of our biggest results! We actually have a whole host of other smaller theorems, but the above are our main results.

Regular Expressions

Finite automata are great tools for recognizing languages, i.e. determining whether strings are in a given language, but are really not too good for specifying languages. It'd be nice to have a short "programming language" for specifying a language like "the set of all strings over {a,b,c} that start with a and end with c. Regular Expressions are exactly that. The regular expression for this language is:

a(a|b|c)*c

The | means "union" (though you'll usually read it as "or"), and the * means "Kleene Star". So, this regular expression specifies a language of strings consisting of three parts:

the character a, followed by
zero or more occurences of a or b or c, followed by
the character c.

Parenteheses are used for grouping, just like with arithmetic. We will agree that precedence of "or", "concatenation" and "Kleene Star" are like plus, times and exponentiation respectively. Thus, aa|bb defines the language {aa,bb}, i.e. aa|bb = (aa)|(bb). What does a(a|b)b define? It defines the language {aab,abb}. Similarly, ab*c = a(b*)c and defines the language {ac,abc,abbc,abbbc,...}. Don't worry, I'm not going to try to trick you with wierd precendence problems!

Sometimes we want to match the empty string, like this

(a|c|λ)b*

... which would match all strings any number of b's, possibly preceeded by a single a or c.

Applications of regular expressions - Specification

Regular expressions are primarily used for specifications and for pattern matching. Let's consider specification first. The HTTP 1.1 specification (HTTP is the protocol underlying the web) all sorts of things. One of them is a specification for how the HTTP version that a system is using is to be communicated to another system. Here's what an HTTP version must look like according to the spec, using our regular expression syntax. [Note: we'll let D stand for a digit.]

HTTP/DD*.DD*

This says, for instance that "HTTP/0.12" is valid, but "HTTP/.12" is not. Do you see why? Now, there are many different dialects of regular expressions. The dialect the HTTP spec uses is a bit different. It uses "1*" to mean "one or more occurences", and Kleene-Star-like operators come before their arguments instead of after. Character literals like the "H" in our regular expression must appear in quotes, and DIGITS stands for ... well, for digits, of course. Given that, you should be able to read the following, which is excerted from the HTTP 1.1 specification:

 The version of an HTTP message is indicated by an HTTP-Version field in the first line of the message.

       HTTP-Version   = "HTTP" "/" 1*DIGIT "." 1*DIGIT

Another nice example comes from the HTTP definition of header field syntax. The specification states that HTTP header field content follows:

field-content  = field-vchar [ 1*( SP / HTAB ) field-vchar ]

In our syntax this would be the regular expression

c ( λ | (' '|'\t')(' '|'\t')* c )

where c stands for the regular expression for all printible characters, i.e. (!|"|#|...|}|∼)

Regular Expressions and pattern matching

Regular expressions are often used for specifying patterns that programs then can try to match in data. How this is done will be a big part of our discussion in coming classes. This gets used in many, many contexts. For one, regular expressions for pattern matching is a huge part of modern programming. Next class we'll do a lab about it. It's also part of many important system tools, like egrep, our next topic.

Regular Expressions and egrep

The program egrep searches files for matches of regular expressions. Specifically, it looks at a line of text and checks whether the line contains an occurence of the given regular expression. If it does, the line gets printed, otherwise it doesn't. For example, suppose you had a Java class Foo and you couldn't remember what other files in your project used Foo you could simply run

egrep Foo *.java

... and every line of every .java file that contained the string "Foo" would be printed out, prefaced with the name of the file it came from. Here's a more interesting example. Suppose I want to find all my lectures that use the word "Nondeterminstic", or any variations, like "Non-Determinsm", and so forth. I cd to my "classes" directory and do:

> egrep '(N|n)on(-*)(D|d)eterm' L*/lec.html
L05/lec.html: look at <em>non-deterministic</em> finite automata later.)
L10/lec.html:    <h1>Class 10: Finishing Proof, Starting Nondeterministic Finite Automata</h1>
L10/lec.html: <em>non-deterministic</em>, i.e. we can't tell exactly what
L10/lec.html:   A non-deterministic finite automaton accepts a string if
L10/lec.html: makes sense for any non-deterministic machine.
L10/lec.html:      <dt><b>Summing up Non-Deterministic Finite Automata</b></dt>
--- AND LOTS MORE -----

Take a look at where and how we got these matches. Now, this lists lots of lines (73 lines, as "

egrep
	'(N|n)on(-*)(D|d)eterm' L*/lec.html | wc -l

" tells me), and I get the whole line, not just the file name. Without explaining why, I can extract just the file names with the following:

> egrep '(N|n)on(-*)(D|d)eterm' L*/lec.html | sed 's/:.*//' | sort -u
L05/lec.html
L10/lec.html
L11/lec.html
L12/lec.html
L13.0/lec.html
L13/lec.html
L14/lec.html
L15/lec.html
L17/lec.html

You can run egrep without an input file, as well, in which case it simply echos or doesn't echo each line you print depending on whether the given regular expression is found.

Example: Check whether a line contains an occurence of aabb or bbaa. The regular expression (aabb)|(bbaa) is what we want, since egrep echos lines that contain a matching string ... they don't have to match exactly. In the below, what the user types is in red, what the computer prints is in black.
valiant[1] [~/]> egrep '(aabb)|(bbaa)'
babbaaab
babbaaab
abababb
bbbbabaabbbaa
bbbbabaabbbaa
	  

See what happened? When egrep finds a substring of the line that matches our regular expression, it echos the line. We can force the entire line to match the regular expression, and not just some substring, by putting a ^ at the beginning of the expression and a $ at the end.

Example: Check whether a line is aabb or bbaa, not merely whether it contains either of these strings.
valiant[2] [~/]> egrep '^(aabb)|(bbaa)$'
abbaa
aabb
aabb
bbbaa
bbaa
bbaa

More egrep and Unix regular expression information is available under the course resources link, or here.

Formal definition of a regular expression

Now that we are (hopefully) familiar with using regular expressions in an informal way, it's time to give a formal defintions of a regular expression

Definition: A regular expression over the alphabet Σ is an expression derived from the following rules:

λ, ∅, and a, where a ∈ Σ, are all regular expressions

if E is a regular expression then (E)* is a regular expression

if E1 and E2 are regular expressions then (E1)(E2) and (E1)|(E2) are regular expressions

nothing else is a regular expression!

A regular expression defines a language in what is hopefully at this point the obvious way. (The book gives a technical definition of the language defined by a regular expression.)

Note on egrep: The egrep program is an extremely useful tool and a good way to play around with regular expressions. However, bear in mind that it is a practical application of regular expressions, so it has all sorts of fun features added on that are useful, but which confuse the theoretical model without increasing its theoretical power. So our theoretical model is the stripped down Ford Festiva next to egrep's $100,000 Mercedes. Also, egrep leaves out λ because allowing it makes the implementation less efficient, and in practice one can always find a way to write the regular expressions you need without using λ.

Christopher W Brown

Last modified: Tue Oct 6 11:11:50 EDT 2009