/home/wcbrown/bin/regexexbrown & ← Dr. Brown's sections /home/wcbrown/bin/regexexortiz & ← Capt Ortiz's sectionsYou can ssh into one of the Unix machines and run regexex of it using Cygwin on your PC in Bancroft ... or you can just do it in the lab.
Notice that we don't need to specify whether the finite automata are deterministic or nondeterministic, because we proved that you can convert from one to the other and back. That was one of our biggest results! We actually have a whole host of other smaller theorems, but the above are our main results.
a(a|b|c)*cThe | means "union" (though you'll usually read it as "or"), and the * means "Kleene Star". So, this regular expression specifies a language of strings consisting of three parts:
HTTP/DD*.DD*This says, for instance that "HTTP/0.12" is valid, but "HTTP/.12" is not. Do you see why? Now, there are many different dialects of regular expressions. The dialect the HTTP spec uses is a bit different. It uses "1*" to mean "one or more occurences", and Kleene-Star-like operators come before their arguments instead of after. Character literals like the "H" in our regular expression must appear in quotes, and
DIGITSstands for ... well, for digits, of course. Given that, you should be able to read the following, which is excerted from the HTTP 1.1 specification:
The version of an HTTP message is indicated by an HTTP-Version field in the first line of the message. HTTP-Version = "HTTP" "/" 1*DIGIT "." 1*DIGITRegular expressions for pattern matching is a huge part of modern programming. Next class we'll do a lab about it. It's also part of many important system tools, like
egrep, our next topic.
egrepsearches files for matches of regular expressions. Specifically, it looks at a line of text and checks whether the line contains an occurence of the given regular expression. If it does, the line gets printed, otherwise it doesn't. For example, suppose you had a Java class
Fooand you couldn't remember what other files in your project used
Fooyou could simply run
egrep Foo *.java... and every line of every .java file that contained the string "Foo" would be printed out, prefaced with the name of the file it came from. Here's a more interesting example. Suppose I want to find all my lectures that use the word "Nondeterminstic", or any variations, like "Non-Determinsm", and so forth. I cd to my "classes" directory and do:
> egrep '(N|n)on(-*)(D|d)eterm' L*/Class.html L05/Class.html: look at <em>non-deterministic</em> finite automata later.) L10/Class.html: <h1>Class 10: Finishing Proof, Starting Nondeterministic Finite Automata</h1> L10/Class.html: <em>non-deterministic</em>, i.e. we can't tell exactly what L10/Class.html: A non-deterministic finite automaton accepts a string if L10/Class.html: makes sense for any non-deterministic machine. L10/Class.html: <dt><b>Summing up Non-Deterministic Finite Automata</b></dt> --- AND LOTS MORE -----Take a look at where and how we got these matches. Now, this lists lots of lines (73 lines, as "
egrep '(N|n)on(-*)(D|d)eterm' L*/Class.html | wc -l" tells me), and I get the whole line, not just the file name. Without explaining why, I can extract just the file names with the following:
> egrep '(N|n)on(-*)(D|d)eterm' L*/Class.html | sed 's/:.*//' | sort -u L05/Class.html L10/Class.html L11/Class.html L12/Class.html L13.0/Class.html L13/Class.html L14/Class.html L15/Class.html L17/Class.htmlYou can run egrep without an input file, as well, in which case it simply echos or doesn't echo each line you print depending on whether the given regular expression is found.
Example: Check whether a line contains an occurence of aabb or bbaa. The regular expression (aabb)|(bbaa) is what we want, sinceSee what happened? When
egrepechos lines that contain a matching string ... they don't have to match exactly. In the below, what the user types is in red, what the computer prints is in black.valiant [~/]> egrep '(aabb)|(bbaa)' babbaaab babbaaab abababb bbbbabaabbbaa bbbbabaabbbaa
egrepfinds a substring of the line that matches our regular expression, it echos the line. We can force the entire line to match the regular expression, and not just some substring, by putting a ^ at the beginning of the expression and a $ at the end.
Example: Check whether a line is aabb or bbaa, not merely whether it contains either of these strings.Morevaliant [~/]> egrep '^(aabb)|(bbaa)$' abbaa aabb aabb bbbaa bbaa bbaa
egrepand Unix regular expression information is available under the course resources link, or here.
Definition: A regular expression over the alphabet Σ is an expression derived from the following rules:A regular expression defines a language in what is hopefully at this point the obvious way. (The book gives a technical definition of the language defined by a regular expression.)
- λ, ∅, and a, where a ∈ Σ, are all regular expressions
- if E is a regular expression then (E)* is a regular expression
- if E1 and E2 are regular expressions then (E1)(E2) and (E1)|(E2) are regular expressions
- nothing else is a regular expression!
Note on egrep:
egrep program is an extremely useful tool and
a good way to play around with regular expressions. However,
bear in mind that it is a practical application of
regular expressions, so it has all sorts of fun features added
on that are useful, but which confuse the theoretical model
without increasing its theoretical power.
So our theoretical model is the stripped down Ford Festiva
next to egrep's $100,000 Mercedes.
Also, egrep leaves out λ because allowing it makes
the implementation less efficient, and in practice one can
always find a way to write the regular expressions you need
without using λ.