scp mALPHA@ssh.cs.usna.edu:/courses/wcbrown/regexexT ~/bin/
chmod +rx ~/bin/regexexT
regexexT https://www.usna.edu/Users/cs/wcbrown/courses/F24SI342/ps01Once it's running, type 'help' to get a list of commands, and get started answering questions.
Notice that we don't need to specify whether the finite automata are deterministic or nondeterministic, because we proved that you can convert from one to the other and back. That was one of our biggest results! We actually have a whole host of other smaller theorems, but the above are our main results.
a(a|b|c)*cThe | means "union" (though you'll usually read it as "or"), and the * means "Kleene Star". So, this regular expression specifies a language of strings consisting of three parts:
Sometimes we want to match the empty string, like this
(a|c|λ)b*... which would match all strings any number of b's, possibly preceeded by a single a or c.
HTTP/DD*.DD*This says, for instance that "HTTP/0.12" is valid, but "HTTP/.12" is not. Do you see why? Now, there are many different dialects of regular expressions. The dialect the HTTP spec uses is a bit different. It uses "1*" to mean "one or more occurences", and Kleene-Star-like operators come before their arguments instead of after. Character literals like the "H" in our regular expression must appear in quotes, and
DIGITS stands for ... well, for digits, of
course.
Given that, you should be able to read the following, which is excerted from the HTTP 1.1 specification:
The version of an HTTP message is indicated by an HTTP-Version field in the first line of the message.
HTTP-Version = "HTTP" "/" 1*DIGIT "." 1*DIGIT
Another nice example comes from the HTTP definition of header field syntax. The specification states that HTTP header field content follows:
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]In our syntax this would be the regular expression
c ( λ | (' '|'\t')(' '|'\t')* c )
where c stands for the regular expression for all printible
characters, i.e. (!|"|#|...|}|∼)
egrep, our next topic.
egrepegrep searches files for matches of
regular expressions. Specifically, it looks at a line of text
and checks whether the line contains an occurence of the given
regular expression. If it does, the line gets printed,
otherwise it doesn't.
For example, suppose you had a Java class Foo and
you couldn't remember what other files in your project used
Foo you could simply run
egrep Foo *.java... and every line of every .java file that contained the string "Foo" would be printed out, prefaced with the name of the file it came from. Here's a more interesting example. Suppose I want to find all my lectures that use the word "Nondeterminstic", or any variations, like "Non-Determinsm", and so forth. I cd to my "classes" directory and do:
> egrep '(N|n)on(-*)(D|d)eterm' L*/lec.html
L05/lec.html: look at <em>non-deterministic</em> finite automata later.)
L10/lec.html: <h1>Class 10: Finishing Proof, Starting Nondeterministic Finite Automata</h1>
L10/lec.html: <em>non-deterministic</em>, i.e. we can't tell exactly what
L10/lec.html: A non-deterministic finite automaton accepts a string if
L10/lec.html: makes sense for any non-deterministic machine.
L10/lec.html: <dt><b>Summing up Non-Deterministic Finite Automata</b></dt>
--- AND LOTS MORE -----
Take a look at where and how we got these matches.
Now, this lists lots of lines (73 lines, as "egrep
'(N|n)on(-*)(D|d)eterm' L*/lec.html | wc -l" tells
me), and I get the whole line, not just the file name.
Without explaining why, I can extract just the file names with
the following:
> egrep '(N|n)on(-*)(D|d)eterm' L*/lec.html | sed 's/:.*//' | sort -u
L05/lec.html
L10/lec.html
L11/lec.html
L12/lec.html
L13.0/lec.html
L13/lec.html
L14/lec.html
L15/lec.html
L17/lec.html
You can run egrep without an input file,
as well, in which case it simply echos or doesn't echo each
line you print depending on whether the given regular
expression is found.
Example: Check whether a line contains an occurence of aabb or bbaa. The regular expression (aabb)|(bbaa) is what we want, sinceSee what happened? Whenegrepechos lines that contain a matching string ... they don't have to match exactly. In the below, what the user types is in red, what the computer prints is in black.valiant[1] [~/]> egrep '(aabb)|(bbaa)' babbaaab babbaaab abababb bbbbabaabbbaa bbbbabaabbbaa
egrep finds a substring
of the line that matches our regular expression, it echos the
line. We can force the entire line to match the
regular expression, and not just some substring, by putting a
^ at the beginning of the expression and a $ at
the end.
Example: Check whether a line is aabb or bbaa, not merely whether it contains either of these strings.Morevaliant[2] [~/]> egrep '^(aabb)|(bbaa)$' abbaa aabb aabb bbbaa bbaa bbaa
egrep and Unix regular expression
information is available under the course resources link,
or here.
Definition: A regular expression over the alphabet Σ is an expression derived from the following rules:A regular expression defines a language in what is hopefully at this point the obvious way. (The book gives a technical definition of the language defined by a regular expression.)
- λ, ∅, and a, where a ∈ Σ, are all regular expressions
- if E is a regular expression then (E)* is a regular expression
- if E1 and E2 are regular expressions then (E1)(E2) and (E1)|(E2) are regular expressions
- nothing else is a regular expression!
Note on egrep:
The egrep program is an extremely useful tool and
a good way to play around with regular expressions. However,
bear in mind that it is a practical application of
regular expressions, so it has all sorts of fun features added
on that are useful, but which confuse the theoretical model
without increasing its theoretical power.
So our theoretical model is the stripped down Ford Festiva
next to egrep's $100,000 Mercedes.
Also, egrep leaves out λ because allowing it makes
the implementation less efficient, and in practice one can
always find a way to write the regular expressions you need
without using λ.