Class 18: Regular Expressions and Finite Automata


Reading
Section 2.4 of Theory of Computing, A Gentle Introduction and, of course, these notes. Make sure you know what a "regular language" is!
Homework
Repeat to yourself 10 times: Any lanugage accepted by a DFA, can be accepted by an NDFA. Any language accepted by an NDFA can be defined by a Regular Expression. Any language defined by a Regular Expression can be accepted by an NDFA. Any language accepted by an NDFA can be accepted by a DFA. Then printout the homework and answer the questions on that paper.

Repeat: Formal definition of a regular expression
Now that we are (hopefully) familiar with using regular expressions in an informal way, it's time to give a formal defintions of a regular expression
definition of regular expression definition of language defined by a regular expression
Definition: A regular expression over the alphabet Σ is an expression derived from the following rules:
  1. λ, ∅, and a, where a ∈ Σ, are all regular expressions
  2. if E is a regular expression then (E)* is a regular expression
  3. if E1 and E2 are regular expressions then (E1)(E2) and (E1)|(E2) are regular expressions
  4. nothing else is a regular expression!
Definition: If R is a regular expression over alphabet Σ, then L(R), the language defined by regular expression R, is
  1. {λ} if R = λ, { } if R = ∅, and {a} if R = a, where a ∈ Σ
  2. L(E)* if R = (E)* for some regular expression E
  3. L(E1)L(E2) (respectively L(E1) ∪ L(E2)) if R = (E1)(E2) (respectively R = (E1)|(E2)) whereE1 and E2 are regular expressions

Informal method for constructing FAs accepting languages defined by regular expressions
Hopefully you remember expression trees from data structures or parse trees from programming languages. We went over a demo in class about how we could use the expression tree representation of a regular expression and apply our union, Kleene star and concatenation algorithms to construct a NDFA that accepts the language defined by a regular expression.

Converting Regular Expressions to Finite Automata
Since the only operations on regular expressions are union concatenation and Kleene star, those are the only real algorithms we need for FAs in order to do our conversion. Here they are (pulled from previous lectures):

Algorithm: Union
Input: Machines M1 = (Q1,Σ1,Δ1,s1,W1) and M2 = (Q2,Σ2,Δ2,s2,W2), where Q1 ∩ Q2 = ∅
Output: M = (Q1 ∪ Q2 &cup {s}, Σ1 ∪ Σ2, Δ1 ∪ Δ2 ∪ {(s,λ,s1),(s,λ,s2)}, s, W1 ∪ W2), where s ∉ Q1 and s ∉ Q2 (i.e. s is a new state)

Algorithm: Concatenation
Input: Machines M1 = (Q1,Σ1,Δ1,s1,W1) and M2 = (Q2,Σ2,Δ2,s2,W2), where Q1 ∩ Q2 = ∅
Output: Machine M = (Q1 ∪ Q2, Σ1 ∪ Σ2, Δ, s1,W2), where Δ = Δ1 ∪ Δ2 ∪ {(q,λ,s2) | q ∈ W1}

Algorithm: KleeneStar
Input: Machine M = (Q, Σ, Δ, s, W)
Output: M' = (Q ∪ {$}, Σ, Δ ∪ {($,λ,s)} ∪ {(q,λ,$) | q ∈ W}, $, {$}), where $ ∉ Q, i.e. $ is a new state

Now, in addition to operations, regular expressions may contain a, where a ∈ Σ, λ and ∅. So we need algorithms for building those machines. This is really easy!

a ∈ Σ Machine λ Machine ∅ Machine
Input: a ∈ Σ
Output: M = ({q0,q1},Σ,{(q0,a,q1)},q0,{q1})
Input: nothing
Output: M = ({q0},Σ,{ },q0,{q0})
Input: nothing
Output: M = ({q0},Σ,{ },q0,{ })

Now we really have everything we need for a recursive algorithm to convert a regular expression λ into a finite automaton.

Algorithm: RE2FA
Input: regular expression E
Output: NDFA M such that L(M) = L(E).
if E is a, λ or ∅, return the a ∈ Σ Machine, λ Machine or ∅ Machine from above
else if E = E1 | E2 return Union(RE2FA(E1),RE2FA(E2))
else if E = E1E2 return Concatenation(RE2FA(E1),RE2FA(E2))
else if E = E1* return KleeneStar(RE2FA(E1))
So now we have an algorithm for converting regular expressions into finite automata. So what? Why do we care? Well we care because this is precisely what a utility like grep or Perl's regular expression facilities has to do in order to do pattern matching. They start with a user specified regular expression, convert to an NDFA, convert to a DFA, and convert that into a program. This program is what gets run on the input string.

Regular Exressions vs. Finite Automata
How does the set of all languages defined by regular expressions compare to the set of all languages accepted by finite automata? The previous algorithm tells us that the set of languages defined by regular expressions is contained within the set or languages accepted by finite automata. But what about the other way? Are there lanugages accepted by finite automata for which there is no regular expression? Do I care? Well, yes and no. No I don't care in the sense that I won't really want to convert finite automata into regular expressions. Yes I do care, however, because if there are languages that are accepted by finite automata but which cannot be represented by regular expressions, then I should extend regular expressions so that they take full advantage of the power of the finite automata into which we'll convert them.

Converting Finite Automata to Regular Expressions
Yes, any finite automaton can be converted into regular expression defining the language the automaton accepts. This means the set of all languages defined by regular expressions is equal to the set of all languages accepted by finite automata, so there's no point trying to extend the expressive power of regular expressions. Because of this equivalence, we refer to all these languages as regular languages.

The algorithm for this conversion is a little ugly. I'm not going to cover it in any great depth. The point is that it is possible ... it's not the kind of thing there's a great need to do in practice.

Input: NDFA M = (Q,Σ,Δ,s,W) containing no λ-transitions, (pa,x,pb) ∈ QxΣxQ where (pa,x,pb) ∉ Δ, and a table E such that Eij is a regular expression for {w ∈ Σ* | (qi,w) ⇒* (qj,λ)}.
Output: A table Enew such that Enewij is a regular expression for {w ∈ Σ* | in Mnew we have (qi,w)⇒*(qj,λ)} where Mnew = (Q,Σ,Δ∪{(pa,x,pb)},s,W)}.
  1. set Enewab = (Eaa)(x (Eba))*(Eab|x)(Ebb)
  2. for all (r,s) ≠ (a,b) set Enewrs = Ers|(Era)(Enewab)(Ebs)

We use this algorithm to build a table T such that T[p,q] is a regular expression defining the set of all strings that take you from state p to state q in the machine, i.e. all strings w such that (p,w)⇒*(q,λ). We build this table incrementally, by starting with a machine that is the same as the input, except all the transitions are missing. The table for this initial machine is clearly λs along the diagonal, i.e. T[p,p] = λ for each state p, and ∅ everywhere else, since there are no transitions to take you from one state to another. We add transitions one-by-one, updating the table along the way by the algorithm above. When all the transitions have been added, we create the final regular expression by taking the "or" of all table entries T[p,q] where p is the start state and q is an accepting state.

This algorithm works, but it creates horribly complicated regular expressions. I've implemented it, adding a little bit of simplification on the regular expressions that appear in the table. Even with that, the results are ugly. But, you do really get a regular expression at the end of the day that is equivalent to the input machine. The figure below shows the algorithm operating on input machine:

state 0state 1
state 0λ
state 1λ
state 0state 1
state 0λa
state 1λ
state 0state 1
state 0λa|b
state 1λ
state 0state 1
state 0(a|b)(a(a|b))*a|λ(a|b)(a(a|b))*a(a|b)|a|b
state 1(a(a|b))*a(a(a|b))*a(a|b)|λ
state 0state 1
state 0(a|b)(a(a|b))*a|λ|(((a|b)(a(a|b))*a(a|b)|a|b)(((a(a|b))*a(a|b)|λ)(b((a|b)(a(a|b))*a(a|b)|a|b))*(((a(a|b))*a)|b)((a|b)(a(a|b))*a|λ))((a|b)(a(a|b))*a|λ))(a|b)(a(a|b))*a(a|b)|a|b|(((a|b)(a(a|b))*a(a|b)|a|b)(((a(a|b))*a(a|b)|λ)(b((a|b)(a(a|b))*a(a|b)|a|b))*(((a(a|b))*a)|b)((a|b)(a(a|b))*a|λ))((a|b)(a(a|b))*a(a|b)|a|b))
state 1((a(a|b))*a(a|b)|λ)(b((a|b)(a(a|b))*a(a|b)|a|b))*(((a(a|b))*a)|b)((a|b)(a(a|b))*a|λ)(a(a|b))*a(a|b)|λ|(((a(a|b))*a(a|b)|λ)(((a(a|b))*a(a|b)|λ)(b((a|b)(a(a|b))*a(a|b)|a|b))*(((a(a|b))*a)|b)((a|b)(a(a|b))*a|λ))((a|b)(a(a|b))*a(a|b)|a|b))

To get our final answer, we note that since state 1 is the only accepting state, T[0,1] is our answer, i.e.

(a|b)(a(a|b))*a(a|b)|a|b|(((a|b)(a(a|b))*a(a|b)|a|b)(((a(a|b))*a(a|b)|λ)(b((a|b)(a(a|b))*a(a|b)|a|b))*(((a(a|b))*a)|b)((a|b)(a(a|b))*a|λ))((a|b)(a(a|b))*a(a|b)|a|b))
When you consider that (a|b)*(a|b) is also an equivalent regular expression, that looks pretty ugly. It is, however, equivalent. I actually checked!

Now here is finally a case where you're looking at an algorithm and probably thinking "why would I believe this works?" It's not clear that the table has the desired property, i.e. that T[p,q] define { w ∈ Σ*| (p,w) ⇒ (q,λ)}. Really we need to prove that in the algorithm described above the following is true:

if E is a table such that Eij is a regular expression for {w ∈ Σ* | (qi,w) ⇒*M (qj,λ)} then Enew is a table such that Enewij is a regular expression for {w ∈ Σ* | (qi,w)⇒*Mnew(qj,λ)} where Mnew = (Q,Σ,Δ∪{(pa,x,pb)},s,W)}.
This is not too bad to prove, but it requires the concept of inductive proof, which not all of you have seen at this point.


Christopher W Brown
Last modified: Fri Oct 9 11:13:28 EDT 2009