Class 18: Regular Expressions and Finite Automata

Reading: Section 2.4 of Theory of Computing, A Gentle Introduction and, of course, these notes. Make sure you know what a "regular language" is!
Homework: Repeat to yourself 10 times: Any lanugage accepted by a DFA, can be accepted by an NDFA. Any language accepted by an NDFA can be defined by a Regular Expression. Any language defined by a Regular Expression can be accepted by an NDFA. Any language accepted by an NDFA can be accepted by a DFA. Then printout the homework and answer the questions on that paper.

Repeat: Formal definition of a regular expression

Now that we are (hopefully) familiar with using regular expressions in an informal way, it's time to give a formal defintions of a regular expression

definition of regular expression definition of language defined by a regular expression

Definition: A regular expression over the alphabet Σ is an expression derived from the following rules:

λ, ∅, and a, where a ∈ Σ, are all regular expressions

if E is a regular expression then (E)* is a regular expression

if E1 and E2 are regular expressions then (E1)(E2) and (E1)|(E2) are regular expressions

nothing else is a regular expression!

Definition: If R is a regular expression over alphabet Σ, then L(R), the language defined by regular expression R, is

{λ} if R = λ, { } if R = ∅, and {a} if R = a, where a ∈ Σ

L(E)* if R = (E)* for some regular expression E

L(E1)L(E2) (respectively L(E1) ∪ L(E2)) if R = (E1)(E2) (respectively R = (E1)|(E2)) whereE1 and E2 are regular expressions

Informal method for constructing FAs accepting languages defined by regular expressions

Hopefully you remember expression trees from data structures or parse trees from programming languages. We went over a demo in class about how we could use the expression tree representation of a regular expression and apply our union, Kleene star and concatenation algorithms to construct a NDFA that accepts the language defined by a regular expression.

Converting Regular Expressions to Finite Automata

Since the only operations on regular expressions are union concatenation and Kleene star, those are the only real algorithms we need for FAs in order to do our conversion. Here they are (pulled from previous lectures):

Algorithm: Union: Input: Machines M1 = (Q1,Σ1,Δ1,s1,W1) and M2 = (Q2,Σ2,Δ2,s2,W2), where Q1 ∩ Q2 = ∅
Output: M = (Q1 ∪ Q2 &cup {s}, Σ1 ∪ Σ2, Δ1 ∪ Δ2 ∪ {(s,λ,s1),(s,λ,s2)}, s, W1 ∪ W2), where s ∉ Q1 and s ∉ Q2 (i.e. s is a new state)
Algorithm: Concatenation: Input: Machines M1 = (Q1,Σ1,Δ1,s1,W1) and M2 = (Q2,Σ2,Δ2,s2,W2), where Q1 ∩ Q2 = ∅
Output: Machine M = (Q1 ∪ Q2, Σ1 ∪ Σ2, Δ, s1,W2), where Δ = Δ1 ∪ Δ2 ∪ {(q,λ,s2) | q ∈ W1}
Algorithm: KleeneStar: Input: Machine M = (Q, Σ, Δ, s, W)
Output: M' = (Q ∪ {$}, Σ, Δ ∪ {($,λ,s)} ∪ {(q,λ,$) | q ∈ W}, $, {$}), where $ ∉ Q, i.e. $ is a new state

Now, in addition to operations, regular expressions may contain a, where a ∈ Σ, λ and ∅. So we need algorithms for building those machines. This is really easy!

a ∈ Σ Machine λ Machine ∅ Machine

Input: a ∈ Σ
Output: M = ({q0,q1},Σ,{(q0,a,q1)},q0,{q1}) Input: nothing
Output: M = ({q0},Σ,{ },q0,{q0}) Input: nothing
Output: M = ({q0},Σ,{ },q0,{ })

Now we really have everything we need for a recursive algorithm to convert a regular expression λ into a finite automaton.

Algorithm: RE2FA
Input: regular expression E
Output: NDFA M such that L(M) = L(E).
if E is a, λ or ∅, return the a ∈ Σ Machine, λ Machine or ∅ Machine from above
else if E = E1 | E2 return Union(RE2FA(E1),RE2FA(E2))
else if E = E1E2 return Concatenation(RE2FA(E1),RE2FA(E2))
else if E = E1* return KleeneStar(RE2FA(E1))

So now we have an algorithm for converting regular expressions into finite automata. So what? Why do we care? Well we care because this is precisely what a utility like grep or Perl's regular expression facilities has to do in order to do pattern matching. They start with a user specified regular expression, convert to an NDFA, convert to a DFA, and convert that into a program. This program is what gets run on the input string.

Regular Exressions vs. Finite Automata

How does the set of all languages defined by regular expressions compare to the set of all languages accepted by finite automata? The previous algorithm tells us that the set of languages defined by regular expressions is contained within the set or languages accepted by finite automata. But what about the other way? Are there lanugages accepted by finite automata for which there is no regular expression? Do I care? Well, yes and no. No I don't care in the sense that I won't really want to convert finite automata into regular expressions. Yes I do care, however, because if there are languages that are accepted by finite automata but which cannot be represented by regular expressions, then I should extend regular expressions so that they take full advantage of the power of the finite automata into which we'll convert them.

Converting Finite Automata to Regular Expressions

Yes, any finite automaton can be converted into regular expression defining the language the automaton accepts. This means the set of all languages defined by regular expressions is equal to the set of all languages accepted by finite automata, so there's no point trying to extend the expressive power of regular expressions. Because of this equivalence, we refer to all these languages as regular languages.

The algorithm for this conversion is a little ugly. I'm not going to cover it in any great depth. The point is that it is possible ... it's not the kind of thing there's a great need to do in practice.

Input: NDFA M = (Q,Σ,Δ,s,W) containing no λ-transitions, (p_a,x,p_b) ∈ QxΣxQ where (p_a,x,p_b) ∉ Δ, and a table E such that E_ij is a regular expression for {w ∈ Σ* | (q_i,w) ⇒* (q_j,λ)}.
Output: A table E^new such that E^new_ij is a regular expression for {w ∈ Σ* | in M^new we have (q_i,w)⇒*(q_j,λ)} where M^new = (Q,Σ,Δ∪{(p_a,x,p_b)},s,W)}.

set E^new_ab = (E_aa)(x (E_ba))*(E_ab|x)(E_bb)

for all (r,s) ≠ (a,b) set E^new_rs = E_rs|(E_ra)(E^new_ab)(E_bs)

We use this algorithm to build a table T such that T[p,q] is a regular expression defining the set of all strings that take you from state p to state q in the machine, i.e. all strings w such that (p,w)⇒*(q,λ). We build this table incrementally, by starting with a machine that is the same as the input, except all the transitions are missing. The table for this initial machine is clearly λs along the diagonal, i.e. T[p,p] = λ for each state p, and ∅ everywhere else, since there are no transitions to take you from one state to another. We add transitions one-by-one, updating the table along the way by the algorithm above. When all the transitions have been added, we create the final regular expression by taking the "or" of all table entries T[p,q] where p is the start state and q is an accepting state.

This algorithm works, but it creates horribly complicated regular expressions. I've implemented it, adding a little bit of simplification on the regular expressions that appear in the table. Even with that, the results are ugly. But, you do really get a regular expression at the end of the day that is equivalent to the input machine. The figure below shows the algorithm operating on input machine:

	state 0	state 1
state 0	λ	∅
state 1	∅	λ

	state 0	state 1
state 0	λ	a
state 1	∅	λ

	state 0	state 1
state 0	λ	a\|b
state 1	∅	λ

	state 0	state 1
state 0	(a\|b)(a(a\|b))*a\|λ	(a\|b)(a(a\|b))*a(a\|b)\|a\|b
state 1	(a(a\|b))*a	(a(a\|b))*a(a\|b)\|λ

	state 0	state 1
state 0	(a\|b)(a(a\|b))a\|λ\|(((a\|b)(a(a\|b))a(a\|b)\|a\|b)(((a(a\|b))a(a\|b)\|λ)(b((a\|b)(a(a\|b))a(a\|b)\|a\|b))(((a(a\|b))a)\|b)((a\|b)(a(a\|b))a\|λ))((a\|b)(a(a\|b))a\|λ))	(a\|b)(a(a\|b))a(a\|b)\|a\|b\|(((a\|b)(a(a\|b))a(a\|b)\|a\|b)(((a(a\|b))a(a\|b)\|λ)(b((a\|b)(a(a\|b))a(a\|b)\|a\|b))(((a(a\|b))a)\|b)((a\|b)(a(a\|b))a\|λ))((a\|b)(a(a\|b))a(a\|b)\|a\|b))
state 1	((a(a\|b))a(a\|b)\|λ)(b((a\|b)(a(a\|b))a(a\|b)\|a\|b))(((a(a\|b))a)\|b)((a\|b)(a(a\|b))*a\|λ)	(a(a\|b))a(a\|b)\|λ\|(((a(a\|b))a(a\|b)\|λ)(((a(a\|b))a(a\|b)\|λ)(b((a\|b)(a(a\|b))a(a\|b)\|a\|b))(((a(a\|b))a)\|b)((a\|b)(a(a\|b))a\|λ))((a\|b)(a(a\|b))a(a\|b)\|a\|b))

To get our final answer, we note that since state 1 is the only accepting state, T[0,1] is our answer, i.e.

(a|b)(a(a|b))*a(a|b)|a|b|(((a|b)(a(a|b))*a(a|b)|a|b)(((a(a|b))*a(a|b)|λ)(b((a|b)(a(a|b))*a(a|b)|a|b))*(((a(a|b))*a)|b)((a|b)(a(a|b))*a|λ))((a|b)(a(a|b))*a(a|b)|a|b))

When you consider that (a|b)*(a|b) is also an equivalent regular expression, that looks pretty ugly. It is, however, equivalent. I actually checked!

Now here is finally a case where you're looking at an algorithm and probably thinking "why would I believe this works?" It's not clear that the table has the desired property, i.e. that T[p,q] define { w ∈ Σ*| (p,w) ⇒ (q,λ)}. Really we need to prove that in the algorithm described above the following is true:

if E is a table such that E_ij is a regular expression for {w ∈ Σ* | (q_i,w) ⇒*_M (q_j,λ)} then E^new is a table such that E^new_ij is a regular expression for {w ∈ Σ* | (q_i,w)⇒*_M^new(q_j,λ)} where M^new = (Q,Σ,Δ∪{(p_a,x,p_b)},s,W)}.

This is not too bad to prove, but it requires the concept of inductive proof, which not all of you have seen at this point.

Christopher W Brown

Last modified: Fri Oct 9 11:13:28 EDT 2009

definition of regular expression	definition of language defined by a regular expression
Definition: A regular expression over the alphabet Σ is an expression derived from the following rules: λ, ∅, and a, where a ∈ Σ, are all regular expressions if E is a regular expression then (E)* is a regular expression if E1 and E2 are regular expressions then (E1)(E2) and (E1)\|(E2) are regular expressions nothing else is a regular expression!	Definition: If R is a regular expression over alphabet Σ, then L(R), the language defined by regular expression R, is {λ} if R = λ, { } if R = ∅, and {a} if R = a, where a ∈ Σ L(E)* if R = (E)* for some regular expression E L(E1)L(E2) (respectively L(E1) ∪ L(E2)) if R = (E1)(E2) (respectively R = (E1)\|(E2)) whereE1 and E2 are regular expressions

a ∈ Σ Machine	λ Machine	∅ Machine
Input: a ∈ Σ Output: M = ({q0,q1},Σ,{(q0,a,q1)},q0,{q1})	Input: nothing Output: M = ({q0},Σ,{ },q0,{q0})	Input: nothing Output: M = ({q0},Σ,{ },q0,{ })