SI340: Regular Expressions and Finite Automata

Reading: Section 2.4 of Theory of Computing, A Gentle Introduction and, of course, these notes. Make sure you know what a "regular language" is!
Homework: Repeat to yourself 10 times: Any lanugage accepted by a DFA, can be accepted by an NDFA. Any language accepted by an NDFA can be defined by a Regular Expression. Any language defined by a Regular Expression can be accepted by an NDFA. Any language accepted by an NDFA can be accepted by a DFA. Then printout the homework and answer the questions on that paper.

Repeat: Formal definition of a regular expression

Now that we are (hopefully) familiar with using regular expressions in an informal way, it's time to give a formal defintions of a regular expression

definition of regular expression definition of language defined by a regular expression

Definition: A regular expression over the alphabet Σ is an expression derived from the following rules:

λ, ∅, and a, where a ∈ Σ, are all regular expressions

if E is a regular expression then (E)* is a regular expression

if E1 and E2 are regular expressions then (E1)(E2) and (E1)|(E2) are regular expressions

nothing else is a regular expression!

Definition: If R is a regular expression over alphabet Σ, then L(R), the language defined by regular expression R, is

{λ} if R = λ, { } if R = ∅, and {a} if R = a, where a ∈ Σ

L(E)* if R = (E)* for some regular expression E

L(E1)L(E2) (respectively L(E1) ∪ L(E2)) if R = (E1)(E2) (respectively R = (E1)|(E2)) whereE1 and E2 are regular expressions

Informal method for constructing FAs accepting languages defined by regular expressions

Hopefully you remember expression trees from data structures or parse trees from programming languages (if not, no fear, we described them in class). We went over a demo in class about how we could use the expression tree representation of a regular expression and apply our union, Kleene star and concatenation algorithms to construct a NDFA that accepts the language defined by a regular expression.

Converting Regular Expressions to Finite Automata

Since the only operations on regular expressions are union concatenation and Kleene star, those are the only real algorithms we need for FAs in order to do our conversion. Here they are (pulled from previous lectures):

Algorithm: Union: Input: Machines M1 = (Q1,Σ1,Δ1,s1,W1) and M2 = (Q2,Σ2,Δ2,s2,W2), where Q1 ∩ Q2 = ∅
Output: M = (Q1 ∪ Q2 ∪ {s}, Σ1 ∪ Σ2, Δ1 ∪ Δ2 ∪ {(s,λ,s1),(s,λ,s2)}, s, W1 ∪ W2), where s ∉ Q1 and s ∉ Q2 (i.e. s is a new state)
Algorithm: Concatenation: Input: Machines M1 = (Q1,Σ1,Δ1,s1,W1) and M2 = (Q2,Σ2,Δ2,s2,W2), where Q1 ∩ Q2 = ∅
Output: Machine M = (Q1 ∪ Q2, Σ1 ∪ Σ2, Δ, s1,W2), where Δ = Δ1 ∪ Δ2 ∪ {(q,λ,s2) | q ∈ W1}
Algorithm: KleeneStar: Input: Machine M = (Q, Σ, Δ, s, W)
Output: $M' = (Q \cup \{\$\}, \Sigma, \Delta \cup \{(\$,\lambda,s)\} \cup \{(q,\lambda,\$)\ |\ q \in W\}, \$, \{\$\})$, where $\$ \notin Q$, i.e. $\$$ is a new state

Now, in addition to operations, regular expressions may contain a, where a ∈ Σ, λ and ∅. So we need algorithms for building those machines. This is really easy!

a ∈ Σ Machine	λ Machine	∅ Machine
Input: a ∈ Σ Output: M = ({q0,q1},Σ,{(q0,a,q1)},q0,{q1})	Input: nothing Output: M = ({q0},Σ,{ },q0,{q0})	Input: nothing Output: M = ({q0},Σ,{ },q0,{ })

Now we really have everything we need for a recursive algorithm to convert a regular expression λ into a finite automaton.

Algorithm: RE2FA
Input: regular expression E
Output: NDFA M such that L(M) = L(E).
if E is a, λ or ∅, return the a ∈ Σ Machine, λ Machine or ∅ Machine from above
else if E = E1 | E2 return Union(RE2FA(E1),RE2FA(E2))
else if E = E1E2 return Concatenation(RE2FA(E1),RE2FA(E2))
else if E = E1* return KleeneStar(RE2FA(E1))

So now we have an algorithm for converting regular expressions into finite automata. So what? Why do we care? Well we care because this is precisely what a utility like grep or Perl's regular expression facilities has to do in order to do pattern matching. They start with a user specified regular expression, convert to an NDFA, convert to a DFA, and convert that into a program. This program is what gets run on the input string.

Regular Exressions vs. Finite Automata

How does the set of all languages defined by regular expressions compare to the set of all languages accepted by finite automata? The previous algorithm tells us that the set of languages defined by regular expressions is contained within the set or languages accepted by finite automata. But what about the other way? Are there lanugages accepted by finite automata for which there is no regular expression? Do I care? Well, yes and no. No I don't care in the sense that I won't really want to convert finite automata into regular expressions. Yes I do care, however, because if there are languages that are accepted by finite automata but which cannot be represented by regular expressions, then I should extend regular expressions so that they take full advantage of the power of the finite automata into which we'll convert them.

Converting Finite Automata to Regular Expressions

Yes, any finite automaton can be converted into regular expression defining the language the automaton accepts. This means the set of all languages defined by regular expressions is equal to the set of all languages accepted by finite automata, so there's no point trying to extend the expressive power of regular expressions. Because of this equivalence, we refer to all these languages as regular languages.

The algorithm we'll look at for this conversion is based on the idea of a diagram like an NFA, call it a "RE-NFA", which can have regular expressions as labels for the transitions. (Actually such a diagram could be made to operate as a machine like an NFA, but we won't deal with that here---we'll only deal with it as a diagram). With this RE-NFA in mind, the approach is to start with any NFA, and "collapse" it into a RE-NFA that has only a single transition, with the goal being that this final RE-transition is a Regular Expression capturing precisely the language of the original NFA.

Our algorithm takes as input an NFA with a couple of restrictions:

There can only be one accepting state.
There can be no transitions into the start state.
There can be no transitions out of the final state.

These restrictions are not really a problem, since for any NFA that violates one of these, we can easily build a new NFA that accepts the same language but meets all the criteria above by creating dummy state(s) with appropriate λ-transitions. Since there is only one final state, we'll call it $q_{n-1}$.

A note on notation: since transitions in these RE-NFAs may be regular expressions, the label for a transition from a state $q_p$ to a state $q_q$ will be the symbol $E_{p, q}$, which may represent any RE as opposed to an individual symbol in $\Sigma$.

Algorithm: FA2RE
Input: NDFA $M = (Q,\Sigma,\Delta,s,W)$ such that:

$Q = \{q_0, ..., q_{n-1}\}$,

For each $(q_a, x, q_b) \in \Delta$, $q_a \neq q_{n-1}$ and $q_b \neq q_0$,

$s = q_0$,

$W = \{q_{n-1}\}$.

Output: A regular expression for $L(M)$.
For any pair of states $q_j$ and $q_k$ (including when $j = k$), if there are $m$ arrows with labels $E_{j,k}^1, E_{j,k}^2, ..., E_{j,k}^m$, then replace them with a single arrow labeled with $E_{j, k} = (E_{j,k}^1 | E_{j,k}^2 | ... | E_{j,k}^m)$.
for $i = 1, 2, ..., n-2$ do
for every pair of states $q_j$ and $q_k$ (including when $j = k$) such that there is an arrow from $q_j$ to $q_i$ and an arrow from $q_i$ to $q_k$:
if there is no arrow from $q_i$ to $q_i$, then add an arrow from $q_j$ to $q_k$ labeled by $E_{j,i} E_{i,k}$.
else (i.e. there is an arrow from $q_i$ to $q_i$), add an arrow from $q_j$ to $q_k$ labeled by $E_{j,i} (E_{i,i})^* E_{i,k}$.

If there are arrows from $q_j$ to $q_k$ with labels $E_{j,k}^1, E_{j,k}^2, ..., E_{j,k}^m$, replace them by a single arrow with label $(E_{j,k}^1 | E_{j,k}^2 | ... | E_{j,k}^m)$.

Remove node $q_i$ and all the arrows going in and out of it.

Return the only remaining label, namely, $E_{0, n-1}$.

The following images depict the two possibilities for the conditional in step 1 of the loop. Which rewrite rule will be used depends on whether $q_i$ has an arrow to itself. If it does not, we just concatenate the labels of the arrow in and arrow out:

If it does, then we just need to add the Kleene-star of the label of the arrow $q_i$ has to itself to the middle of our new label:

Example. Suppose we take for our input a simple DFA accepting all strings over the alphabet $\Sigma = \{a, b\}$ which have odd length:

Since this DFA violates two of our input restrictions (a transition into the start state and a transition out of the final state), we'll start by adding a couple extra states with $\lambda$-transitions. We can also accomplish the first step of the algorithm, which is to simply rewrite the labels "a, b" with alternation REs such as "(a|b)".

Now we are ready for the main loop, which has only two iterations:

Notice that there are two applications of the transition rewrite rule, both of which are "Case 1", since $q_1$ does not have any transitions to itself. The two paths needing to be rewritten are $q_0$ to $q_1$ to $q_2$, and $q_2$ to $q_1$ to $q_2$. The latter of these creates a self-transition at $q_2$. Now we are ready to remove $q_2$, and of course this time it will be a "Case 2" rewrite because of the self-transition we just created:

And we are done! Our regular expression is (a|b)((a|b)(a|b))*, which indeed is a valid regular expression for the language accepted by this machine.

Christopher W Brown

Last modified: Thu Jun 28 18:03:28 EDT 2018

definition of regular expression	definition of language defined by a regular expression
Definition: A regular expression over the alphabet Σ is an expression derived from the following rules: λ, ∅, and a, where a ∈ Σ, are all regular expressions if E is a regular expression then (E)* is a regular expression if E1 and E2 are regular expressions then (E1)(E2) and (E1)\|(E2) are regular expressions nothing else is a regular expression!	Definition: If R is a regular expression over alphabet Σ, then L(R), the language defined by regular expression R, is {λ} if R = λ, { } if R = ∅, and {a} if R = a, where a ∈ Σ L(E)* if R = (E)* for some regular expression E L(E1)L(E2) (respectively L(E1) ∪ L(E2)) if R = (E1)(E2) (respectively R = (E1)\|(E2)) whereE1 and E2 are regular expressions