definition of regular expression definition of language defined by a regular expression Definition: A regular expression over the alphabet Σ is an expression derived from the following rules:
- λ, ∅, and a, where a ∈ Σ, are all regular expressions
- if E is a regular expression then (E)* is a regular expression
- if E1 and E2 are regular expressions then (E1)(E2) and (E1)|(E2) are regular expressions
- nothing else is a regular expression!
Definition: If R is a regular expression over alphabet Σ, then L(R), the language defined by regular expression R, is
- {λ} if R = λ, { } if R = ∅, and {a} if R = a, where a ∈ Σ
- L(E)* if R = (E)* for some regular expression E
- L(E1)L(E2) (respectively L(E1) ∪ L(E2)) if R = (E1)(E2) (respectively R = (E1)|(E2)) whereE1 and E2 are regular expressions
| a ∈ Σ Machine | λ Machine | ∅ Machine |
|
Input: a ∈ Σ Output: M = ({q0,q1},Σ,{(q0,a,q1)},q0,{q1}) |
Input: nothing Output: M = ({q0},Σ,{ },q0,{q0}) |
Input: nothing Output: M = ({q0},Σ,{ },q0,{ }) |
![]() |
![]() |
![]() |
Now we really have everything we need for a recursive algorithm to convert a regular expression λ into a finite automaton.
Algorithm: RE2FASo now we have an algorithm for converting regular expressions into finite automata. So what? Why do we care? Well we care because this is precisely what a utility like grep or Perl's regular expression facilities has to do in order to do pattern matching. They start with a user specified regular expression, convert to an NDFA, convert to a DFA, and convert that into a program. This program is what gets run on the input string.
Input: regular expression E
Output: NDFA M such that L(M) = L(E).
if E is a, λ or ∅, return the a ∈ Σ Machine, λ Machine or ∅ Machine from above
else if E = E1 | E2 return Union(RE2FA(E1),RE2FA(E2))
else if E = E1E2 return Concatenation(RE2FA(E1),RE2FA(E2))
else if E = E1* return KleeneStar(RE2FA(E1))
The algorithm we'll look at for this conversion is based on the idea of a diagram like an NFA, call it a "RE-NFA", which can have regular expressions as labels for the transitions. (Actually such a diagram could be made to operate as a machine like an NFA, but we won't deal with that here---we'll only deal with it as a diagram). With this RE-NFA in mind, the approach is to start with any NFA, and "collapse" it into a RE-NFA that has only a single transition, with the goal being that this final RE-transition is a Regular Expression capturing precisely the language of the original NFA.
Our algorithm takes as input an NFA with a couple of restrictions:
A note on notation: since transitions in these RE-NFAs may be regular expressions, the label for a transition from a state $q_p$ to a state $q_q$ will be the symbol $E_{p, q}$, which may represent any RE as opposed to an individual symbol in $\Sigma$.
Algorithm: FA2REThe following images depict the two possibilities for the conditional in step 1 of the loop. Which rewrite rule will be used depends on whether $q_i$ has an arrow to itself. If it does not, we just concatenate the labels of the arrow in and arrow out:
Input: NDFA $M = (Q,\Sigma,\Delta,s,W)$ such that:Output: A regular expression for $L(M)$.
- $Q = \{q_0, ..., q_{n-1}\}$,
- For each $(q_a, x, q_b) \in \Delta$, $q_a \neq q_{n-1}$ and $q_b \neq q_0$,
- $s = q_0$,
- $W = \{q_{n-1}\}$.
For any pair of states $q_j$ and $q_k$ (including when $j = k$), if there are $m$ arrows with labels $E_{j,k}^1, E_{j,k}^2, ..., E_{j,k}^m$, then replace them with a single arrow labeled with $E_{j, k} = (E_{j,k}^1 | E_{j,k}^2 | ... | E_{j,k}^m)$.
for $i = 1, 2, ..., n-2$ dofor every pair of states $q_j$ and $q_k$ (including when $j = k$) such that there is an arrow from $q_j$ to $q_i$ and an arrow from $q_i$ to $q_k$:Return the only remaining label, namely, $E_{0, n-1}$.
- if there is no arrow from $q_i$ to $q_i$, then add an arrow from $q_j$ to $q_k$ labeled by $E_{j,i} E_{i,k}$.
else (i.e. there is an arrow from $q_i$ to $q_i$), add an arrow from $q_j$ to $q_k$ labeled by $E_{j,i} (E_{i,i})^* E_{i,k}$.- If there are arrows from $q_j$ to $q_k$ with labels $E_{j,k}^1, E_{j,k}^2, ..., E_{j,k}^m$, replace them by a single arrow with label $(E_{j,k}^1 | E_{j,k}^2 | ... | E_{j,k}^m)$.
- Remove node $q_i$ and all the arrows going in and out of it.