Tools like flex work a different way. As users
we specify tokens using regular expressions. Then flex
builds FA's for each regular expression and merges all the
resulting FA's. The resulting FA is turned into a DFA which
is then minimized. This is then converted into code ---
usually with a state-character table lookup. The advantage,
as I hope you saw in the lab, is that changes in the
language that
add or remove or modify tokens usually require minor changes
to a flex file. The major changes to the DFA that result
are all handled automatically.
This is not an entirely new problem for us. Recognizing strings in a language defined by a context-free grammar was something we said in Theory class we could do with a push-down automaton, i.e. a NDFA+stack. Of course in practice, we can't really accept non-determinism, so our struggle will be to use deterministic control along with a stack.
There are two basic approaches to parsing: top-down or bottom-up. The primary difference between them is how they use the stack. In bottom-up parsing, which we discussed in Theory, symbols are pushed onto a stack and, when we recognize the right-hand side of a grammar rule on top of the stack (in reverse order, actually), we pop those symobls off the stack and push the non-terminal symbol on the left-hand side of the rule. In top-down parsing we pop a non-terminal off the stack, look for a rule with that symbol on the left-hand side, then push the symbols comprising the right-hand side of that rule onto the stack (in reverse order, actually). If a bottom-up parser successfully parses a string, we should finish with the start symbol on the stack. When a top-down parser parses a string it begins with the start symbol on the stack. Consider the grammar $A \rightarrow abc$:
| Top down | Bottom up |
input: abc stack: A||
# pop A and push a b c in reverse order
abc stack: a b c||
# read a and pop a
bc stack: b c||
# read b and pop b
c stack: c||
# read c and pop c
stack: || |
input: abc stack: ||
# read a and push a
bc stack: a||
# read b and push b
c stack: b a||
# read c and push c
stack: c b a||
# pop a b c in reverse order and push A
stack: A|| |
When we have more than one rule in the grammar things get more interesting.
TOP DOWN (LL) left-to-right scan of input, leftmost derivation 1. Start with only start symbol on stack 2. End with empty stack 3. At each step, we either a. expand by a grammer rule b. match / consume input character Challenge: Which rule to expand by?!?!? |
BOTTOM UP (LR) left-to-right scan of input, rightmost derivation 1. Start with empty stack 2. End with only start symbol on stack 3. At each step, we either a. shift input character b. reduce by a grammar rule Challenge: Shift or reduce? Reduce by what?? |
S -> T S | ; T -> aa | bbADD DEMO NOTES and walked through bottom-up and top-down parsing of a simple string like
bbaa;.
We looked at the top-down in more detail, and what we really
noticed was that at some points we had a decision to
make. Namely, if you pop T from the stack, should
you choose the aa right-hand side or the
bb right-hand side. What we determined was that
making this decision requires peeking at the next input
symbol. In this case it suffices to simply peek at one
token. Grammars that we can parse top-down with just one
character of lookahead are called LL(1). In a grammar like
this:
S -> T S | ; T -> aa | abjust one character of lookahead isn't enough. You need two! When two characters aren't enough, maybe you need three. When faced with a grammar that is not LL(1), we should try to rewrite the grammar so that the problem goes away. That is, we should try to write a new grammar for the same language that is LL(1). Unfortunately, this is not always possible (check "Properties of deterministic top down grammars", Rosenkrantz and Stearns). However, often it is possible for the languages we care about for programming. However, it may take some work.
In the previous example grammar, the problem is
that there is a common prefix between two right-hand
sides with the same left-hand side. The right-hand sides of
S -> aa and S -> ab have the common
prefix "a". To see how this causes problems, imagine we are
trying to parse ab;. We start with stack
S||, and lookahead symbol "a". Since the
lookahead is not ";", we choose rule S -> T S and
pop S off the stack and push T S in reverse order, leaving us
with the stack T S||. Now, with T on top of the
stack, we need to choose between rule T -> aa and
rule T -> ab. The problem is that both rules are
consistent with our lookahead of "a", so we don't know which
to choose! Sometimes we can solve this by "factoring out" the
common prefix. In this example, we might do it this way:
S -> T S | ; T -> a T' T'-> a | bYou should be able to convince yourself that this grammar defines the same language as the previous one. Moreover, this grammar is LL(1). How do I know? Because if I know what's on the top of the stack, and I know the lookahead, it's now unambiguous which rule I want to expand by. You should be able to convince yourself of this. The other enemy of LL(1)-ness is left recursion, like in the rule from our usual arithmetic grammar:x
exp -> exp OPA term. Let's consider a simple
grammar with left-recursion:
S -> S @ X | X X -> a | bThe problem here is that if we have S on the top of the stack and the lookahead is "a" or "b", we don't know whether to expand with rule
S -> S @ X or rule S -> X. Our
first attempt to fix this is to turn the left-recursion into right
recursion. After all, we just have @-separated lists of a's and
b's. Do we really care whether the grammar generates them from
left to right or right to left?
S -> X @ S | X X -> a | bWe've fixed the left recursion, but know we have a common prefix problem! Specifically, both rules for S have the common prefix X. So, we fix that by factoring out the common prefix:
S -> X T T -> @ X T | epsilon X -> a | bAnd this is now LL(1). How can I tell? Well, if S is at the top of the stack, there's only one rule. If X is at the top of the stack, the lookahead will tell us if we want the rule that produces "a" or the rule that produces "b". So what if T is at the top of the stack? Well then either the lookahead is "@", in which case I expand by rule
T -> @ X T, or the
lookahead is "a" or "b", in which case we have an error, or the
lookahead is empty because we've come to the end of the input,
in which case we expand by T -> epsilon. This
solution to left recursion is really common, but usually we'd
call the new non-terminal "T" something like $S_{\text{tail}}$.
So the rules $S \rightarrow S @ X\ |\ X$ are replaced with
$S \rightarrow X S_{\text{tail}}$, and we add rules
$S_{\text{tail}} \rightarrow @ X S_{\text{tail}}\ |\ \epsilon$.