Tools like flex work a different way. As users
we specify tokens using regular expressions. Then flex
builds FA's for each regular expression and merges all the
resulting FA's. The resulting FA is turned into a DFA which
is then minimized. This is then converted into code ---
usually with a state-character table lookup. The advantage,
as I hope you saw in the lab, is that changes in the
language that
add or remove or modify tokens usually require minor changes
to a flex file. The major changes to the DFA that result
are all handled automatically.
There are two basic approaches to parsing: top-down or bottom-up. They primary difference is how they use the stack. In bottom-up parsing, which we discussed in Theory, symbols are pushed onto a stack and, when we recognize the right-hand side of a grammar rule on top of the stack (in reverse order, actually), we pop those symobls off the stack and push the non-terminal symbol on the right-hand side of the rule. In top-down parsing we pop a non-terminal off the stack, look for a rule with that symbol on the lef-hand side, then push the symbols comprising the righ-hand side of that rule onto the stack (in reverse order, actually). If a bottom-up parser successfully parses a string, we should finish with the start symbol on the stack. When a top-down parser parses a string it begins with the start symbol on the stack.
We discussed LL and LR. Look in the book to find the difference.
S -> T T T -> aa T -> bband walked through bottom-up and top-down parsing of a simple string like
bbaa.
We looked at the top-down in more detail, and what we really
noticed was that at some points we had a decision to
make. Namely, if you pop T from the stack, should
you choose the aa right-hand side or the
bb right-hand side. What we determined was that
making this decision requires peeking at the next input
symbol. In this case it suffices to simply peek at one
token. Grammars that we can parse top-down with just one
character of lookahead are called LL(1). In a grammar like
this:
S -> T T T -> ab T -> aajust one character of lookahead isn't enough. You need two! When two characters aren't enough, maybe you need three. In general, though, what we should try to do is rewrite the grammar so that the problem goes away. Here the problem is that there is a common prefix between two right-hand sides with the same left-hand side. The other enemy of LL(1)-ness is left recursion, like
exp -> exp OPA term. At any rate, we can rewrite
the above grammar as
S -> T T T -> a X X -> b X -> a... which generates the same language, but which is LL(1).