SI413 Class 8: Top-down and Bottom-up parsing

Section 2.3.1 (+ Section 2.3 intro) of Programming Language Pragmatics 3rd Ed.

Problems 20-22,24-26 on p. 74 of Programming Language Pragmatics 3rd Ed.

HW Review
We looked at solutions to some of the problems due today.

Reprising scanning
Recall that if we're writing a scanner by hand, we'll usually draw out a deterministic finite automaton (DFA) that recognizes the tokens, then implement that automaton in code ... perhaps with nested switch/case statements. If we make a language change that adds, removes, or modifies tokens, it may have a big effect on the DFA, and thus require substantial rewriting of the code that implements the DFA.
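As a rough illustration (not from the class itself), here is a minimal sketch of a hand-coded scanner for an invented toy token set: identifiers and integer literals. Each branch plays the role of a DFA state; in C you would typically see this as nested switch/case statements over a state variable. Notice that adding or changing a token kind means adding or reworking states and branches by hand, which is exactly the maintenance burden described above.

```python
# Sketch of a hand-coded DFA scanner for a toy token set:
# identifiers ([a-zA-Z][a-zA-Z0-9]*) and integer literals ([0-9]+).
# The token set is invented for illustration.

def scan(text):
    """Return a list of (token_type, lexeme) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        if c.isspace():                 # skip whitespace between tokens
            i += 1
        elif c.isalpha():               # DFA state: inside an identifier
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            tokens.append(("ID", text[start:i]))
        elif c.isdigit():               # DFA state: inside an integer literal
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            tokens.append(("NUM", text[start:i]))
        else:
            raise ValueError(f"scan error at character {i}: {c!r}")
    return tokens
```

For example, `scan("x1 42 foo")` produces `[("ID", "x1"), ("NUM", "42"), ("ID", "foo")]`.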

Tools like flex work a different way. As users we specify tokens using regular expressions. Then flex builds an FA for each regular expression and merges all the resulting FA's. The merged FA is converted into a DFA, which is then minimized. This is then converted into code --- usually with a state-character table lookup. The advantage, as I hope you saw in the lab, is that changes in the language that add or remove or modify tokens usually require only minor changes to a flex file. The major changes to the DFA that result are all handled automatically.

Parsing is the process of taking a grammar and a string of tokens and producing, either implicitly or explicitly, a derivation for the token string in the grammar. Of course if no such derivation is possible, we have a parse error. This is not an entirely new problem for us. Recognizing strings in a language defined by a context-free grammar was something we said in Theory class we could do with a push-down automaton, i.e. an NDFA + stack. Of course in practice, we can't really accept non-determinism, so our struggle will be to use deterministic control along with a stack.

There are two basic approaches to parsing: top-down or bottom-up. The primary difference is how they use the stack. In bottom-up parsing, which we discussed in Theory, symbols are pushed onto a stack and, when we recognize the right-hand side of a grammar rule on top of the stack (in reverse order, actually), we pop those symbols off the stack and push the non-terminal symbol on the left-hand side of the rule. In top-down parsing we pop a non-terminal off the stack, look for a rule with that symbol on the left-hand side, then push the symbols comprising the right-hand side of that rule onto the stack (in reverse order, actually). If a bottom-up parser successfully parses a string, we should finish with the start symbol on the stack. When a top-down parser parses a string it begins with the start symbol on the stack.
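To make the bottom-up stack discipline concrete, here is a rough sketch (not from the class) of a shift-reduce parser for the tiny grammar S -> T T, T -> aa, T -> bb. It shifts input symbols onto a stack and reduces whenever the top of the stack matches a rule's right-hand side. The greedy "reduce whenever you can" strategy happens to work for this grammar; real bottom-up parsers use lookahead and tables to decide when to shift versus reduce.

```python
# Sketch of a bottom-up (shift-reduce) parser for the tiny grammar
#   S -> T T,  T -> aa,  T -> bb
# Greedy reduction is a simplification that works for this grammar only.

RULES = [("S", ["T", "T"]), ("T", ["a", "a"]), ("T", ["b", "b"])]

def parse_bottom_up(tokens):
    """Return True iff tokens can be reduced to the start symbol S."""
    stack = []
    for tok in tokens:
        stack.append(tok)                   # shift the next input symbol
        reduced = True
        while reduced:                      # reduce while some RHS matches
            reduced = False
            for lhs, rhs in RULES:
                if stack[-len(rhs):] == rhs:
                    del stack[-len(rhs):]   # pop the right-hand side
                    stack.append(lhs)       # push the left-hand side
                    reduced = True
                    break
    return stack == ["S"]                   # success: only S remains
```

Tracing `parse_bottom_up(list("bbaa"))`: shift b, shift b, reduce to T; shift a, shift a, reduce to T; then T T reduces to S, so the parse succeeds.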

We discussed LL and LR: LL parsers read the input Left-to-right and produce a Leftmost derivation (top-down), while LR parsers read the input Left-to-right and produce a Rightmost derivation in reverse (bottom-up). Look in the book for more on the difference.

Top-down / bottom-up demo
In class we looked at a very simple grammar:
S -> T T
T -> aa
T -> bb
and walked through bottom-up and top-down parsing of a simple string like bbaa. We looked at the top-down in more detail, and what we really noticed was that at some points we had a decision to make. Namely, if you pop T from the stack, should you choose the aa right-hand side or the bb right-hand side? What we determined was that making this decision requires peeking at the next input symbol. In this case it suffices to peek at just one token. Grammars that we can parse top-down with just one token of lookahead are called LL(1). In a grammar like this:
S -> T T
T -> ab
T -> aa
just one token of lookahead isn't enough. You need two! When two tokens aren't enough, maybe you need three. In general, though, what we should try to do is rewrite the grammar so that the problem goes away. Here the problem is that there is a common prefix between two right-hand sides with the same left-hand side; factoring that prefix out is called left factoring. The other enemy of LL(1)-ness is left recursion, like exp -> exp OPA term. At any rate, we can rewrite the above grammar as
S -> T T
T -> a X
X -> b
X -> a
... which generates the same language, but which is LL(1).
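To see that one token of lookahead really does suffice after left factoring, here is a rough sketch (not from the class) of a table-driven predictive parser for the rewritten grammar S -> T T, T -> a X, X -> b, X -> a. When a non-terminal is on top of the stack, the next input token alone selects which rule to expand.

```python
# Sketch of a top-down (predictive) parser for the left-factored grammar
#   S -> T T,  T -> a X,  X -> b,  X -> a
# Parse table: (non-terminal, lookahead token) -> right-hand side to push.

TABLE = {
    ("S", "a"): ["T", "T"],
    ("T", "a"): ["a", "X"],
    ("X", "a"): ["a"],
    ("X", "b"): ["b"],
}

def parse_top_down(tokens):
    """Return True iff tokens is in the language of the grammar."""
    stack = ["S"]                           # begin with the start symbol
    i = 0
    while stack:
        top = stack.pop()
        if top in ("a", "b"):               # terminal: must match the input
            if i < len(tokens) and tokens[i] == top:
                i += 1
            else:
                return False
        else:                               # non-terminal: consult lookahead
            look = tokens[i] if i < len(tokens) else None
            rhs = TABLE.get((top, look))
            if rhs is None:                 # no applicable rule: parse error
                return False
            stack.extend(reversed(rhs))     # push RHS in reverse order
    return i == len(tokens)                 # success: all input consumed
```

For example, `parse_top_down(list("abaa"))` succeeds: popping X with lookahead b picks X -> b, and popping X with lookahead a picks X -> a, with no guessing required.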

Christopher W Brown
Last modified: Wed Sep 16 15:41:54 EDT 2009