SI413: Top-down and Bottom-up parsing

Reading

Section 2.3 up to 2.3.1 of Programming Language Pragmatics.

Homework

Do the hw.

Reprising scanning

Recall that if we're writing a scanner by hand, we'll usually draw out a detemrinistic finite automaton (DFA) that recognizes the tokens, then implement that automaton in code ... perhaps with nested switch/case statements. If we make a lanugage change that adds or removes or modifies tokens, it may have a big effect on the DFA, and thus require substantial rewriting of the code that implements the DFA

Tools like flex work a different way. As users we specify tokens using regular expressions. Then flex builds FA's for each regular expression and merges all the resulting FA's. The resulting FA is turned into a DFA which is then minimized. This is then converted into code --- usually with a state-character table lookup. The advantage, as I hope you saw in the lab, is that changes in the language that add or remove or modify tokens usually require minor changes to a flex file. The major changes to the DFA that result are all handled automatically.

Parsing

For our purposes, we will define parsing as follows:

Definition: To parse string s with respect to grammar G means to traverse the/a parse tree for s, if s is in L(G), and to produce an error message otherwise.

It may seem to you that it would have been more obivous to define parsing as producing a parse tree, but there are cases in which we don't need an explicit representation in memory of the parse tree, so we opt for "traverse the parse tree" instead. For example, if all you wanted to do was determine whether the input string is in L(G), there's no reason to create a parse tree data structure. Note that if you took the appropriate actions while traversing the parse tree, you could produce an explicit data structure for the parse tree, so we haven't really lost anything with this defintion.

This is not an entirely new problem for us. Recognizing strings in a language defined by a context-free grammar was something we said in Theory class we could do with a push-down automaton, i.e. a NDFA+stack. Of course in practice, we can't really accept non-determinism, so our struggle will be to use deterministic control along with a stack.

There are two basic approaches to parsing: top-down or bottom-up. The primary difference between them is how they use the stack. In bottom-up parsing, which we discussed in Theory, symbols are pushed onto a stack and, when we recognize the right-hand side of a grammar rule on top of the stack (in reverse order, actually), we pop those symobls off the stack and push the non-terminal symbol on the left-hand side of the rule. In top-down parsing we pop a non-terminal off the stack, look for a rule with that symbol on the left-hand side, then push the symbols comprising the right-hand side of that rule onto the stack (in reverse order, actually). If a bottom-up parser successfully parses a string, we should finish with the start symbol on the stack. When a top-down parser parses a string it begins with the start symbol on the stack. Consider the grammar $A \rightarrow abc$:

Top down

Bottom up

input: abc        stack:         A||	  
                                     # pop A and push a b c in reverse order
       abc        stack:     a b c|| 
                                     # read a and pop a
        bc        stack:       b c||
                                     # read b and pop b
         c        stack:         c||
                                     # read c and pop c
                  stack:          ||

input: abc        stack:          ||	  
                                     # read a and push a
        bc        stack:         a||
                                     # read b and push b
         c        stack:       b a||
                                     # read c and push c
                  stack:     c b a||
                                     # pop a b c in reverse order and push A
                  stack:         A||

When we have more than one rule in the grammar things get more interesting.

In the top down approach, when the top of the stack is a terminal we know what to do: pop it, read the next character: if they are not the same throw an error, otherwise continue. When the top of the stack is a non-terminal $X$, we pop it, find a grammer rule $X\rightarrow w$, and push the symbols in $w$ in reverse order. But when there are multiple such rules, which do we use? This is the central problem we must solve to implement top-down parsing.
In the bottom up approach, at any given point in time we have two options: read an input symbol and push it onto the stack (this is referred to as shifting the symbol), or find a rule $X \rightarrow w$ such that the top of the stack contains the characters of $w$ in reverse order, then pop those symbols and push $X$ in their place (this is referred to as reducing by the rule $X \rightarrow w$. But which to do: shift or reduce? And if you've decided to reduce and there are multiple grammar rules that we could reduce by given the current stack, which to choose? These are the central problems we must solve to implement bottom up parsing.

We discussed LL and LR. Look in the book to find the difference.

TOP DOWN (LL) left-to-right scan of input, leftmost derivation  

1. Start with only start symbol on stack  
2. End with empty	stack
3. At each step, we either
   a. expand by a	grammer	rule
   b. match / consume input character

Challenge: Which rule to expand by?!?!?

BOTTOM UP (LR) left-to-right scan of input, rightmost derivation  

1. Start with empty stack
2. End with only start symbol on stack
3. At each step, we either
   a. shift input character
   b. reduce by a grammar rule

Challenge: Shift or reduce? Reduce by what??

Top-down / bottom-up demo

In class we looked at a very simple grammar:

S -> T S | ;
T -> aa | bb

ADD DEMO NOTES and walked through bottom-up and top-down parsing of a simple string like bbaa;. We looked at the top-down in more detail, and what we really noticed was that at some points we had a decision to make. Namely, if you pop T from the stack, should you choose the aa right-hand side or the bb right-hand side. What we determined was that making this decision requires peeking at the next input symbol. In this case it suffices to simply peek at one token. Grammars that we can parse top-down with just one character of lookahead are called LL(1). In a grammar like this:

S -> T S | ;
T -> aa | ab

just one character of lookahead isn't enough. You need two! When two characters aren't enough, maybe you need three. When faced with a grammar that is not LL(1), we should try to rewrite the grammar so that the problem goes away. That is, we should try to write a new grammar for the same language that is LL(1). Unfortunately, this is not always possible (check "Properties of deterministic top down grammars", Rosenkrantz and Stearns). However, often it is possible for the languages we care about for programming. However, it may take some work.

In the previous example grammar, the problem is that there is a common prefix between two right-hand sides with the same left-hand side. The right-hand sides of S -> aa and S -> ab have the common prefix "a". To see how this causes problems, imagine we are trying to parse ab;. We start with stack S||, and lookahead symbol "a". Since the lookahead is not ";", we choose rule S -> T S and pop S off the stack and push T S in reverse order, leaving us with the stack T S||. Now, with T on top of the stack, we need to choose between rule T -> aa and rule T -> ab. The problem is that both rules are consistent with our lookahead of "a", so we don't know which to choose! Sometimes we can solve this by "factoring out" the common prefix. In this example, we might do it this way:

S -> T S | ;
T -> a T'
T'-> a | b

You should be able to convince yourself that this grammar defines the same language as the previous one. Moreover, this grammar is LL(1). How do I know? Because if I know what's on the top of the stack, and I know the lookahead, it's now unambiguous which rule I want to expand by. You should be able to convince yourself of this.

The other enemy of LL(1)-ness is left recursion, like in the rule from our usual arithmetic grammar:x exp -> exp OPA term. Let's consider a simple grammar with left-recursion:

S -> S @ X | X	
X -> a | b

The problem here is that if we have S on the top of the stack and the lookahead is "a" or "b", we don't know whether to expand with rule S -> S @ X or rule S -> X. Our first attempt to fix this is to turn the left-recursion into right recursion. After all, we just have @-separated lists of a's and b's. Do we really care whether the grammar generates them from left to right or right to left?

S -> X @ S | X	
X -> a | b

We've fixed the left recursion, but know we have a common prefix problem! Specifically, both rules for S have the common prefix X. So, we fix that by factoring out the common prefix:

S -> X T
T -> @ X T | epsilon	
X -> a | b

And this is now LL(1). How can I tell? Well, if S is at the top of the stack, there's only one rule. If X is at the top of the stack, the lookahead will tell us if we want the rule that produces "a" or the rule that produces "b". So what if T is at the top of the stack? Well then either the lookahead is "@", in which case I expand by rule T -> @ X T, or the lookahead is "a" or "b", in which case we have an error, or the lookahead is empty because we've come to the end of the input, in which case we expand by T -> epsilon. This solution to left recursion is really common, but usually we'd call the new non-terminal "T" something like $S_{\text{tail}}$. So the rules $S \rightarrow S @ X\ |\ X$ are replaced with $S \rightarrow X S_{\text{tail}}$, and we add rules $S_{\text{tail}} \rightarrow @ X S_{\text{tail}}\ |\ \epsilon$.

Christopher W Brown