Reading

Section 2.3 up to 2.3.1 of Programming Language Pragmatics.

Homework

Do the hw.

Reprising scanning

Recall that if we're writing a scanner by hand, we'll usually draw out a deterministic finite automaton (DFA) that recognizes the tokens, then implement that automaton in code ... perhaps with nested switch/case statements. If we make a language change that adds, removes, or modifies tokens, it may have a big effect on the DFA, and thus require substantial rewriting of the code that implements the DFA.
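A hand-coded scanner in this style might look like the following sketch (Python rather than C, for brevity; the token set --- identifiers and integer literals --- is invented for illustration). Each branch of the code corresponds to a state of the DFA, which is exactly why changing the token set means rewriting the branches:

```python
# Hand-coded DFA sketch for a made-up token set: identifiers and integers.
# Each branch below plays the role of a DFA state; adding or changing a
# token kind means reworking these nested branches by hand.

def scan(text):
    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        if c.isspace():                 # start state: skip whitespace
            i += 1
        elif c.isalpha():               # state: inside an identifier
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            tokens.append(("IDENT", text[start:i]))
        elif c.isdigit():               # state: inside an integer literal
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            tokens.append(("NUMBER", text[start:i]))
        else:
            raise ValueError(f"unexpected character {c!r}")
    return tokens
```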

Tools like flex work a different way. As users we specify tokens using regular expressions. Then flex builds an FA for each regular expression and merges all the resulting FAs. The merged FA is converted into a DFA, which is then minimized. This is then turned into code --- usually with a state-character table lookup. The advantage, as I hope you saw in the lab, is that changes in the language that add or remove or modify tokens usually require only minor changes to a flex file. The major changes to the DFA that result are all handled automatically.
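The table-lookup idea can be sketched as follows (a hypothetical Python driver, not what flex actually emits --- flex generates C --- but the same structure: the DFA lives in a table, and the driver loop is generic, so a token change only changes table entries):

```python
# Table-driven DFA sketch: the same made-up tokens (identifiers, integers),
# but now the DFA is data, not control flow. States and character classes
# are invented for illustration.

START, IN_IDENT, IN_NUM = 0, 1, 2

def char_class(c):
    if c.isalpha(): return "letter"
    if c.isdigit(): return "digit"
    if c.isspace(): return "space"
    return "other"

# TABLE[state][char_class] -> next state; a missing entry means "no move".
TABLE = {
    START:    {"letter": IN_IDENT, "digit": IN_NUM},
    IN_IDENT: {"letter": IN_IDENT, "digit": IN_IDENT},
    IN_NUM:   {"digit": IN_NUM},
}
ACCEPT = {IN_IDENT: "IDENT", IN_NUM: "NUMBER"}

def table_scan(text):
    tokens, i, n = [], 0, len(text)
    while i < n:
        if text[i].isspace():           # skip whitespace between tokens
            i += 1
            continue
        state, j, last_accept = START, i, None
        while j < n:                    # run the DFA as far as it will go
            nxt = TABLE[state].get(char_class(text[j]))
            if nxt is None:
                break
            state, j = nxt, j + 1
            if state in ACCEPT:
                last_accept = (state, j)    # longest match so far
        if last_accept is None:
            raise ValueError(f"scan error at position {i}")
        state, end = last_accept
        tokens.append((ACCEPT[state], text[i:end]))
        i = end
    return tokens
```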

Parsing

Parsing is the process of taking a grammar and a string of tokens and producing, either implicitly or explicitly, a derivation for the token string in the grammar. Of course, if no such derivation is possible, we have a parse error. This is not an entirely new problem for us. Recognizing strings in a language defined by a context-free grammar was something we said in Theory class we could do with a push-down automaton, i.e. an NDFA plus a stack. Of course, in practice we can't really accept non-determinism, so our struggle will be to use deterministic control along with a stack.

There are two basic approaches to parsing: top-down or bottom-up. The primary difference is how they use the stack. In bottom-up parsing, which we discussed in Theory, symbols are pushed onto a stack and, when we recognize the right-hand side of a grammar rule on top of the stack (in reverse order, actually), we pop those symbols off the stack and push the non-terminal symbol on the left-hand side of the rule. In top-down parsing we pop a non-terminal off the stack, look for a rule with that symbol on the left-hand side, then push the symbols comprising the right-hand side of that rule onto the stack (in reverse order, actually). If a bottom-up parser successfully parses a string, we should finish with the start symbol on the stack. When a top-down parser parses a string it begins with the start symbol on the stack. Consider the grammar $A \rightarrow abc$:

Top down:

input: abc        stack:         A||
                                     # pop A and push a b c in reverse order
       abc        stack:     a b c||
                                     # read a and pop a
        bc        stack:       b c||
                                     # read b and pop b
         c        stack:         c||
                                     # read c and pop c
                  stack:          ||

Bottom up:

input: abc        stack:          ||
                                     # read a and push a
        bc        stack:         a||
                                     # read b and push b
         c        stack:       b a||
                                     # read c and push c
                  stack:     c b a||
                                     # pop a b c in reverse order and push A
                  stack:         A||
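The two traces above can be simulated directly. Here's a Python sketch for just this one-rule grammar (a toy, not a general parser --- a real parser would look the right-hand side up in the grammar rather than hard-coding it):

```python
# Toy simulations of the two stack disciplines for the grammar A -> abc.

def top_down(tokens):
    stack = ["A"]                       # begin with the start symbol
    i = 0
    while stack:
        top = stack.pop()
        if top == "A":                  # expand: push the RHS in reverse order
            stack.extend(reversed(["a", "b", "c"]))
        elif i < len(tokens) and top == tokens[i]:
            i += 1                      # terminal on top matches input: consume both
        else:
            return False                # mismatch or input exhausted
    return i == len(tokens)

def bottom_up(tokens):
    stack = []
    for t in tokens:
        stack.append(t)                 # shift the token onto the stack
        if stack[-3:] == ["a", "b", "c"]:
            del stack[-3:]              # RHS on top of stack: reduce...
            stack.append("A")           # ...replacing it with the LHS
    return stack == ["A"]               # finish with the start symbol
```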

When we have more than one rule in the grammar things get more interesting.

We discussed LL and LR. Look in the book to find the difference.

Top-down / bottom-up demo

In class we looked at a very simple grammar:
S -> T S | ;
T -> aa | bb
	
We walked through bottom-up and top-down parsing of a simple string like bbaa;. We looked at the top-down parse in more detail, and what we really noticed was that at some points we had a decision to make. Namely, if you pop T from the stack, should you choose the aa right-hand side or the bb right-hand side? What we determined was that making this decision requires peeking at the next input symbol. In this case it suffices to peek at just one token. Grammars that we can parse top-down with just one token of lookahead are called LL(1).
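Here's a recursive-descent sketch of that LL(1) decision in Python (the helper names parse_S, parse_T, etc. are my own): one procedure per non-terminal, and peeking at a single character decides between the aa and bb right-hand sides of T.

```python
# Recursive-descent sketch for   S -> T S | ;     T -> aa | bb
# One character of lookahead (peek) is enough to choose a rule: LL(1).

def parse(s):
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def eat(c):
        nonlocal pos
        if peek() != c:
            raise SyntaxError(f"expected {c!r} at position {pos}")
        pos += 1

    def parse_S():
        if peek() == ";":               # S -> ;
            eat(";")
        else:                           # S -> T S
            parse_T()
            parse_S()

    def parse_T():
        if peek() == "a":               # the one-character lookahead decision
            eat("a"); eat("a")          # T -> aa
        else:
            eat("b"); eat("b")          # T -> bb

    parse_S()
    if pos != len(s):
        raise SyntaxError("trailing input")

parse("bbaa;")                          # the string from class parses fine
```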
S -> T S | ;
T -> aa | ab
just one character of lookahead isn't enough. You need two! When two characters aren't enough, maybe you need three. In general, though, what we should try to do is rewrite the grammar so that the problem goes away. Here the problem is that there is a common prefix between two right-hand sides with the same left-hand side. The other enemy of LL(1)-ness is left recursion, like exp -> exp OPA term. At any rate, we can rewrite the above grammar as
S -> T S | ;
T -> aX
X -> a | b
... which generates the same language, but which is LL(1).
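A recursive-descent sketch of the factored grammar (Python, with hypothetical helper names) shows why the problem goes away: parse_T no longer has a decision to make, and the choice in X needs only one character of lookahead.

```python
# Recursive-descent sketch for the left-factored grammar
#   S -> T S | ;     T -> a X     X -> a | b
# After factoring out the common prefix, one character of lookahead suffices.

def parse(s):
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def eat(c):
        nonlocal pos
        if peek() != c:
            raise SyntaxError(f"expected {c!r} at position {pos}")
        pos += 1

    def parse_S():
        if peek() == ";":               # S -> ;
            eat(";")
        else:                           # S -> T S
            parse_T()
            parse_S()

    def parse_T():                      # T -> a X : no decision needed
        eat("a")
        parse_X()

    def parse_X():                      # X -> a | b : one peek decides
        if peek() == "a":
            eat("a")
        else:
            eat("b")

    parse_S()
    if pos != len(s):
        raise SyntaxError("trailing input")

parse("abaa;")
```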


Christopher W Brown