SI413: Top-down - Predictive and recursive-descent parsers

Reading

Section 2.3.2 of the text.

Homework

None.

Predictive parsing

The basic structure of top-down parsing for LL(1) grammars is given in the following pseudo-code: What this shows is that the only difficult part of the process is "chooseRule(X,lookahead)", i.e. you have the non-terminal X that you just popped of the stack, and the "lookahead" token, and you must choose a rule of the form $X \rightarrow w$, and "expand" by pushing the symbols in $w$ onto the stack in reverse order (so that the first symbol is on the top of the stack).

Let's consider our good friend the grammar

S -> X T
T -> @ X T
T -> epsilon
X -> a
X -> b

... where "epsilon" means the empty string. There are three non-terminals (S,T,X) and three terminals (a,b,@). To this we will add the token $, which will denote the end of input. That means there is, at least conceptually, a new rule R -> S $ where R is the "real" start symbol. We won't really use R and this new rule in derivations, but it allows us to say concretely what we mean by the end-of-input terminal $. In any event, this means that there are only 3x4 = 12 possible ways to call chooseRule. So we might as well write them down in a table and turn chooseRule into a simple table lookup. Here's the table for this grammar:

$ \begin{array}{r|cccc} & a & b & @ & \text{\$}\\ \hline S & S \rightarrow X T & S \rightarrow X T & err & err\\ T & err & err & T \rightarrow @ X T & T\rightarrow \epsilon\\ X & X \rightarrow a & X \rightarrow b & err & err \end{array} $

To show you how close this pseudo code is to a real program, I implemented a parser for this grammar in Javascript. Check out the source code. You can give it a try. Enter a string like "a@b@a" and press the enter key. Then try "a@b@@a" and see what happens.

Input:
Output:

Predict sets

The table from the previous section can be computed automatically from the grammar. In fact, we'll learn how to do that next class. For the moment, I invite you to play with the First/Follow/Predict Calculator (also linked under class resources) and copy and paste the grammar into its input box. Under "PREDICT" you'll see a listing of each of the grammar rules. Each has a colon after it, and a space-separated list of terminal symbols. These comprise the "predict set" for that grammar rule. The meaning here is this: if $x$ is in the predict set for rule $A \rightarrow w$, then if $A$ is popped off the stack and the lookahead is $x$, expanding by the rule $A \rightarrow w$ is a valid parsing move, meaning that there are inputs that continue from the lookahead character $x$ for which expanding by the rule $A \rightarrow w$ is the right parsing choice.

A grammar is LL(1) precisely when the predict sets for any two rules with the same left-hand-side are disjoint. After all, if we had two rules $A \rightarrow w_1$ and $A \rightarrow w_3$, and some terminal $x$ was in both of their predict sets, it would mean that expanding with either rule could be valid depending on what comes after the lookahead $x$. So you have to use more than that one symbol of lookahead to know which rule to use.

Recursive Descent Parsing

Recursive descent parsing is a different approach to top-down parsing for LL(1) grammars. Predictive parses and the bottom-up parsers we will describe later follow the same model you will have seen in Theory of Computing: "the stack" is a stack that holds grammar symbols, i.e. non-terminals and terminals (sometimes annotated with semantic information, like "NUM[42]" instead of just "NUM"). Recursive descent parsing uses the function call stack instead. This means that "the stack" holds function call records, not grammar symbols. We'll describe how to write one using a simple example, and then we'll discuss how it relates to predictive parsing.

Here is a grammar for a simple language for arithmetic, in which + and * are functions not operators. The tokens are NUM, ID, LP and RP. Examples of strings are: +(34 12) and *(10 +(7 2) -1). The predict sets for the grammar rules, computed with the tool from course resources, are also shown.

Grammar	Predict sets
exp -> ID LP etail exp -> NUM etail -> exp etail etail -> RP	PREDICT: exp -> ID LP etail : ID exp -> NUM : NUM etail -> exp etail : ID NUM etail -> RP : RP

We will assume that we have a function peek() that returns the next lookahead token, a function match(tok) that consumes the next token and throws an error if it doesn't match tok, and a function paerror(str) that prints an error based on str, and exits. The idea is that each non-terminal will be turned into a function that matches the next portion of the token stream to something that is generated by that non-terminal. If we name these functions after the non-terminals, the connection should be clear: A full working program consists of: rdex.h, rdex.lpp, rdex.cpp, Makefile,

To understand the relation between recrusive descent parsing and what we have previously described as top-down parsing (predictive parsing), it helps to look at an example string. Consider parsing +(10 32). The two images below depict the situation at the point at which we are just about to match NUM[10] for both a predictive parser and a recursive descent parsers. Moreover, it shows the relationship between the stack and the parse tree for the string.

What you should see, is that as our predictive parser traverses the parse tree, its stack can be viewed as a frontier of parse tree nodes. Everything above it used to be on the stack, but has already been popped off. Everything below it has never been on the stack, i.e. has yet to be visited in the traversal. What's on the stack, of course, is waiting to be matched/expanded. With the recursive descent parser, we see that the traversal represents a standard depth-first traversal, and the stack shows the current path. Unlike the predictive parser, we don't "expand" a non-terminal all at once. As you can see, for the etail on the stack in the picture, we have added the first child (exp) to the stack, but will only add the second child (etail) later, when the traversal backtracks. Crucially, in both cases, we are traversing the parse tree, which is, after all, the point of parsing. If we encountered an input that was not in the language generated by the grammar, the traversal would fail and we'd spit out an error.

Evaluation in Recursive Descent Parsing

The recursive descent parser pseudo-code above is a proper parser: it traverses the parse tree, and it let's us know whether the input string is in the language or not. However, we usually want to do more. In this example, we should be able to evaluate as we parse. For inatance, if the input is +(3 *(9 3) 5) we should not only be told that the string is valid, we should also be told that the result is 14. We do this by augmenting the functions in the recursive descent parser to return values. The tricky thing is that, because we had to break up expressions into exp + etail, we have information to pass along into e-tail. For example:

	     ____ etail() reads this
	    /    \
	+ ( 12 9 )
	\__/
         exp() reads this

So etail() only sees the numbers, it doesn't know whether it should be adding or multiplying. Thus, we need to pass the operator (+ or *) to etail as a parameter. Similarly, as etail calls itself recursively, it doesn't know what numbers have already been added or multiplied already. So we need to pass it the value accumulated already (much like the "accumulator" we used so often for tail-recursion). To handle this, the match() function will return a value too --- the numerical value for NUM tokens, the ASCII value of + or * for OP, and anything you like for LP and RP.