SI413: Making the right choice in top-down parsing

Reading

Programming Language Pragmatics section 2.3 up to 2.3.2.

Homework

hw

Big HW Review

We spent a lot of time looking at the HW, especially the last question. Give grammar:
$E \rightarrow (\ L \ )\ |\ A$
$L \rightarrow E L\ |\ \varepsilon$
$A \rightarrow n\ |\ s$
As we do a top-down parse of any given string using this grammar, we have to make decisions. The HW focused on this question: if there's an L on the top of the stack and you pop it off, should you use the rule $L \rightarrow E L$ or the rule $L \rightarrow \varepsilon$? We need to be able to make that decision, but how? The answer is that if the lookahead character is ')' we use $L \rightarrow \varepsilon$, while if it's anything else, we use $L \rightarrow E L$. This is because from $E$ we can derive strings that start with 's', 'n' and '(', but not that start with ')'. So somehow we need to be able to look at a grammar and for each non-terminal $X$ determine the terminals with which it's possible for strings derived from $X$ to start. The set of such terminals is called FIRST(x).

Starting easy EPS

We say that a non-terminal $X$ is nullible if it is possible to derive the empty string from $X$, i.e. if $X \rightarrow^* \varepsilon$. To compute the FIRST sets for non-terminals, it turns out we need to know which non-terminals are nullible. For grammar $G$. At first blush it looks like this could be difficult. After all, derivations can be complex, with recursive grammar rules all over the place. However, it turns out that there's a beautifully simple algorithm that computes this.

Algorithm: makeEPS
Input: Grammar $G$
Output: EPS - the set of non-terminals in $G$ that are nullible.

EPS := $\{ X\ |\ X \rightarrow \varepsilon \mbox{ is a rule in $G$}\}$

do {
for each rule $X \rightarrow Y_1 Y_2 \cdots Y_k$ in $G$
if $Y_1,Y_2,\ldots,Y_k$ are all in EPS, then add $X$ to EPS

} while the iteration added something new to EPS

return EPS

Try makeEPS on:

	S -> cS | UV | aT
	T -> aa|bb
	U -> aT | ε
	V -> bT | ε

It works! Great ... but does it always work? How could we prove that it always returns exactly the set of nullifiable non-terminals in $G$? Here's how: First, note that it is clear that anything we add to EPS belongs there - i.e. we only add a non-terminal $X$ to EPS if $X$ is indeed nullible. Obviously that's true for Step 1, and it's also clear for Step 2, since we only add $X$ to EPS when we know there is a rule $X \rightarrow Y_1\cdots Y_k$ for which each $Y_i$ is nullible. So what we really have to show is that nothing is missing from our final returned EPS. We need to show that during the algorithm, if there is a nullifiable non-terminal that is not currently in EPS, there must be a grammar rule $X \rightarrow Y_1\cdots Y_k$ for which each $Y_i$ is already in EPS. That way we won't terminate the outer loop while anything is missing from EPS. So let's prove that:

Theorem: if there is a nullifiable non-terminal that is not currently in EPS, there must be a grammar rule $X \rightarrow Y_1\cdots Y_k$ for which each $Y_i$ is already in EPS.
Proof: Let $T$ be a shortest parse sub-tree such that $\mbox{root}(T)$ is not in EPS, and $T$ yields $\varepsilon$, i.e. all leaves are $\varepsilon$. (To clarify, there are potentially many such trees - even infinitely many. Out of all of them, find one that is shortest - if there are ties, I don't care which. That's the one I want!) Note that $T \rightarrow \varepsilon$ cannot be a rule in $G$, since otherwise $\mbox{root}(T)$ would already be in EPS. Therefore, if we draw out the parse tree $T$, it looks like this:

Each $Y_i$ is the root of a subtree that yields $\varepsilon$, and these subtrees are all shorter than $T$. However, we chose $T$ such that it was the shortest tree that yields $\varepsilon$ and has root not already in EPS, which means that each of these $Y_i$'s must already be in EPS. And so we've proved the theorem!

Important: It will be useful to define the function $\mbox{eps}(w)$ for string $w$ to return true if $w = \varepsilon$ or if all symbols in $w$ are in EPS as returned by makeEPS.

A little more complex - FIRST

As noted earlier, we would like to be able to find "FIRST($X$)" for any grammar symbol $X$, which is the set of terminals that appear as the first character in at least one string derived from $X$. I.e. terminal $c$ is in FIRST($X$) if and only if $X \rightarrow^* cw$ for some string $w$. This is relevant, because if $X$ is at the top of the stack, and I'm considering applying rule $X \rightarrow Y$, then lookahead better be in FIRST($Y$). [Note that if $Y$ is nullible, the story is a bit more complicated.]

Algorithm: makeFIRST
Input: Grammar $G$, and EPS, the result of makeEPS($G$)
Output: FIRST - a map that maps a grammar symbol $X$ to the set of terminals that appear as the first character in at least one string derived from $X$. I.e. terminal $c$ is in FIRST($X$) if and only if $X \rightarrow^* cw$ for some string $w$.

for all grammar symbols x, FIRST(x) := {x} if x is a terminal, and { } if it is a non-terminal

do {
for each rule $X \rightarrow Y_1 Y_2 \cdots Y_k$ in $G$
for $i$ from $1$ to $k$
FIRST($X$) := FIRST($X$) $\cup$ FIRST($Y_i$)
if $Y_i \notin \mbox{EPS}$ break out the inner for loop

} while the iteration added something new to FIRST

return FIRST

Important: It will be useful to define the function $\mbox{first}(w)$ that returns the set of all all terminals that could be the first terminal in a derivation that starts with $w$. I.e. first($w$) contains c if $w \rightarrow^* cv$ for some string $v$. If $w$ is the empty string, this is { }. otherwise, if $w = w_1 w_2 \cdots w_k$ we compute as

P = { }
for $i$ from 1 to $k$
P := P $\cup$ FIRST($w_i$)
if $w_i \notin \mbox{EPS}$ break
return P

Still more - FOLLOW

It turns out that sometimes we're interested in what symbols could follow some non-terminal $X$ in a derivation starting from the start symbol. So, if I have partial derivation $S \rightarrow^* uXv$, where $S$ is the start symbol, what could show up at the beginning of $v$? We'd like to define a map "FOLLOW($X$)" that would tell us exactly that for each non-terminal.

Algorithm: makeFOLLOW
Input: Grammar $G$, EPS - the result of makeEPS($G$), and FIRST - the result of makeFIRST($G$)
Output: FOLLOW - a map that maps a grammar symbol $X$ to the set of terminals that appear as the first character in at least one string derived from $X$. I.e. terminal $c$ is in FIRST($X$) if and only if $X \rightarrow^* cw$ for some string $w$.

for all non-terminal symbols X, FOLLOW(X) = { }

do {
for each rule $X \rightarrow Y_1 Y_2 \cdots Y_k$ in $G$
for $i$ from $1$ to $k-1$
if $Y_i$ is a non-terminal then FOLLOW($Y_i$) := FOLLOW($Y_i$) $\cup$ FIRST($Y_{i+1}$)

for $i$ from $k$ down to $1$
if $Y_i$ is a non-terminal FOLLOW($Y_i$) := FOLLOW($Y_i$) $\cup$ FOLLOW($X$)
if $Y_i \notin \mbox{EPS}$ break out of inner for loop

} while the iteration added something new to FOLLOW

return FOLLOW

The payoff ... PREDICT

Finally the thing we really want ... PREDICT. PREDICT is a map that will tell us which grammar rules to use when we do top-down parsing. In particular: Suppose $S$ is the start symbol $S \rightarrow^* uXv$, where $X$ is a non-terminal. The point is that if that $X$ is on the top of the stack in a top-down parse it must mean that whatever string $u$ ultimately produced has already been read in. So either the lookahead should be in FIRST($X$) or, if $X \in$ EPS, the lookahead could also be in FOLLOW($X$). For any rule $X \rightarrow w$ in the grammar, its predict set is defined as:

PREDICT($X \rightarrow w$) := first($w$) if $\mbox{eps}(w) = \mbox{false}$, and first($w$) $\cup$ FOLLOW($X$) otherwise

Thus, each grammar rule gets its own predict set. As long as every pair of rules with the same left-hand side non-terminal have non-intersecting predict sets, the grammar is LL(1) and we can do top-down parsing.