Reading

Programming Language Pragmatics section 2.3 up to 2.3.2.

Homework

hw due Wed, 30 September

Review from earlier homework

When we reviewed the homework from the "Top Down / Bottom Up" lesson, we spent a lot of time looking at the last question. Given grammar:
$E \rightarrow (\ L \ )\ |\ A$
$L \rightarrow E L\ |\ \varepsilon$
$A \rightarrow n\ |\ s$
... as we do a top-down parse of any given string using this grammar, we have to make decisions. The HW focused on this question: if there's an L on the top of the stack and you pop it off, should you use the rule $L \rightarrow E L$ or the rule $L \rightarrow \varepsilon$? We need to be able to make that decision, but how? The answer is that if the lookahead character is ')' we use $L \rightarrow \varepsilon$, while if it's anything else, we use $L \rightarrow E L$. This is because from $E$ we can derive strings that start with 's', 'n' and '(', but none that start with ')'. So somehow we need to be able to look at a grammar and, for each non-terminal $X$, determine the terminals with which it's possible for strings derived from $X$ to start. The set of such terminals is called FIRST($X$). We need to analyze grammars to find things like the FIRST sets, the PREDICT sets and the FOLLOW sets. That's what we're covering this class.

Starting easy - EPS

We say that a non-terminal $X$ is nullable if it is possible to derive the empty string from $X$, i.e. if $X \Rightarrow^* \varepsilon$. To compute the FIRST sets for non-terminals, it turns out we need to know which non-terminals are nullable. At first blush it looks like this could be difficult. After all, derivations can be complex, with recursive grammar rules all over the place. However, it turns out that there's a beautifully simple algorithm that computes this.
Algorithm: makeEPS
Input: Grammar $G$
Output: EPS - the set of non-terminals in $G$ that are nullable.
  1. EPS := $\{ X\ |\ X \rightarrow \varepsilon \mbox{ is a rule in $G$}\}$
  2. do {
    for each rule $X \rightarrow Y_1 Y_2 \cdots Y_k$ in $G$
    if $Y_1,Y_2,\ldots,Y_k$ are all in EPS, then add $X$ to EPS
    } while the iteration added something new to EPS
  3. return EPS
Try makeEPS on:
	S -> cS | UV | aT
	T -> aa|bb
	U -> aT | ε
	V -> bT | ε
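Here is a minimal Python sketch of makeEPS run on that sample grammar. The dict-of-lists grammar encoding is my own convention, not from the notes: upper-case letters are non-terminals, and `[]` encodes an $\varepsilon$ right-hand side.

```python
# Sample grammar from above: S -> cS | UV | aT, T -> aa | bb,
# U -> aT | eps, V -> bT | eps.
GRAMMAR = {
    "S": [["c", "S"], ["U", "V"], ["a", "T"]],
    "T": [["a", "a"], ["b", "b"]],
    "U": [["a", "T"], []],  # [] is the eps right-hand side
    "V": [["b", "T"], []],
}

def make_eps(grammar):
    # Step 1: non-terminals with an explicit X -> eps rule.
    eps = {x for x, rules in grammar.items() if [] in rules}
    # Step 2: keep adding X whenever some rule X -> Y1...Yk has
    # every Yi already in EPS; stop when a pass adds nothing new.
    changed = True
    while changed:
        changed = False
        for x, rules in grammar.items():
            for rhs in rules:
                if x not in eps and all(y in eps for y in rhs):
                    eps.add(x)
                    changed = True
    return eps

print(make_eps(GRAMMAR))  # S is nullable via S -> UV, so S, U and V
```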
      
It works! Great ... but does it always work? How could we prove that it always returns exactly the set of nullable non-terminals in $G$? Here's how: First, note that it is clear that anything we add to EPS belongs there - i.e. we only add a non-terminal $X$ to EPS if $X$ is indeed nullable. Obviously that's true for Step 1, and it's also clear for Step 2, since we only add $X$ to EPS when we know there is a rule $X \rightarrow Y_1\cdots Y_k$ for which each $Y_i$ is nullable. In fact (not to sow panic amongst my students), "if $X \in $ EPS then $X$ is nullable" is a loop invariant for both the inner and outer loop.

Since "if $X \in $ EPS then $X$ is nullable" is a loop invariant, when we leave the outer loop, we know that every $X$ in EPS is nullable. So, what we really have to show is that nothing is missing from our final returned EPS, i.e. if $X$ is nullable then $X \in$ EPS. To do this, we will show that during the algorithm, if there is a nullable non-terminal that is not currently in EPS, there must be a grammar rule $X \rightarrow Y_1\cdots Y_k$ such that each $Y_i$ is already in EPS, but $X$ is not in EPS. That way we won't terminate the outer loop while anything is missing from EPS. So let's prove that:

Theorem: if there is a nullable non-terminal that is not currently in EPS, there is a grammar rule $X \rightarrow Y_1\cdots Y_k$ for which each $Y_i$ is already in EPS but $X$ is not.
Proof: Let $T$ be a shortest parse sub-tree that yields $\varepsilon$, i.e. all of its leaves are $\varepsilon$, and whose root, which we will call $X_T$, is not in EPS. (To clarify, there are potentially many such trees - even infinitely many. Out of all of them, find one that is shortest - if there are ties, I don't care which. That's the one I want!) Note that $X_T \rightarrow \varepsilon$ cannot be a rule in $G$, since otherwise $X_T$ would already be in EPS from Step 1. Therefore, if we draw out the parse tree $T$, its root $X_T$ has children $Y_1, Y_2, \ldots, Y_k$ given by some grammar rule $X_T \rightarrow Y_1 Y_2 \cdots Y_k$.

Each $Y_i$ is the root of a subtree that yields $\varepsilon$, and these subtrees are all shorter than $T$. However, we chose $T$ such that it was the shortest tree that yields $\varepsilon$ and has root not already in EPS, which means that each of these $Y_i$'s must already be in EPS. So $X_T \rightarrow Y_1 \cdots Y_k$ is the grammar rule the theorem said would exist, and we are done.

Important: It will be useful to define the function $\mbox{eps}(w)$ for string $w$ to return true if $w = \varepsilon$ or if all symbols in $w$ are in EPS as returned by makeEPS.
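As a sketch, for the sample grammar above (whose EPS set is $\{S, U, V\}$), $\mbox{eps}(w)$ is a one-liner in Python. Treating a string as a sequence of one-character grammar symbols is an assumption of mine that happens to work for these toy grammars.

```python
# EPS set that makeEPS returns for the sample grammar above.
EPS = {"S", "U", "V"}

def eps(w):
    # True if w is empty or every symbol of w is in EPS.
    # (all() over an empty sequence is True, covering w = "".)
    return all(sym in EPS for sym in w)

print(eps(""), eps("UV"), eps("aT"))  # True True False
```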

A little more complex - FIRST

As noted earlier, we would like to be able to find "FIRST($X$)" for any grammar symbol $X$, which is the set of terminals that appear as the first character in at least one string derived from $X$. I.e. terminal $c$ is in FIRST($X$) if and only if $X \Rightarrow^* cw$ for some string $w$. This is relevant, because if $X$ is at the top of the stack, and I'm considering applying rule $X \rightarrow Y$, then the lookahead had better be in FIRST($Y$). [Note that if $Y$ is nullable, the story is a bit more complicated.]
Algorithm: makeFIRST
Input: Grammar $G$, and EPS, the result of makeEPS($G$)
Output: FIRST - a map that maps a grammar symbol $X$ to the set of terminals that appear as the first character in at least one string derived from $X$. I.e. terminal $c$ is in FIRST($X$) if and only if $X \Rightarrow^* cw$ for some string $w$.
  1. for all grammar symbols $X$, FIRST($X$) := $\{X\}$ if $X$ is a terminal, and { } if $X$ is a non-terminal
  2. do {
    for each rule $X \rightarrow Y_1 Y_2 \cdots Y_k$ in $G$
    for $i$ from $1$ to $k$
    FIRST($X$) := FIRST($X$) $\cup$ FIRST($Y_i$)
    if $Y_i \notin \mbox{EPS}$ break out of the inner for loop
    } while the iteration added something new to FIRST
  3. return FIRST

It's good to try working through the algorithm by hand on a sample grammar. Even one as simple as the example above will do.
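Here is a Python sketch of makeFIRST run on the sample grammar from the EPS section, with makeEPS included so the snippet stands alone. Detecting terminals as the symbols that never appear on a left-hand side is my own convention, not something the notes specify.

```python
GRAMMAR = {
    "S": [["c", "S"], ["U", "V"], ["a", "T"]],
    "T": [["a", "a"], ["b", "b"]],
    "U": [["a", "T"], []],  # [] is the eps right-hand side
    "V": [["b", "T"], []],
}

def make_eps(grammar):
    eps = {x for x, rules in grammar.items() if [] in rules}
    changed = True
    while changed:
        changed = False
        for x, rules in grammar.items():
            for rhs in rules:
                if x not in eps and all(y in eps for y in rhs):
                    eps.add(x)
                    changed = True
    return eps

def make_first(grammar, eps):
    # Step 1: FIRST(c) = {c} for terminals (symbols with no rules),
    # and FIRST(X) = {} for non-terminals.
    first = {x: set() for x in grammar}
    for rules in grammar.values():
        for rhs in rules:
            for sym in rhs:
                if sym not in grammar:
                    first[sym] = {sym}
    # Step 2: for each rule X -> Y1...Yk, absorb FIRST(Yi) into
    # FIRST(X), stopping at the first non-nullable Yi.
    changed = True
    while changed:
        changed = False
        for x, rules in grammar.items():
            for rhs in rules:
                for y in rhs:
                    if not first[y] <= first[x]:
                        first[x] |= first[y]
                        changed = True
                    if y not in eps:
                        break
    return first

eps_set = make_eps(GRAMMAR)
first = make_first(GRAMMAR, eps_set)
print(sorted(first["S"]))  # ['a', 'b', 'c']: c from cS, a and b via UV
```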

Now for the big question: How do we know that this algorithm is correct? I.e. that "terminal $c$ is in FIRST($X$) if and only if $X \Rightarrow^* cw$ for some string $w$"? It should be clear that if we add $c$ to FIRST($X$) in the innermost loop, then there must be a derivation "$X \Rightarrow^* cw$ for some string $w$". In fact, we can show that to be an invariant for all three loops. So what remains to be proven is that if there is a non-terminal $X$ and terminal $c$ such that $X \Rightarrow^* cw$ for some string $w$, but $c \notin $ FIRST($X$) when we start another iteration of the do-while loop, then something new will be added during that iteration. That will show that we cannot exit the do-while until FIRST is exactly what we need it to be.

Theorem: If there is a non-terminal $X$ and terminal $c$ such that $c \notin$ FIRST[$X$] but $X \Rightarrow^*_G cu$ for some $u$, then the next pass through the do-while loop will add $c$ to FIRST[$X'$] for some non-terminal $X'$ that doesn't currently have $c$ in its FIRST set.
Proof: Consider a sub-parsetree rooted at $X$ that represents a derivation in which $c$ is the first symbol in the derived string of terminals. Consider the path from the $X$ at the root to the leftmost leaf labelled $c$ (i.e. the first terminal symbol). For each non-terminal $X'$ along that path, the subtree with that non-terminal as its root represents a derivation $X' \Rightarrow^*_G c u$, for some $u$. So for each such $X'$, $c$ is supposed to be in FIRST[$X'$] when the algorithm returns. Let us fix $X'$ as the deepest vertex on that path such that $c \notin$ FIRST[$X'$] at the current point in the algorithm. Denote the children of $X'$ as $Y_1, Y_2, \ldots, Y_k$, and let $Y_j$ be the next vertex down the path. Then $Y_1,\ldots,Y_{j-1}$ are all nullable, and thus in EPS, and, at the current point in the algorithm, $c \in $ FIRST[$Y_j$] (otherwise $X'$ would not be the deepest vertex as we assumed). Thus, when the algorithm considers the rule $X' \rightarrow Y_1 Y_2 \cdots Y_k$, $c$ will be added to FIRST[$X'$]. And so the theorem is proved.

Important: It will be useful to define the function $\mbox{first}(w)$, where $w$ is a string of terminals and non-terminals, that returns the set of all terminals that could be the first terminal in a derivation that starts with $w$. I.e. first($w$) contains $c$ if $w \Rightarrow^* cv$ for some string $v$. If $w$ is the empty string, this is { }; otherwise, if $w = w_1 w_2 \cdots w_k$, we compute it as
Algorithm: first$(w_1 w_2 \cdots w_k)$
P = { }
for $i$ from 1 to $k$
P := P $\cup$ FIRST($w_i$)
if $w_i \notin \mbox{EPS}$ break
return P
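The first($w$) helper can be sketched in Python using the EPS and FIRST sets that makeEPS/makeFIRST produce for the sample grammar from earlier in the notes (hardcoded here so the snippet stands alone).

```python
# EPS and FIRST for the sample grammar S -> cS | UV | aT, etc.
EPS = {"S", "U", "V"}
FIRST = {"S": {"a", "b", "c"}, "T": {"a", "b"}, "U": {"a"}, "V": {"b"},
         "a": {"a"}, "b": {"b"}, "c": {"c"}}

def first(w):
    p = set()
    for sym in w:
        p |= FIRST[sym]
        if sym not in EPS:  # a non-nullable symbol hides everything after it
            break
    return p

print(sorted(first("UV")))  # ['a', 'b']: U is nullable, so V's b can be first
print(sorted(first("cS")))  # ['c']: c is not nullable, so S is never reached
```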

FOLLOW - what we need for bottom-up parsing

It turns out that sometimes we're interested in what terminal symbols could follow some non-terminal $X$ in a derivation starting from the start symbol. So, if I have partial derivation $S \Rightarrow^* uXv$, where $S$ is the start symbol, what could show up at the beginning of $v$? We'd like to define a map "FOLLOW($X$)" that would tell us exactly that for each non-terminal.
Algorithm: makeFOLLOW
Input: Grammar $G$, EPS - the result of makeEPS($G$), and FIRST - the result of makeFIRST($G$)
Output: FOLLOW - a map that maps a non-terminal grammar symbol $X$ to the set of terminal symbols that can appear in a string derived from $G$ as the very next symbol after a substring derived from $X$. In other words, $a \in \text{FOLLOW}(X)$ if and only if $S \Rightarrow_G^* uXav$ for some strings $u$ and $v$.
  1. for all non-terminal symbols X, FOLLOW(X) = { }
  2. do {
    for each rule $X \rightarrow Y_1 Y_2 \cdots Y_k$ in $G$
    for $i$ from $1$ to $k-1$
    if $Y_i$ is a non-terminal then FOLLOW($Y_i$) := FOLLOW($Y_i$) $\cup$ first($Y_{i+1} Y_{i+2} \cdots Y_k$)
    for $i$ from $k$ down to $1$
    if $Y_i$ is a non-terminal FOLLOW($Y_i$) := FOLLOW($Y_i$) $\cup$ FOLLOW($X$)
    if $Y_i \notin \mbox{EPS}$ break out of the inner for loop
    } while the iteration added something new to FOLLOW
  3. return FOLLOW
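The two inner passes can be sketched in Python for the sample grammar, reusing the EPS and FIRST sets computed earlier (hardcoded so the snippet stands alone). Note that, as in the pseudocode above, there is no end-of-input marker, so FOLLOW of the start symbol comes out empty here.

```python
GRAMMAR = {
    "S": [["c", "S"], ["U", "V"], ["a", "T"]],
    "T": [["a", "a"], ["b", "b"]],
    "U": [["a", "T"], []],
    "V": [["b", "T"], []],
}
EPS = {"S", "U", "V"}
FIRST = {"S": {"a", "b", "c"}, "T": {"a", "b"}, "U": {"a"}, "V": {"b"},
         "a": {"a"}, "b": {"b"}, "c": {"c"}}

def first(w):
    p = set()
    for sym in w:
        p |= FIRST[sym]
        if sym not in EPS:
            break
    return p

def make_follow(grammar):
    follow = {x: set() for x in grammar}
    changed = True
    while changed:
        changed = False
        for x, rules in grammar.items():
            for rhs in rules:
                # Pass 1: whatever can start the rest of the rule
                # can follow Y_i.
                for i, y in enumerate(rhs[:-1]):
                    if y in grammar:
                        new = first(rhs[i + 1:]) - follow[y]
                        if new:
                            follow[y] |= new
                            changed = True
                # Pass 2: FOLLOW(X) reaches Y_i through a nullable suffix.
                for y in reversed(rhs):
                    if y in grammar and not follow[x] <= follow[y]:
                        follow[y] |= follow[x]
                        changed = True
                    if y not in EPS:
                        break
    return follow

follow = make_follow(GRAMMAR)
print(sorted(follow["U"]))  # ['b']: from S -> UV, first(V) can follow U
```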

PREDICT - what we need for top-down parsing

Finally the thing we really want for top-down parsing ... PREDICT. PREDICT is a map that will tell us which grammar rules to use when we do top-down parsing. In particular: Suppose $S$ is the start symbol and $S \Rightarrow^* uXv$, where $X$ is a non-terminal. The point is that if that $X$ is on the top of the stack in a top-down parse, it must mean that whatever string $u$ ultimately produced has already been read in. So either the lookahead should be in FIRST($X$) or, if $X \in$ EPS, the lookahead could also be in FOLLOW($X$). For any rule $X \rightarrow w$ in the grammar, its predict set is defined as:
PREDICT($X \rightarrow w$) := first($w$) if $\mbox{eps}(w) = \mbox{false}$, and first($w$) $\cup$ FOLLOW($X$) otherwise
Thus, each grammar rule gets its own predict set. As long as any two rules with the same left-hand side non-terminal have non-intersecting predict sets, the grammar is LL(1) and we can do top-down parsing.
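Tying this back to the homework review: here is a sketch of the PREDICT computation for the grammar $E \rightarrow (\,L\,)\ |\ A$, $L \rightarrow EL\ |\ \varepsilon$, $A \rightarrow n\ |\ s$. The EPS/FIRST/FOLLOW values are hardcoded from working that grammar by hand, so the snippet stands alone; symbols are single characters.

```python
EPS = {"L"}  # only L is nullable
FIRST = {"E": {"(", "n", "s"}, "L": {"(", "n", "s"}, "A": {"n", "s"},
         "(": {"("}, ")": {")"}, "n": {"n"}, "s": {"s"}}
FOLLOW = {"E": {"(", "n", "s", ")"}, "L": {")"}, "A": {"(", "n", "s", ")"}}

def eps(w):
    return all(sym in EPS for sym in w)

def first(w):
    p = set()
    for sym in w:
        p |= FIRST[sym]
        if sym not in EPS:
            break
    return p

def predict(lhs, rhs):
    # PREDICT(X -> w) = first(w), plus FOLLOW(X) when w can vanish.
    return first(rhs) | (FOLLOW[lhs] if eps(rhs) else set())

# The two L-rules get disjoint predict sets, so the choice the homework
# asked about is decided by one symbol of lookahead: ')' picks L -> eps,
# anything in FIRST(E) picks L -> EL. The grammar is LL(1).
print(sorted(predict("L", "EL")))  # ['(', 'n', 's']
print(sorted(predict("L", "")))    # [')']
```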

Christopher W Brown