SI413: SLR Parsing

Reading

Section 2.3 of Programming Language Pragmatics.

Homework

hw due Wed, 30 September

HW Review

We went over the HW in class, and we actually used the grammar and machine from that HW to cover this class's material: namely how to make parsing decisions using the Non-deterministic Characteristic Finite State Machine, how to convert that nondeterministic CFM into a deterministic CFM, and how to use that deterministic CFM to more efficiently make parsing decisions. When we added lookahead and the use of FOLLOW sets to mitigate shift-reduce conflicts, we ended up with the SLR(1) parsing!

Parsing with the non-deterministic characteristic finite state machine

This machine we build this way is the non-deterministic "Characteristic Finite State Machine" (CFSM). We can convert to an equivalent deterministic machine, and that is what is usually meant by CFSM.

SLR Parsing

This procedure is called "Simple LR" parsing, or SLR. You should notice that in our example we can parse any string without any lookahead (as long as we can detect when the input is done), therefore this grammar is SLR(0), i.e. SLR parsing with zero lookahead tokens suffices.

SLR(1) Parsing

More complicated grammars may not be SLR(0), so we may need lookahead. In fact, if we add multiplication and the usual prcedence, we get the grammar

S → E
E → E + T | T
T → T * F | F
F → n | (E)

which produces a deterministic CSFM that includes the following fragment:

Now, if we end up in the left state, we have some ambiguity: LR item T→T•*F tells us to shift (push a token), while LR item E→T• tells us to reduce T to E. This is called a "shift/reduce conflict". We can actually resolve this particular shift-reduce conflict by using one token's worth of lookahead. Here's how:

It's clear that if the lookahead is not *, we must reduce, since we only have an outgoing edge for *.
If we reduce by E -> T we would be commiting to a derivation in which E is followed by *. However, if we look closely at the grammar, we see that we cannot have a derivation in which an E symbol is followed immediately by a *. (Try making it happen, you can't:
```
S => E => E+T => E+T*F ... rats! there's a "+T" in between the E and the *, and +T can't go to ε 
```
... keep trying, you can't make it happen!) Therefore, when the look ahead is *, we must shift.

Of course, to make this kind of disambiguation in general, we need to be able to analyze a grammar to determine which symbols can follow which other symbols in a derivation. We refer to these as "FOLLOW sets". Specifically, with respect to grammar $G = (NT,\Sigma,R,S)$, FOLLOW$(A)$ for non-terminal $A$ is defined as $\{ x \in \Sigma\ |$ there exist strings $u$ and $v$ for which $uAxv$ is derivable in $G\}$. The First/Follow/Predict Calculator from the course resources computes follow sets for you, and it determines that FOLLOW(E) = { '+' , ')'}, which verifies the claim above that * can't follow an E in this grammar. Of course this still leaves open the question of how to compute these follow sets!

Note that in other scenarios we might have two different LR items in the same state that tell us to reduce but, of course, using different reductions. This is a "reduce/reduce conflict". Sometimes lookahead can disambiguate this kind of conflict, but not in SLR parsers.

Constructing the CFSM directly from the grammar

Section 2.3 of Programming Lanugage Pragmatics covers how to build the deterministic CFSM directly from the grammar, rather than starting with the Non-deterministic CFSM.