SI413: Bottom Up Parsing / CFSM

Reading

Section 2.3.3 of Programming Language Pragmatics.

Homework

hw

Getting a handle on the stack in bottom up parsing

The Characteristic Finite State Machine (nondeterministic)

A rule with the • in it is called an "LR item". Think of them as a goal: E→•T means "I'm trying to read E by reading T", E→E+•T means "I'm trying to read E by reading E+T and I've already read E+". Each transisition represents moving to a subgoal that helps achive the previous goal in a way consistent with the grammar. Here are a few things to note about the way the LR items "transition" in our example.

Note 1: Continuing with the parse tree picture from above, as we go down the parse tree we "transition" from an LR item A→w1•Bw2 to LR item B→•u.

Example 1A	Example 1B
S → •E transition to E → •E+T - - - --- A B B u (w1 and w2 are ε)	E → E+•T transition to T → •n - -- - - - A w1 B B u (w2 is ε)

Note 2: Continuing with the parse tree picture from above, as we go accross the parse tree (sibling to sibling) we "transition" from an LR item A→w1•xw2 to LR item A→w1x•w2.

Example 2A	Example 2B
E → E•+T transition to E → E+•T - / - \ - -- - A w1 x w2 A w1x w2	E → •E+T transition to E → E•+T - --- - - -- A xw2 A x w2 (w1 is ε)

This ought to give you an idea: since we "transition" from one LR item to the next, maybe we can construct a finite automaton whose states are the LR items from the grammar, and whose transitions correspond to Note 1 and Note 2. We call this the Charactestic Finite State Machine (CFSM), and since there are ε-transitions this will be a non-deterministic machine.

Note: One technical detail ... You'll notice that example grammar we have has a single rule with the start symbol on the left-hand side. We will actually require that for our construction. In practice, we usually take the original grammar (let's say with start symbols S) and add a rule S'→S$, where S' is a new start symbol, and $ represents the end of the input (could be EOF could be a newline, or something else ... depends on the situation).

Algorithm: ND-CFSM-Construct
Input: grammar G = (NT,Σ,R,S) such that only one rule has S as LHS.
Output: NDFA M = (Q,Σ∪{•},Δ,Q)
where
Q = { A→w1•w2 | A→w1w2 ∈ R} ← i.e. Q is the set of LR items from G
Δ = { (A→w1•Bw2,ε,B→•u) | A→w1Bw2 ∈ R and B→u ∈ R } ∪ { (A→w1•xw2,x,A→w1x•w2) | A→w1xw2 ∈ R and x ∈ Σ ∪ NT } ← i.e. transitions from Note 1 and Note 2

This is actually a super simple algorithm. For example, if we start with the grammar from above:

	  S -> E
	  E -> E+T | T
	  T -> n

we build the machine as follows (remember, all states are accepting states!):

List the LR items for each rule.	Make those LR items the states.	Add the "Note 2" transitions.	Add the "Note 1" transitions.

Important! The big claim (and this is something that isn't obvious and certainly requires proof at some point) is that The NDFA produced by the algorithm accepts exactly the language of all stacks (from bottom to top) that appear in successful bottom-up parses of strings in L(G).