SI413: LR parsing & precedence/associativity

Parse trees for programs

Consider a simple programing language with ifs, while-loops, variable assignment, boolean and comparison operators, and arithmetic. Below we give the language's grammar and a simple example program:

Grammar for simple language prog1.txt

res: block | stmt

block: LB stmtlist RB 

stmtlist: stmtlist stmt | stmt

stmt: ID ASN exp STMTTERM 
    | IF LP exp RP block
    | IF LP exp RP block ELSE block
    | WHILE LP exp RP block
    | READ ID STMTTERM
    | WRITE exp STMTTERM

exp: exp OROP aexp | aexp
aexp: aexp ANDOP nexp | nexp
nexp: NOT cexp | cexp
cexp: cexp COMPOP sexp | sexp
sexp: sexp OPA term | term
term: term OPM factor | sfactor 
sfactor: OPA factor | factor
factor: NUM | ID | LP exp RP | BCONST

# Computes n!
{
  read n;
  f := 1;
  while(n > 0)
  {
     f := f * n;
     n := n - 1;
  }
  write f;
}

Notice that this grammar encodes the precendece and associativity rules for OR, AND, NOT <, > etc.,+/-, and */div. This is done it the usual way. That forces a certain amount of complexity in the grammar. Take a look at the parse tree for prog1.txt. (It's postscript, so view with the "evince" program.) You should be able to read off the predecences here.

What has lowest precedence?
Draw the parentheses implicit in
```
not  3  +  4  *  5  <  6  -  x  and  y  <  0  or  false
```
due to operator precedences.
Consider
```
3 < 4 < 5
	    
```
This is syntactically valid for an cexp. How could I modify the grammar so that this would not be syntactically valid?

Specifying precedence and associativity for LR parsers

What's wrong with the grammmar

exp: exp OPA exp | exp OPM exp | NUM | LP exp RP

for arithmetic expressions? It's ambiguous, which means that there's more than one parse tree for the same string. The unambiguous grammar for this language that we've used in the past is more complex, but it's unambiguous, and the unique parse tree it yields respects our associativity and precedence requirements. However, not only is the grammar complex, but the parse trees are huge (see the above) with lots of subtrees that are just "reinterpretation" steps, e.g. in x := 5 ; we have

exp -> term -> sfactor -> factor -> NUM(5)

just to interpret 5 as the right-hand side of an assignment. Not only are the parse trees huge, but the parser takes a lot of steps simply to make all those reinterpretations.

In class we looked at what happens with an LR parser if we use the ambiguous grammar above. What happens, of course, is lots of shift/reduce and reduce/reduce conflicts. But why don't we keep the CFSM produced from this grammar, which is nice and small, and augment the machine with some rules it can use to resolve these conflicts; rules that stem from our associativity and precedence expectations. This works! We get a simpler grammar, a smaller CFSM, a faster parser (since it's not making all those extra moves), and a simpler parse tree. Everyone's happy!

The question is, can we generalize this? Can we augment parser generators like bison with a mechanism by which the input tells the tool how to diabiguate? The answer is yes (of course). The yacc/bison input file can include specifications of associativity and precedence for some or all tokens. Each rule gets a precedence which it inherits from the the right-most token in the rule. (Additionally, rules can be assigned a precedence level explicitly.)

Section 5.3.5 The Bison 3.6.3 Manual:
The first effect of the precedence declarations is to assign precedence levels to the terminal symbols declared. The second effect is to assign precedence levels to certain rules: each rule gets its precedence from the last terminal symbol mentioned in the components. (You can also specify explicitly the precedence of a rule. See section Context-Dependent Precedence.)
Finally, the resolution of conflicts works by comparing the precedence of the rule being considered with that of the lookahead token. If the token’s precedence is higher, the choice is to shift. If the rule’s precedence is higher, the choice is to reduce. If they have equal precedence, the choice is made based on the associativity of that precedence level. The verbose output file made by ‘-v’ (see section Invoking Bison) says how each conflict was resolved.
Not all rules and not all tokens have precedence. If either the rule or the lookahead token has no precedence, then the default is to shift.

Associativity of tokens are assigned by "%left token-name", "%right token-name", and "%nonassoc token-name" statements in the bison file. These come before the grammar rules, and the Relative precedence of these tokens is defined by the order in which the statements appear: first in the file has lowest precedence, last in the file has highest precedence. To assign a rule a precedence explicitly, you put "%prec token-name" after the rule's right-hand side. Sometimes you use "dummy" token names just to make a precedence level to assign a rule. For arithmetic we'd say:

%nonassoc LP RP NUM 
%left OPA
%left OPM
%right UPLUSMINUS

%%

exp: exp OPA exp
|    exp OPM exp
|    OPA exp        %prec UPLUSMINUS
|    NUM
|    LP exp RP

%%

Notice how we used the dummy token UPLUSMINUS to get unary minus's as in 3 + (-5*3 - 8) to be handled properly. The scanner never returns such a token, it's sole purpose is to create the proper precedence level.

Suppose I wanted to add exponentiation, with a POW token for the ^ symbol. Add on the the above to make this change. Exponentiation should have the highest possible precedence, so
```
2 + - 3 ^ 2 * 6  <===>  2 + ((- (3 ^ 2)) * 6)
```
and the associativity of exponentiation should work like this:
```
2^3^4  <===>  2^(3^4)
```

Part 1: Using associativity and precedence to simplify pat

Make sure you understand this: Adding precedence and associativity declarations to an ambiguous grammar allows LR parsers to produce the unique parse tree we want. But it only works for some grammars/languages. Fortunately, this includes a lot of what we want for programming languages.

So the simple progamming language above gets a simpler grammer definition now that we can use precedence and associativity.

Grammar for simple language prog1.txt

%nonassoc ID ASN STMTTERM IF ELSE WHILE
          READ WRITE LP RP NUM BCONST LB RB
%left OROP
%left ANDOP
%right NOT
%left COMPOP

%left OPA
%left OPM
%right UPLUSMINUS

%%

res: block | stmt

block: LB stmtlist RB

stmtlist: stmtlist stmt |stmt         

stmt: ID ASN exp STMTTERM
|     IF LP exp RP block
|     IF LP exp RP block ELSE block
|     WHILE LP exp RP block
|     READ ID STMTTERM
|     WRITE exp STMTTERM

exp: exp OPA exp |exp OPM exp 
|    OPA exp %prec UPLUSMINUS    
|    exp COMPOP exp
|    NOT exp | exp ANDOP exp | exp OROP exp  
|    NUM | ID | LP exp RP | BCONST     

%%

# Computes n!
{
  read n;
  f := 1;
  while(n > 0)
  {
     f := f * n;
     n := n - 1;
  }
  write f;
}

Take a look at the parse tree for prog1.txt from this grammar. (It's postscript, so view with the "evince" program.) You should notice that it's a lot simpler! Make sure you understand how this code works!

If when you run this on your laptop you get an error that the program "dot" was not found, install "graphviz".

sudo apt install graphviz

Download lab07.tgz to your lab07 directory. Unpack it like this:

> tar xfz lab07.tgz

For this part of the lab we will be looking at pat.ypp and the associated executable pat. The pat program reads statements in the pat language and either gives the message "parser error!" (when the input is not syntactically correct), or pops up an "evince" window with a rendering of the parse tree.

Here is the original grammar in pat.ypp in case you want to look back at it after you've modified it.

S: seq STOP 
| 
seq: seq FOLD cseq
| cseq
cseq: cseq pseq
| pseq
pseq: pseq COLON NAME
| pseq POP
| aseq
aseq: ATOM
| NAME
| LB seq RB

So cd to the lab07 directory (which the above created) and read the README file to find out how to compile and run the pat program ... as well as to discover a bit about what it does. Play with it a bit: look at the grammar, enter some expressions, and look at the parse trees you get. This grammar encodes the precedences and associativities of operators, which is why it's complex. Here are some questions you should answer before moving on. Use a combination of examining the grammar and testing the original pat program on some example strings to figure them out.

Is "concatenation" left or right associative? Try input a b c; .
Is "fold" (i.e. *) left or right associative? Try input [a b]*[u v]*[x y]; . What pattern should result from this?!?!?
Which has higher precedence, "fold" or "concatenation"? What input could you try to test your theory?
What about the precedence of POP (i.e. "_r")? Try input [a b] * [b c] _r; ... what does that tell you?
What about the precedence of variable binding? Try input a b : X c; ... what does that tell you?
What sequence should result from input [a b]*[c d]:X_r X X_r; ?

Your job for Part 1 is to modify pat.ypp so that S and seq are the only non-terminals! You'll need to figure out the proper associativities and precedences and how to set them in bison.

Hint: Concatenation doesn't have any operator associated with it, so you'll have to use a dummy token and assign the concatenation rule its precedence with the %prec statement.
Hint: You'll probably end up with some shift/reduce errors. You need to resolve these by looking at the bison output file pat.output. You'll see the states of the CFSM, along with the LR items that label those states and the actions for various lookahead symbols. If there's more than one action listed for a lookahead symbol, there's a shift-reduce error, i.e. an ambiguity as to what action to take. You can resolve those ambiguities by assigning the appropriate precedence (or associativity) to to the tokens involved in the shift-reduce erros. For example, if state 14 has this conflict :
```
state 14

    2 seq: seq . FOLD seq
    2    | seq FOLD seq .
    3    | seq . seq
    4    | seq . COLON NAME
    5    | seq . POP

    POP    shift, and go to state 9
    COLON  shift, and go to state 10
    ATOM   shift, and go to state 1
    NAME   shift, and go to state 2
    LB     shift, and go to state 3

    ATOM      [reduce using rule 2 (seq)]
    NAME      [reduce using rule 2 (seq)]
    LB        [reduce using rule 2 (seq)]
    $default  reduce using rule 2 (seq)

    seq  go to state 12
```
I want to make sure that I don't reduce by "2 seq: seq FOLD seq •", since FOLD is supposed to have lowest precedence. I want to shift an ATOM or NAME or LB so that I build up the seq on the right as well as the left before I finally reduce by the FOLD rule. If I give ATOM, NAME and LB higher precedence than FOLD, bison will shift instead of reduce.

Important: If you get stuck struggling to get the proper associativity and precedences - especially for concatenation - take a look at Associtivity & Precedence EI.

Part 2: Create pati, a full interpreter for pat

Look at pati.ypp and pati.lpp. They will be your starting point. Transfer your precedence and associativity grammar from Part 1 (without the semantic actions that create parse trees). Give the command make pat2 to compile this program. Note that pati.ypp has the line:

// This says that semantic values of tokens should be pointers to string vectors
#define YYSTYPE vector<string>*

So your semantic values are pointers to vectors of strings.

Your job for Part 2 will be to make pati a full interpreter for the pat language (like you didn't need to do for lab05 - unless you did the extra credit). If you did Part 1 correctly, this is just going to mean changing the semantic actions for the grammar rules from my Part 1 statements (that build the pretty parse trees you saw) to statements of your own that will build up the sequences that should be the results of pat programs. If you review your lab04 solution, you will see things like this in your bison file:

exp: exp OPA term  { $$ = ($2 == '+' ? $1 + $3 : $1 - $3); }
\_______________/  \_______________________________________/
  grammar rule                 semantic action

... which says that when exp OPA term is reduced to an exp, the semantic value of the new exp ($$) will be either the sum or difference of the semantic values of the right-hand-side exp ($1) and the right-hand-side term ($3). Whether you get sum or difference depends on the semantic value of the OPA token ($2).

Important: Test thoroughly before submitting! Don't use the submit script to do your testing. Already from previous sections we've looked at some inputs and seen what the output patterns should be. They are a starting point for making your own tests! I'll give you one example (you should start much simpler!):
```
$ ./pati
[[w s t]*[a i a]]:X r X_r;
w a s i t a r a t i s a w   
```
Be sure to make use of the helper functions that I've already provided for you (see pat.h for prototypes, definitions in helper.cpp).
Remember: all semantic values in pati will be pointers to vectors of strings, i.e. vector<string>*. Note that if p is a vector<string>*, then to get an element, say at index 0, you'd write (*p)[0] , which derefences p to get the vector object, then subscripts that vector.
The global variable yylval, which the scanner uses to send semantic values to the parser, also has type vector<string>*. You're going to have to send ATOM's and NAME's as pointers to vectors of length 1.
The lab machines all have gcc/g++ version 7.4.0, which defaults to compiling with the C++11 standard. (You might want to check your VM's gcc/g++ version!) This has some nice features that may or may not be useful here or in the future.
- auto - C++ type names can get long. So it's frustrating to write things like
```
vector<string>* p = new vector<string>();
```
  where you have to write vector<string>* for essentially no good reason. After all, the whole point is to have p store the result of the expression new vector<string>(), which means the type of p should simply match that of the right-hand side. Well, the auto keyword makes that happen.
```
auto p = new vector<string>();
```
  tells the compiler to give p the type that matches the right-hand side.
- initializer lists - arrays, vectors, lists and other types have a convenient initializer syntax. For right now, it's the vectors we're after. If we wanted a vector of ints initialized to the first five primes:
```
vector<int> primes{2,3,5,7,11};
```
  If you wanted to heap-allocate a vector of strings initialized to cat, tac, act:
```
vector<string>* p = new vector<string>{"cat","tac","act"};
```
  Of course, we can add the auto keyword to make this:
```
auto p = new vector<string>{"cat","tac","act"};
```
  ... which is pretty compact and easy to follow.
Ask for help if you need it!

Parse trees for programs

Specifying precedence and associativity for LR parsers

Part 1: Using associativity and precedence to simplify pat

Part 2: Create pati, a full interpreter for pat

Submit