Class 28: Parsing with bison


Reading
None.
Homework
Turn in the source code for a solution to Lab Problem 2 below and a screen capture showing it compiling and running. If you can do it, I'd be happy to get a Lab Problem 3 solution instead, or maybe something with a few extra features like ^ for exponentiation, etc.

bison: a parser generating program
In this lab we'll look at a practical tool using grammars and the theory of context free languages. This tool is called bison, and it is a standard Unix tool. It's the Gnu version of a program called yacc (yet another compiler compiler) that's been a part of the Unix environment since before any of you were born. However, it still gets used all over the place and has been ported to other languages, like Java. Bison/yacc is used for a variety of purposes, but principally for building parsers for compilers and interpreters. Essentially, you give bison a grammar, and it gives you a C++ function that parses strings using that grammar. This function is, more or less, a push down automaton.

The basic idea is this: you write a grammar along with, possibly, some C++ code. You might call this source file file.ypp. Then you run bison on this file, much like a compiler. The result is a file of C++ source code called file.tab.cpp, which contains all the C++ you put in file.ypp, as well as the C++ function definition for a function yyparse, which is a parser built from your grammar. You then simply compile file.tab.cpp using g++ to get an executable program.

A first bison example
For our very first example, we are going to use bison to write a program that determines whether a simple arithmetic expression using +, - and single-digit numbers is syntactically correct. Now, your grammar in bison is a grammar whose terminals are tokens, not characters. You get to determine what the tokens are, of course, but they're just names and do not directly correspond to chars. The grammar we'll use is exp → exp OPA NUM | NUM, where the terminals (or tokens) are Σ = {OPA,NUM}. The token OPA stands for an addative operator, i.e. + or -. The token NUM stands for a 1-digit number.

Below is a bison input file for this problem. If you name this file lab1.ypp, you create your program and run it like this:

valiant[1] [~/]> bison lab1.ypp
valiant[2] [~/]> g++ -o lab1 lab1.tab.cpp
valiant[3] [~/]> lab1
1+2-3
valiant[4] [~/]> lab1
1+2+-3
Error!
valiant[5] [~/]> 
A bison input file consists of three basic sections. I've presented this input file with different background colors to show the different sections. The characters printed in red are special markers that tell bison where the different sections start and end.

lab1.ypp: a bison input file
%{
#include <iostream>
using namespace std;
int yylex();
int yyerror(const char *p);
%}
This section of your bison file contains C++ code delimited by %{ and %}. In our case we're simply putting include's and the prototypes for two functions - yylex and yyerror - that bison requires us to define. The definitions appear later.
%token NUM OPA

%%

exp: exp OPA NUM 
   | NUM
   ;

%%
This section of the code is the actual grammar. First it lists the tokens, which are the terminals of the grammar. These follow the %token. Then it gives the grammar rules, sandwiched between %%'s. The syntax of the grammar rules looks much like what we're used to, except. The rule A → u | v | w is written as A: u | v | w ;. Whitespace and newlines don't affect anything.
int yylex()
{
  char c = cin.get();

  if ('0' <= c && c <= '9') { return NUM; }
  
  if (c == '+' || c == '-') { return OPA; }

  if (c == '\n') return 0;

  cerr << "Unknown Token!" << endl;
  return -1;
}

int yyerror(const char *p) { cerr << "Error!" << endl; }

int main()
{
  yyparse();
  return 0;
}
This section contains more C++ code. The primary purpose of this code is to define two functions that bison uses to do its parsing: yylex and yyerror. Bison calls yyerror when it's unable to parse a string - i.e. the string is not in the language. Bison calls yylex to get the next token. Remember, bison parses strings of tokens, not strings of characters. So it's up to yylex to read characters from the input and determine what the next token is. Notice that I return 0 when '\n' is encountered. This tells yyparse that the input is finished. Also notice that yylex doesn't deal with whitespace. When you use this program you can't include any spaces.

Also in this section is the function main that calls the function yyparse. This function is what bison is all about. It calls yylex over and over to get tokens, and tries to parse the resulting string of tokens according to the given grammar. If it fails the yyerror function gets called. If it succeeds, yyparse it simply returns.

Lab Problem 1: Adding * and /
Your first problem is to add to lab1.ypp above so that * (multiplication) and / (division) are allowed in our arithmetic expressions. If you're not careful about how you add this extended functionality, your grammar might be ambiguous. Bison doesn't like this, and will give you an error. Make sure your program complains for strings like 2+3*4*+3. Note: There's a cheap way of adding this functionality. but if you follow the cheap way, it may bite you later in the lab!

Adding actions to grammar rules
Bison allows you to add actions - in the form of C++ statements - to grammar rules. Every time the grammar rule gets used in parsing, the action is performed. To really understand this, we need to think back to how we converted a grammar to a PDA. The PDA would push terminals (tokens) onto a stack, and when a sequence of symbols on the stack matched the right-hand-side of a rule, they were popped off the stack and replaced with the left-hand-side's symbol. This pop and replace is referred to as a reduction, and a rule's C++ code is executed whenever a reduction by that rule takes place.
lab2.ypp: notice the actions associated with the rules Compiling and running lab2.ypp
%{
#include <iostream>
using namespace std;
int yylex();
int yyerror(const char *p);
%}

%token NUM OPA

%%

exp: exp OPA NUM { cout << "Rule 1 used!" << endl; }
   | NUM         { cout << "Rule 2 used!" << endl; }
   ;

%%

int yylex()
{
  char c = cin.get();

  if ('0' <= c && c <= '9') { return NUM; }
  
  if (c == '+' || c == '-') { return OPA; }

  if (c == '\n') return 0;
  return -1;
}

int yyerror(const char *p) { cerr << "Error!" << endl; }

int main()
{
  yyparse();
  return 0;
}
valiant[1] [~/]> bison lab2.ypp
valiant[2] [~/]> g++ -o lab2 lab2.tab.cpp
valiant[3] [~/]> lab2
2+3-1
Rule 2 used!
Rule 1 used!
Rule 1 used!
valiant[4] [~/]>

What happened was the 2 was pushed on the stack, then reduced with Rule 2, resulting in the first line of output, and exp was pushed onto the stack. Then the + and the 3 were pushed, which gave us a stack matching Rule 1. So there was a reduction with Rule 1, resulting in the second line of output and the stack afterwards contained only the nonterminal exp. Then - and 1 were pushed and a final reduction with Rule 1 took place.

Now the actions we attached to rules in the previous example weren't very interesting. More interesting would be attaching actions that allow us to evaluate expressions. The thing about this, is that in the rule

exp → exp OPA NUM
we need to know the value of the exp on the right-hand-side, and the values of the tokens OPA and NUM. Fortunately, Bison allows you to attach values to each occurences of symbols in your grammar. When yylex returns a token, Bison will take the contents of an integer global variable (gasp! Did he say global variable?) yylval as the value to be attached to that token. But what about the value of non-tokens like the exp from the right hand side? Well, if you number the symbols on the right-hand-side, bison stores the values of the symbols (if it knows them) in $1, $2, etc. So inside the action code, $1 refers to the value of the exp from the right hand side. When a reduction takes place, you can assign a value to a special variable named $$ to tell bison what value to attach to the symbol that you reduce to, i.e. the left-hand-side of the rule. Putting this all together, you get the following bison source file. One thing I've added is an "=" at the end of the expression. That tells bison to stop parsing and print out the answer.

lab3.ypp: evaluating simple arithmetic expressions Compiling and running
%{
#include <iostream>
using namespace std;
extern int  yylval; // Shhh! a global variable!
int yylex();
int yyerror(const char *p);
%}
valiant[1] [~/]> bison lab3.ypp
valiant[2] [~/]> g++ -o lab3 lab3.tab.cpp
valiant[3] [~/]> lab3
2-3+5-5-2=
-3
valiant[4] [~/]> lab3
2-3+5++2=
Error!
valiant[5] [~/]>
%token NUM OPA STOP

%%

res: exp STOP    { cout << $1 << endl; }
   ;

exp: exp OPA NUM { if ($2 == '+') 
                     $$ = $1 + $3;
                   else 
                     $$ = $1 - $3; }
   | NUM         { $$ = $1; }
   ;

%%
int yylex()
{
  char c = cin.get();

  if ('0' <= c && c <= '9') { yylval = c - '0'; return NUM; }
  
  if (c == '+' || c == '-') { yylval = c; return OPA; }

  if (c == '=') { return STOP; }

  if (c == '\n') return 0;
  return -1;
}

int yyerror(const char *p) { cerr << "Error!" << endl; }

int main()
{
  yyparse();
  return 0;
}

Lab Problem 2: Evaluating with * and /
Your next job is to agument your solution to the first part of your lab (or augment lab3.ypp above) so that you actually evaluate arithmetical expressions with +,-,*,/. You've got to get the precedence and associativity right, so:
3-5*2= should give -7, not -4
8/2/2= should give 2, not 8
	
Lab Problem 3: Add parentheses
Your final lab problem is to add on to your previous solution so that parenthesized expressions are also handled. For example, 3*(3+4)= should be accepted, and evaluate to 21! Arbitrarily nested parenthesese should work, e.g. (2+((3)))*(5)= should be accepted and give 25.

And then ...
You could add exponentiation. You could lose the 1-line limit (i.e. not return 0 for newlines, just ignore them) and let the user do calculations until he types q for quit. You could introduce variables! You could ... create your own programming language. The real next step, though, is to stop using rule-actions to evaluate expressions, and start using them to create parse trees. Then, you can evaluate the parse trees.


Christopher W Brown
Last modified: Wed Nov 4 08:44:27 EST 2009