
SI413 Class 9: Recursive descent and table-driven top-down parsing


Reading
Section 2.3.2 of Programming Language Pragmatics. Don't worry too much about the portion entitled "Predict Sets", pp. 79-82, but do pay attention to the section "Writing an LL(1) Grammar".

Homework
Suppose you wanted to add a feature to the calculator for computing max, so you could enter:
max{3, (2+1)*7/4, 68 - 3*4*5 - 9/2};
... and get the answer 5. Write out the grammar for our language with this feature added, and tell me which new tokens would be required. (Just augment the grammar below.) Turn in: a list of the new tokens, the augmented grammar, and a parse tree using your augmented grammar on the string 1 + max{3,7,-2};. I'd be pretty excited if you implemented it on top of the recursive descent parser below, by the way, but I won't require it.

Top-down parsing
We'll be looking at two kinds of top-down parsers today, and for both we'll use the following example grammar:
0: stmt → exp STOP
1: exp → term etail
2: etail → OPA term etail
3: etail → λ
4: term → factor ttail
5: ttail → OPM factor ttail
6: ttail → λ
7: factor → LP exp RP
8: factor → NUM
Where λ denotes the empty string. We can use the same flex scanner we used in the previous lab.
[0-9]+   { return NUM; }
[+-]     { return OPA; }
[*/]     { return OPM; }
"("      { return LP; }
")"      { return RP; }
";"      { return STOP; }
<<EOF>>  { return 0; }
[ \t\n]+ { }
.        { cerr << "Unrecognized token!" << endl; exit(1); }

Recursive descent parsers
Recursive descent parsers implement the abstract top-down parsing we discussed last class with a twist: instead of using an explicit stack data structure, we implicitly use the function call stack. Thus, instead of pushing an exp symbol, we call an exp() function (which the system pushes onto the call stack). When there's more than one right-hand side for a given non-terminal, like factor, we select based on the look-ahead.

Here's the parser, rd.cpp:

#include "rd.h"

//-- Prototypes and globals
void stmt(); void exp(); void etail(); 
void term(); void ttail(); void factor(); 
int next = -1;

//-- Helper functions
void perror() { cerr << "Parse error!" << endl; exit(1); }
int peek() { if (next == -1) next = yylex(); return next; }
void match(int tok) { if (tok == peek()) next = -1; else  perror(); }

//-- Grammar rule functions
void stmt() { exp(); match(STOP); }
void exp() { term(); etail(); }
void etail() {
  switch(peek()) {
  case OPA: match(OPA); term(); etail(); break;
  case RP: case STOP: break;
  default: perror(); break;
 }
}
void term() { factor(); ttail(); }
void ttail() {
  switch(peek()) {
  case OPM: match(OPM); factor(); ttail(); break;
  case RP: case STOP: case OPA: break;
  default: perror(); break;    
  }
}
void factor() {
  switch(peek()) {
  case LP: match(LP); exp(); match(RP); break;
  case NUM: match(NUM); break;
  default: perror(); break;
  }
}

int main() { stmt(); }
The header rd.h just defines the token codes and declares the scanner function:

#include <iostream>
using namespace std;

const int NUM = 0, OPA = 1, OPM = 2, LP = 3, RP = 4, STOP = 5;

int yylex();
Finally, here's the flex input file ex1.l:

%{
#include "rd.h"
%}

%option noyywrap

%%

[0-9]+   { return NUM; }
[+-]     { return OPA; }
[*/]     { return OPM; }
"("      { return LP; }
")"      { return RP; }
";"      { return STOP; }
<<EOF>>  { return 0; }
[ \t\n]+ { }
.        { cerr << "Unrecognized token!" << endl; exit(1); }
%%

You can get rd.h, rd.cpp and ex1.l and compile like this:

flex ex1.l
g++ -o mycalc lex.yy.c rd.cpp
	
If the user enters a syntactically correct arithmetic expression, the program simply returns. Otherwise, it prints out an error message. In class we traced out the call stack for at least part of parsing "3+4;". Make sure you're clear on how that works.

Table-driven top-down parsers
Be sure to read the book's description of table-driven top-down parsing; it gives a general-purpose view of things. We looked at the specific example of a table-driven top-down parser for our simple calculator language. The idea is to explicitly model our abstract stack-based top-down parser. We need to have both tokens and non-terminal symbols on the stack, which means we need a common type. We simply assign an int to each of these symbols, so a stack of ints suffices.
const int NUM = 0, OPA = 1, OPM = 2, LP = 3, RP = 4, STOP = 5;
const int stmt = 10, exp = 11, etail = 12, term = 13, ttail = 14, factor = 15;
const int error = -1;
const int startSym = stmt;
Using the above, we can tell that a symbol is a token by testing whether it's less than startSym. The scanner function yylex() will just return the int value of the next token, like always, and we'll use the same helper functions we used above. Here's the parser, which will require a bit of explanation:
void parse()
{
  stack<int> S;
  S.push(startSym);
  while(true)
  {
    int sym = S.top(); S.pop();
    if (sym < startSym)
    {
      match(sym); if (sym == STOP) return;
    }
    else
    {
      int action = parseTable[sym - startSym][peek()];
      if (action == error) perror();
      int k = 0; while(rule[action][k] != -1) ++k;
      while(--k >= 0) S.push(rule[action][k]); 
    }
  }
}
The main idea is that the parser has a table, parseTable, that it uses to determine which grammar rule to use next. It pops a symbol sym, peeks at the lookahead token tok, and parseTable[sym - startSym][tok] tells it which rule to use. We also keep the rules in an array so we can look up a rule and iterate over the symbols in its right-hand side. If a certain symbol/token pair indicates a parse error, we mark it in the table with -1.
//-- These are the grammar rules. -----------------//
const int rule[9][10] = {
  { exp, STOP, -1 },          // Rule 0
  { term, etail, -1 },        // Rule 1
  { OPA, term, etail, -1 },   // Rule 2
  { -1 },                     // Rule 3
  { factor, ttail, -1 },      // Rule 4
  { OPM, factor, ttail, -1 }, // Rule 5
  { -1 },                     // Rule 6
  { LP, exp, RP, -1 },        // Rule 7
  { NUM, -1 }                 // Rule 8
};

//-- parseTable[symbol][token] = predicted rule. --//
const int parseTable[6][10] = {
  /*             NUM OPA OPM  LP  RP  STOP */
  /*  stmt  */ {   0, -1, -1,  0, -1, -1 },
  /*  exp   */ {   1, -1, -1,  1, -1, -1 },
  /*  etail */ {  -1,  2, -1, -1,  3,  3 },
  /*  term  */ {   4, -1, -1,  4, -1, -1 },
  /*  ttail */ {  -1,  6,  5, -1,  6,  6 },
  /* factor */ {   8, -1, -1,  7, -1, -1 }
};
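To see the table in action, here's a hand trace (worked out from the rules and table above, not from the original notes) of parse() on the input "3+4;", with the stack top on the left:

```
Stack (top on left)        Lookahead   Action
stmt                       NUM         predict rule 0
exp STOP                   NUM         predict rule 1
term etail STOP            NUM         predict rule 4
factor ttail etail STOP    NUM         predict rule 8
NUM ttail etail STOP       NUM         match NUM
ttail etail STOP           OPA         predict rule 6 (λ)
etail STOP                 OPA         predict rule 2
OPA term etail STOP        OPA         match OPA
term etail STOP            NUM         predict rule 4
factor ttail etail STOP    NUM         predict rule 8
NUM ttail etail STOP       NUM         match NUM
ttail etail STOP           STOP        predict rule 6 (λ)
etail STOP                 STOP        predict rule 3 (λ)
STOP                       STOP        match STOP; accept
```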

The cool thing about this setup is that the parse() function is the same for any LL(1) language! All that changes from grammar to grammar is the list of rules and the table! Automatically generated top-down parsers are of this kind. Of course generating the table takes some doing. In class we talked about FIRST and FOLLOW, which the book describes in considerably more detail.

You can get rd.h, rd.cpp and ex1.l and compile like this:

flex ex1.l
g++ -o mycalc lex.yy.c rd.cpp
	
... if you want to test this version out.


Christopher W Brown
Last modified: Wed Sep 23 22:27:11 EDT 2009