Try Firefox!

SI413 Lab 5: Recursive descent parsing

A different kind of language
This lab you'll be implementing an interpreter for a special language, called pat using recursive descent parsing. The language is a simple language for defining sequences. Here are some examples:
> ./pat
a b c;
a b c 
[a b c]_r;
c b a 
[a b b d]:X;
a b b d 
X X_r;
a b b d d b b a 
[a b c [d e]:X f g h] X_r;
a b c d e f g h e d 
Pattern symbols (called "atoms") are strings beginning with lower-case letters (such as 'a', 'b', or 'cat'), pattern variables are strings beginning with upper-case letters. Square brackets are used for grouping. A sequence followed by : NAME is assigned to a variable as a side effect. That variable is in scope from that moment on until the interpreter is exited (with ctrl-d). The _r operator reverses a sequence. It's precedence (and variable assignment) are higher than concatenation, so a b [c d]_r gives a b d c, not d c b a. Finally, there's an operator * that interleaves two sequences, so
[a b c] * [x y z];
a x b y c z
This operator has lowest precedence, so the [ ]'s above are unnecessary. If the sequences have different lengths, the shorter is padded to the same length by repeating the last entry. Here is the grammar for a pat program, which is just a sequence of ;-terminated pat expressions.

S     -> seq STOP S | λ                    ← STOP is ;
seq   -> seq FOLD cseq | cseq              ← FOLD is *
cseq  -> cseq pseq | pseq
pseq  -> pseq COLON NAME | pseq POP | aseq ← POP is _r
aseq  -> ATOM | NAME | LB seq RB           ← LB and RB are [ and ]

A recursive descent paser for pat: Part I
In this part of the lab, I provide a flex scanner file and an abstract grammar for the language, you create the recursive descent parser. For the moment, your parser should simply accept valid program strings and print out an error message & exit in the presence of a syntax error.

The grammar above is probably the natural grammar for this language, at least if you want to specify the associativities and precedences of the language's operators. However, it is not appropriate for LR (top-down) parsing. Please stop for a moment, look at the grammar and make sure you understand why. Which rules are troublesome?

For purposes of this lab, I'm going to give you a rewriting of the grammar in a form that is amenable to top-down parsing (though I'd like you to be able to do it yourselves!):

S     -> seq STOP S | λ
seq   -> cseq stail
stail -> FOLD cseq stail | λ
cseq  -> pseq ctail 
ctail -> pseq ctail | λ
pseq  -> aseq ptail
ptail -> COLON NAME ptail | POP ptail | λ
aseq  -> ATOM | NAME | LB seq RB
You write a recursive descent top-down parser. You should refer to Class 9 to see how this works. I've got the following shell for you:

/*** This is file: pat.l ***/
#include "pat.h"
#include <iostream>
using namespace std;

%option noyywrap

[a-z][a-zA-Z0-9\^]*    { return ATOM; }
"*"      { return FOLD; }
[;]      { return STOP; }
":"      { return COLON; }
[A-Z][a-zA-Z0-9\^]*   { return NAME; }
"_r"     { return POP; }
"["      { return LB; }
"]"      { return RB; }
<<EOF>>  { return 0; }
[ \t\n]+ { }
.        { cerr << "Unrecognized token!" << endl; exit(1); }
/*** This is file: pat.h ***/

// Token labels
const int ATOM=1, FOLD=2, STOP=3, COLON=4, 
          NAME=5, POP=6, LB=7, RB=8;

// Prototype for yylex(), aka getNextToken()
int yylex();
/*** This is file: pat.cpp ***/
#include "pat.h"
#include <iostream>
using namespace std;

int next = -1; // store next token

//-- Helper functions
void perror() { 
  cerr << "Parse error!" << endl; 
  exit(1); }
int peek() { 
  if (next == -1) next = yylex(); 
  return next; }
void match(int tok) { 
  if (tok == peek()) next = -1; 
  else  perror(); }

You might also want to create a Makefile:

	flex pat.l
	g++ -o pat lex.yy.c pat.cpp
Don't forget the tab's! From this skeleton you actually need to write the recursive descent parser! When it works, you should be able to enter statement after statement with no feedback, until a syntax error cases a message and aborts the program. For example:
> ./pat
a b c;
[a b c]:X X_r;
[ a b : c d ];
Parse error!

A recursive descent paser for pat: Part II
Now it's time to build a functioning interpreter for the pat mini-language. I suggest you look at this explanation of recursive descent for the simple calculator language. See how semantic values (which are all int's for the calculator) are handled? See how we evaluate across those akward "tail" productions?

Now, for your interpreter I suggest that the tokens have C++ string objects as semantic values, and that non-terminals have vector<string> objects for semantic values, i.e. that's what the grammar rule functions return. Because I like you, I've implemented a few functions you may find helpful, they implement the fold (*) operation, concatenation and the reverse (_r) operattion:

#include <vector>
#include <string>
#include <algorithm>

vector<string> fold(const vector<string> &A, const vector<string> &B)
  vector<string> res;
  for(unsigned long int i = 0; i < A.size() || i < B.size(); ++i)
  { res.push_back(A[min(i,A.size()-1)]); res.push_back(B[min(i,B.size()-1)]); }
  return res;
vector<string> concat(const vector<string> &A, const vector<string> &B)
  vector<string> res;
  for(int i = 0; i < A.size(); ++i) res.push_back(A[i]);
  for(int i = 0; i < B.size(); ++i) res.push_back(B[i]);
  return res;
vector<string> rev(const vector<string> &A)
  vector<string> res;
  for(int i = A.size() - 1; i >= 0; --i) res.push_back(A[i]);
  return res;
  1. Instead of implementing the whole thing at once, pretend that the "tail" rules all just go to λ, and ignore variables. Just return empty vector<string>'s for the cases you want to ignore. Get that working then add the non-λ ctail → pseq ctail rule. Get that working then add the rest. Leave variables for last.
  2. Pass vectors by reference ... make them const too, it'll just make life easier. Remember to #include string and vector and algorithm (which defines min & max).
  3. Don't define variables inside a switch/case statement. There are potential problems that are best simply avoided. Instead, define them before the switch, and assign values inside the switch/case.
  4. You'll need a symbol table for pat-language variables. Once again we'll use maps, but this time we're mapping strings to vectors of strings, i.e.:
    map<string, vector<string> > symTable;
    Make sure you use that spaces in between > and >. Why? Becauase C++'s scanner treats >> as a single token, not as two >'s.

What to turn in
Submit a working version of the pat interpreter. Be sure to document it, though. Your makefile should make the program, of course, and it should still be called pat. There should be a README file that contains your name as well as a brief description of the program.

Assuming your lab is in a directory /home/m101111/si413/lab05, you would submit as follows:

cd /home/m101111/si413
/courses/wcbrown/etc/si413-submit lab05
You may resubmit later, I'll simply look at the submission with the latest time stamp.

Christopher W Brown
Last modified: Wed Sep 23 14:43:36 EDT 2009