Lab 4: Intro to Flex and Bison

This is the archived website of SI 413 from the Fall 2013 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

This lab is due at 2359 on Thursday, 19 September. It should be submitted as "413 lab 04", with the files named properly according to the submit script. See the submit page for details. You are highly encouraged to submit whatever you have done by the end of this lab. Remember that duplicate submissions are fine; I will just grade the latest one when the time comes.

Starter Code

You can get all these files at once by downloading the file lab04.tar.gz and running tar xzvf lab04.tar.gz

Testing

Until further notice, all our labs will consist of writing interpreters for (at first) simple programming languages. You are still expected to write two good tests per exercise, submitted in the folder tests/, and with file extension .txt.

Each test file will look like:

The input is
up here.
(Possibly on multiple lines.)
;
The expected output is down here.
See the line with the semicolon?
That's what separates them.

This is mostly the same as our previous labs, except that instead of the top part being a Scheme code fragment, the top part will be input (passed to stdin) to your program. Again, each test file name should start with something like ex01 to indicate which exercise is being tested.

You can use the iotest command to run a single test manually and see the results, just like schemetest worked before.

Flex basics and flex/bison interactions

Flex is a tool that generates scanners. One of its main purposes is to provide yylex() functions for bison-generated parsers. Flex and bison are GNU versions of lex and yacc, which are traditional Unix development tools. They both have manuals online: flex manual, bison manual.

The files calc.lpp and calc.ypp in the starter code define a simple calculator program. It follows a similar grammar to what we'll be going through in Unit 4, and is pretty self explanatory: addition, subtraction, multiplication, and division with numbers - using parenthesis too - is what's supported at the moment.

What do flex and bison do?

Look at the Makefile. We start off with the file calc.ypp that defines the grammar and action rules for our calculator. The call

bison -d calc.ypp
creates files calc.tab.cpp and calc.tab.hpp. The first contains the C++ code for the parser (which is a function called yyparse()), the second is a header file that contains, among other things, the definitions of the token names NUM, OPA, etc., and the variable yylval that we use to pass the bison code the semantic values of tokens (for instance, the numerical value). The flex file calc.lpp needs those definitions so that the yylex() function it defines for bison can pass back the information it needs to pass back.

The next thing we have is the flex file calc.lpp. The call

flex -o lex.yy.cpp calc.lpp
creates the file lex.yy.cpp that contains, among other things, the definition of the yylex() function. Of course we haven't talked about flex, so this file is new. The basic idea is that you write regular expressions for the different token types, and actions for each regular expression. The flex generated scanner code tries to match characters from the current input stream to these regular expressions, and when a match is found, it executes the associated action code. The variable yytext contains the string (in the C sense, i.e. null-terminated char array) of characters that were matched. When more than one match is possible, it breaks ties by going with the longest match - i.e., it follows the "maximal munch" rule.

When there are several "longest" matches, it goes with the regular expression listed first in the flex file. Of course, you should realize that flex is simply creating the same program implementing a finite automaton that we talked about in class. The difference is that we users specify tokens by regular expressions. So flex has to convert the regular expressions into finite automata for each token type, which are then combined into a single finite automaton. This is then converted, automatically, into a program.

Exercises

  1. Examine these files carefully and make sure you understand what flex is doing and how flex and bison interact and communicate. Run make and confirm that the calc program works as intended. (Nothing to turn in or test for this part.)

Adding Features to the Calculator

For the remainder of this lab, you will add features to the simple calculator program. These features will involve significant additions to the scanner (calc.lpp), and some to the parser (calc.ypp) as well.

Exercises

  1. Modify the scanner so that it accepts integers in either decimal or binary. Specifically, a string of only 0's and 1's that ends in the letter "b" (with no whitespace in between) should be interpreted as a binary integer. For example:

    $ ./calc
    > 10 + 3;
    13
    > 10b + 3;
    5
    

    This should only require changing the flex scanner specification file. The cstdlib function strtol will be useful here (check it out with man strtol). To convert a binary integer stored in the C string called str, you can write strtol(str, NULL, 2).

  2. Modify the scanner and parser so that terminating a statement with ";b" instead of ";" results in the output being printed in binary. To make your life easier, I've already included the following function in the starter code:
    // Prints n as a signed binary number
    void printbin(int n) {
      if (n < 0) { 
        resout << '-'; 
        n = -n; 
      }
      int bit = 1;
      while (bit > 0 && bit*2 <= n) bit *= 2;
      while (bit > 0) {
        if (bit <= n) {
          n -= bit;
          resout << '1';
        }
        else resout << '0';
        bit /= 2;
      }
      return;
    }
    Now this is going to require modifying the bison file calc.ypp as well as the flex file. It should be a simple change if you understand the system, though: you need to define a new token (I might call it STOPB, or something like that), you need to modify your flex file to recognize the new token, and you need to create a new rule in the grammar to use the token.
    $ ./calc
    > 10+3*5;
    25
    > 10+3*5;b
    11001
    > 10b+3*5;b
    10001
    > 10b+3*5;
    17
    
  3. Next we add comments! They're not too exciting in a simple calculator, but comments are always a good thing to have. C-sytle comments /* like this */ are not completely straightforward with flex. Why? Well, suppose you added the flex rule:
    "/*".*"*/" { }
    
    This doesn't work because .* doesn't capture \n's, and the whole point of C-style comments is to allow multi-line comments. So we make it like this:
    "/*"(.|\n)*"*/" { }
    
    This doesn't work either, and the problem is more fundamental. Remember that the rule is that the longest match possible is the one flex makes, so in
    /* start */ x = 4; /* end */
    
    flex thinks the whole string is a comment. In this case, we don't want maximal munch! (We have the same problem with string literals.) So we'll take the easy way out and use another common comment convention: A "#" character starts a comment and the comment extends to the next newline character. Here's an example session:
    $ ./calc
    > 1+2+3+4+5+6+7+8+9+10; # sum 1..10 computed explicitly
    55
    > 10*(10+1)/2; # sum 1..10 computed by Gauss's closed form
    55
    
    Note that although the newline is part of the comment, the comment does separate tokens. So
    1000# beginning of one million
    000# end of one million
    
    is interpreted as 1000 followed by 0, which gives an error. Not a lexical error, though, rather a parse error!

Food for thought

One of the real advantages of using a tool like flex as opposed to writing your scanner by hand is that adding new tokens usually has no effect whatsoever on what you've already done. You simply add a new line to your flex file. When you do this by hand, that's not the case: a new token may change the underlying finite automaton substantially.

One variable

Our calculator is getting a bit more interesting. One thing it doesn't allow, though, is variables. So let's add variables. We'll start with just a single variable called an accumulator. Observe that there are two ingredients to supporting a variable:

Exercises

  1. (Note: this exercise requires at least 4 test cases instead of the usual 2.) Add assignment and reference to a single accumulator variable to your calculator, as described above. Then we should be able to do:
    $ ./calc
    > ! 5;
    > @;
    5
    > ! 3 * 2;
    > @;
    6
    > @ + 4;
    10
    

Many variables

Now let's allow many variables instead of just one. This will involve creating a symbol table to map variable names to values.

We won't require any declarations; having a name on the left-hand side of an assignment will be enough to create it. Syntax-wise, we have some decisions to make:

Semantically, we have a few decisions as well:

Finally, we have some technical implementation issues to worry about.

Exercises

  1. (Note: this exercise requires at least 4 test cases instead of the usual 2.) Add a symbol table and variables as described above to your calculator program. Then you should be able to do something like:
    $ ./calc
    > x := 7;
    > y := 93;
    > # Now we compute the sum x + ... + y
    > y*(y + 1)/2 - (x-1)*x/2;
    4350
    > # Now in binary!
    > y*(y + 1)/2 - (x-1)*x/2;b
    1000011111110
    
    This is quite an undertaking! As a review of all that is written above, here are the steps you will need to take. If you stop and test along the way (when possible), it'll make your life much easier.
    1. Define two new token types NAME and ASN in the bison file.
    2. Make two new lines in the flex file for your new tokens. Don't worry about the semantic value yet; just return the token type.
    3. Make two new grammar rules in the bison file. The first is a production of stmt as shown above. The second is a new production of factor that's just a variable NAME. You won't be able to write the code for these grammar rules just yet, so for now just have them print out a helpful message.
    4. Make your symbol table. This will be declared with a line like
      map<string, int> ST;
      in the top part of your bison file, near the declaration for keepgoing.
    5. Now write the code for your grammar rules. For now, you won't be able to get the actual variable name strings! So I would just pretend like every variable is named "x" or something like that. You will have to assign to T[name] whenever a variable is assigned, and do a find whenever a variable is accessed (this would be in the rule for factor).
    6. OK, time to pass the actual variable names from the scanner to the parser. You will need to add a new line to the %union part of the bison file. Then open the scanner file and use the strdup function as discussed above to actually assign to yylval.
    7. Now just go back and modify the code for the grammar rule that does assignment to use the actual variable name (semantic value) instead of "x" or whatever you used.
    8. Test, test test!

Last part: A few more features

Exercises

  1. Allow single-line strings as comments, which should be ignored. If you use double-quotes like "a comment", then we should be able to do things like:
    > ./calc
    5 + 3; "A comment"
    8
    5 + "Another comment" 3;
    8
    
  2. Make variable names case-insensitive. So for example we can do
    > ./calc
    var := 10;
    VAR;
    10
    vAR := 11;
    var;
    11
    
  3. OPTIONAL: Add in support for two new math operators: exponentiation and remainder (i.e., mod).
  4. OPTIONAL: Allow numbers to be entered in scientific notation using the letter "e". Of course these will still be ints, so still no decimal points and the exponent can't be very big. It's up to you to specify the exact syntax, but this should not require any new grammar rules. For instance,
    > ./calc
    21e2;
    2100
    5e6;
    5000000
    
  5. OPTIONAL: Any other feature that you choose! (Make sure it doesn't cause something else to no longer work though.)