Try Firefox!

SI413 Lab 4: Lexical analysis with flex


Flex basics and flex/bison interactions
Flex is a tool that generates scanners. One of its main purposes is to provide yylex() functions for bison-generated parsers. Flex and bison are GNU versions of lex and yacc, which are traditional Unix development tools. They both have manuals online: flex manual, bison manual.

  1. Read Example 1 and make sure you understand what flex is doing and how flex and bison interact and communicate.

  2. Create a directory calc1 and save the three Example 1 files ex1.ypp, ex1.l, and Makefile into it. Compile and run. [WARNING: throughout the course, make sure the browser doesn't add some extension like .txt to the filenames. If necessary, change the file type to "All files" before saving.]

  3. Modify the above by adding the "b" for binary constants, as per our homework. This should change only the flex file. Notice how the "longest-match"/"maximal-munch" policy is our friend here!

  4. Modify the above so that terminating a statment with ";b" instead of ";" results in the output being printed in binary. To make your life easier ...
    // print n as a signed binary number
    void printsbin(int n)
    {
      if (n < 0) { cout << "-"; n = -n; }
      char res[33] = { '\0' };
      int i;
      for(i = 31; n != 0; --i, n /= 2)
        res[i] = (n & 1 ? '1' : '0');
      cout << (i == 31 ? "0" : res + i + 1);
    }
    

    Now this is going to require modifying the bison file ex1.ypp as well as the flex file. It should be a simple change if you understand the system, though: you need to define a new token (I might call it STOPB, or something like that), you need to modify your flex file to recognize the new token, and you need to create a new rule in the grammar to use the token.

    > ./mycalc
    10+3*5;
    25
    10+3*5;b
    11001
    10b+3*5;b
    10001
    10b+3*5;
    17
    
    Once you've got this working, flag me down and show me!

  5. Next we add comments! They're not too exciting in a simple calculator, but comments are always a good thing to have. C-sytle comments /* like this */ are not completely straightforward with flex. Why? Well, suppose you added the flex rule:
    "/*".*"*/" { }
    
    This doesn't work because .* doesn't capture \n's, and the whole point of C-style comments is to allow multi-line comments. So we make it like this:
    "/*"(.|\n)*"*/" { }
    
    This doesn't work either, and the problem is more fundamental. Remember that the rule is that the longest match possible is the one flex makes, so in
    /* start */ x = 4; /* end */
    
    flex thinks the whole string is a comment. In this case, we don't want maximal munch! (We have the same problem with string literals.) So we'll take the easy way out and use another common comment convention: A "#" character starts a comment and the comment extends to the next newline character. Here's an example session:
    > ./mycalc
    1+2+3+4+5+6+7+8+9+10; # sum 1..10 computed explicitly
    55
    10*(10+1)/2; # sum 1..10 computed by Gauss's closed form
    55
    
    Note that although the newline is part of the comment, the comment does separate tokens. So
    1000 # beginning of one million
    000 # end of one million
    
    is interpreted as 1000 followed by 0, which gives an error. Not a lexical error, though, rather a parse error!

Food for thought: One of the real advantages of using a tool like flex as opposed to writing your scanner by hand is that adding new tokens usually has no effect whatsoever on what you've already done. You simply add a new line to your flex file. When you do this by hand, that's not the case: a new token may change the underlying finite automaton substantially.

Variables: Adding a symbol table
Our calculator is getting a bit more interesting. One thing it doesn't allow, though, is variables. So let's add variables. We won't requite any declarations, having a name on the left-hand side of an assignment will be enough to create it. Syntax-wise, we have some decisions to make:

  1. What should we use as an assignment operator?   Some popular choices have been :=, =, to as in x to 1, and ← as in x ← 1. You can decide for yourself.

  2. What strings can be used as variable names?   In Scheme almost anything is okay as a variable name. (Can you think of a string that isn't OK?) In Perl all variable names for numeric values like ours are prefixed with a $ symbol. In C/C++ and Java you have letters, digits and underscores, along with the restrictions that a number cannot begin a variable name. In defining something like this, the "character class" syntax for regular expressions is nice. For example, if you want C/C++ style variable names you'd use:
    [a-zA-Z_][a-zA-Z0-9_]*
    
    Let's require variable names to be alpha-numerical, with a non-numeric initial character.

Semantically, we have a few decisions as well:

  1. Should assignment be an expression, i.e. something that can be nested within other expressions?   Let's say no. We'll fit assignment into the grammar as: res: NAME ASN exp STOP, where NAME and ASN are new tokens.

  2. What should happen when an uninitialized variable name is used, i.e. when a name gets used before it's been given a value?   Let's print an error message and abort.

Finally, we have some technical implementation issues to worry about.

  1. How will we implement a symbol table? The C++ Standard Template Library (STL) is an extremely powerful language feature ... not always super easy to use, but very powerful. One of its most useful "containers" is the map. A "map" is essentially the abstract data type the red-black tree was invented to implement. It's a lot like an array that you don't need to index by consecutive integers, a structure called an associative array. Here's an example that uses a map mapping C++-style strings to int's:
    #include <string>
    #include <iostream>
    #include <map>
    using namespace std;
    
    int main()
    {
      map<string,int> T;
      string s;
      for(int i = 1; cin >> s && s != "done"; ++i)
        T[s] = i;
      cout << endl;
      
      cin >> s;
      if (T.find(s) == T.end())
        cout << "String " << s << " was not entered." 
             << endl;
      else
        cout << "String " << s << " was the "
             << T[s] << "th string entered." << endl;
    }
    
    foo
    bar
    star
    cat
    done
    
    cat
    String cat was the 4th string entered.
    Note that you must #include <map> to use maps. We can simply declare our symbol table as a map in the initial declarations section of the bison file.

    The way this works is that T is initially empty, but as soon as you subscript it with something, like T[string("the")], the mere act of subscripting creates an entry with key "the". One thing about this is that you can't really test to see whether there's already an entry for "the" this way. The way you make that test is: if (T.find(string("the")) == T.end()) ....

  2. How will we pass names as semantic values? The natural way to pass names around as semantic values is to add a field of type string (i.e. C++-style string) in the %union statment, but that doesn't work. Bison is unable to handle any kind of object there that has a constructor. So ... we will pass around C-style strings instead. So you'll add something like
    %union {
      int val;
      char sym;
      char *str;
    };
    to your bison file so that you can handle semantic values that are C-style strings. One pitfall: in your flex file, yytext is a C-style string giving the text that matched the token's pattern. However, you can't just assign that to yylval.str, because flex may use yytext's memory for later scanning. Instead you need to make a copy of yylval.str. The strdup function from C's string library (that'd be #include <cstring> in C++) does that for you. You can check it out with man strdup. This approach of passing C-style strings as semantic values is fine for this lab, but it has some long term issues, since we may "lose" strings before we have a time to free the memory they use.

  3. Teaching bison about char* semantic values. In your action rules, you've already used things like $1, $2, and $$. Did you notice that seemed to magically do the right thing even though some of the values were integers, and some were characters? No magic really -- the type of each symbol (terminal or non-terminal) depends on what is defined with the %token keywords, just under the union declaration. So.... you'll need to do the right thing with a %token declaration to deal with the 'str' part of the union that you just created above.

When you've got this all working, you should be able to run something like this:

> ./mycalc
n1 := 7;
n2 := 93;
# Now we compute the sum n1 + ... + n2
n2*(n2 + 1)/2 - (n1-1)*n1/2;
4350
# Now in binary!
n2*(n2 + 1)/2 - (n1-1)*n1/2;b
1000011111110

Last part: your choice
Now implement a mystery feature of your choice.
What to turn in
Submit a working version of the calculator with variables, binary input/output features and the mystery feature of your choice. Be sure to document the mystery feature. Your makefile should make the program, of course, and it should still be called mycalc. There should be a README file that contains your name as well as a brief description of the calculator program. This is where your new feature should be described AND be sure to include some exmaple input in the README that shows off your new feature.

Submit your whole lab04 directory electronically by following the directions on the course home page. AND turn in hardcopy printout of your files.


Christopher W Brown
Last modified: Tue Oct 6 11:19:45 EDT 2009