Reading

Section 2.2 of Programming Language Pragmatics

Homework

Do hw08. Please have Problem 2 done before lab (this is a faster turnaround than usual, but it has to be this way). There is a small program for Problem 2, which I would like submitted by that point. Problem 1 and the written submission will be due at the beginning of the next class, as usual.

Scanning / tokenization / lexical analysis

Recall, scanning is the process of breaking up a stream of symbols into a stream of tokens. This will usually include stripping out comments, inserting "include" files, etc. If you're reading from an input stream like stdin, the fundamental operation is something like getNextChar(), i.e. a function that returns to the caller the next character from the input stream. For scanning, the fundamental operation is something like getNextToken(), i.e. a function that returns the next token from the input stream.

However, what type of object should "a token" actually be? Usually, you give a constant int id to each kind of token, and that id is certainly returned. So, for example, you might say that there is a token NUM (for numbers) that has id 1, and a token OP (for arithmetic operators) that has id 2. Given input 34 + 172, the first call to get the next token would produce 1, the second call would produce 2, and the third call would produce 1, and you would know that you had a NUM then OP then NUM. However, for at least some tokens you need more than that id. For example, a NUM token needs the actual characters that were lumped together as a NUM, or at least the int or double they define. For a string constant expression, you may prefer not to have the literal text, but rather the text after replacing multi-character escape sequences like \n with their appropriate char value. So usually, the scanner will have to return several pieces of information from getNextToken().
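For instance, a token could be represented as a small struct bundling the id with that extra information. This is just a sketch (the names Token and makeNumToken, and the id values, are made up for illustration; bison handles this differently, as we'll see):

```cpp
#include <string>

// Hypothetical token ids, mirroring the NUM/OP example above.
const int NUM = 1;
const int OP  = 2;

// A token bundles the id with the extra information some tokens carry:
// the matched text, and (for NUM) the numeric value it defines.
struct Token {
    int id;            // which kind of token (NUM, OP, ...)
    std::string text;  // the characters that were lumped together
    int value;         // meaningful only when id == NUM
};

// Hypothetical helper a scanner might use to package up a number.
Token makeNumToken(const std::string& text) {
    return Token{NUM, text, std::stoi(text)};
}
```

So for the input 34 + 172, the first call would hand back something like Token{NUM, "34", 34} rather than just the id 1.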

Finite automata and scanning: hand-rolled approach

In theory of computing, you may have had a lab about using the parser generator bison to create a simple infix calculator from a grammar for arithmetic expressions. If so, you probably ignored scanning. Now we'll do this the other way round: we'll ignore parsing and focus on the scanning! Below is a file for the standard unix parser-generator "bison". We will get to bison later. For now let it suffice that bison can take this file (which, among other things, defines a grammar for simple arithmetic expressions) and produce a .cpp file for a program to parse strings in the language. In this case, it does more, because it not only parses but also evaluates the arithmetic expressions. The way bison works, it relies on someone or something else providing it with the definition of the function yylex(), which is bison's name for "getNextToken()". In other words, it needs to be provided with a scanner.
%{
// NOTE: I'm not expecting you to understand this file yet!
#include <iostream>
#include <string>
#include <cstdlib> //-- I need this for atoi
using namespace std;

//-- Lexer prototype required by bison, aka getNextToken()
int yylex(); 
int yyerror(const char *p) { cerr << "Error!" << endl; return 0; }
%}

//-- GRAMMAR SYMBOL DECLARATIONS
%union {
  int val; 
  char sym;
};
%token <val> NUM
%token <sym> OPA OPM LP RP STOP
%type  <val> exp term sfactor factor res

//-- GRAMMAR RULES 
%%
res: exp STOP { cout << $1 << endl; YYACCEPT; }

exp: exp OPA term     { $$ = ($2 == '+' ? $1 + $3 : $1 - $3); }
| term                { $$ = $1; }

term: term OPM factor { $$ = ($2 == '*' ? $1 * $3 : $1 / $3); }
| sfactor             { $$ = $1; }

sfactor: OPA factor   { $$ = ($1 == '+' ? $2 : -$2); }
| factor              { $$ = $1; }

factor: NUM           { $$ = $1; }
| LP exp RP           { $$ = $2; }

%%
//-- FUNCTION DEFINITIONS
int main()
{
  while(1) yyparse();
  return 0;
}

So, we need to provide a yylex() function that recognizes the tokens NUM, OPA (i.e. + and -), OPM (i.e. * and /), STOP (i.e. ;), LP (i.e. "(") and RP (")"). Oh, and it should also treat whitespace appropriately. Instead of sitting at the keyboard and trying to hack something together, we will construct a finite automaton that recognizes tokens and base our implementation on it. Note: the names NUM, OPA, OPM, STOP, LP, and RP are available to us as integer constants. Bison defines them automatically.

int yylex()
{
  bool found = false;
  int state = 0;
  string val = "";
  while(!found)
  {
    char c = cin.get();
    switch(state)
    {
    case 0:
      switch(c) {
      case '0': case '1': case '2': case '3': case '4':
      case '5': case '6': case '7': case '8': case '9':
	val += c; state = 1; break;
      case '+': case '-': val += c; state = 2; break;
      case '*': case '/': val += c; state = 3; break;
      case ';': val += c; state = 4; break;
      case '(': val += c; state = 5; break;
      case ')': val += c; state = 6; break;
      case ' ': case '\t': case '\n': break;
      case EOF: exit(0); break;
      default: found = true; }
      break;
    case 1:
      switch(c) {
      case '0':case '1':case '2':case '3':case '4':
      case '5':case '6':case '7':case '8': case '9':
	val += c; state = 1; break;
      default:
	cin.putback(c);
	found = true; }
      break;
    case 2: case 3: case 4: case 5: case 6:
      cin.putback(c);
      found = true;
      break;
    }
  }
  
  switch(state) {
  case 0: return 0; // ERROR
  case 1: yylval.val = atoi(val.c_str()); return NUM;
  case 2: yylval.sym = val[0];            return OPA;
  case 3: yylval.sym = val[0];            return OPM;
  case 4:                                 return STOP;
  case 5:                                 return LP;
  case 6:                                 return RP; }
  return 0; // unreachable: every state is handled above
}
The way we use this automaton is as follows: When yylex() is called, we repeatedly read a character and move to the next state, until the combination of the state we are in and the character we read has no transition (or we run out of characters). When that happens, we "unread" that last character so it will be waiting for next time. Then we look at our current state: if it isn't an accepting state, there's been an error. Otherwise, we look at the state's label, and that tells us what kind of token we've read in.

One thing that may require a bit of explanation: if you read a character with cin.get(), you can "unread" it with cin.putback(). That's a convenient trick here. However ... you can't putback more than one character. In other words, you can't do two cin.putback()s without having done at least one cin.get() in between them. The complete program is available to view or to download.
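The one-putback guarantee is easy to see in a toy example; here an istringstream stands in for cin, since it shares the same get()/putback() interface:

```cpp
#include <sstream>

// Read a character, unread it, and read again: with a single putback
// between reads, we are guaranteed to see the same character twice.
bool demoPutback() {
    std::istringstream in("ab");   // stands in for cin
    char c = (char)in.get();       // reads 'a'
    in.putback(c);                 // one putback: always safe
    return (char)in.get() == c;    // reads 'a' again
}
```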

Some interesting questions to ask at this point:

Maximal munch

Suppose we have ** for exponentiation and * for multiplication, with EXP and OPM as the associated tokens. How does the lexer decide whether to lex ** as EXP or OPM OPM? Or what about ***? How should that lex? The usual rule of thumb in language design is to follow the rule of "maximal munch": whichever token matches the most characters is the one you take. Thus, *** lexes to EXP OPM. Similarly, if "for" matches a LOOP token, then "formula" lexes as the ID token with value formula, not LOOP ID where the ID matches "mula". Maximal munch also matches our need for lexing 4.13 as a single REALNUM token rather than INT DOT INT or something similar.
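Maximal munch is easy to state as code. Here is a toy longest-match lexer over just the spellings * and ** (the function lexStars is made up for illustration; a real scanner would fold this into its automaton):

```cpp
#include <string>
#include <vector>

// Toy token kinds, following the EXP/OPM names from the text.
enum Tok { EXP, OPM };

// Maximal munch over the two spellings "**" and "*": always try the
// longer spelling first, and only fall back to the shorter one.
std::vector<Tok> lexStars(const std::string& s) {
    std::vector<Tok> out;
    size_t i = 0;
    while (i < s.size()) {
        if (s.compare(i, 2, "**") == 0) { out.push_back(EXP); i += 2; }
        else if (s[i] == '*')           { out.push_back(OPM); i += 1; }
        else break;  // anything else is outside this toy language
    }
    return out;
}
```

With this rule, "***" lexes to EXP OPM, and "**" lexes to a single EXP, just as the text describes.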

The "maximal munch" rule requires a little bit of care. From the implementation perspective we run the input characters through a DFA, collecting the characters as they are read in a buffer (the variable val in the above). We don't stop and emit a token just because we hit an accepting state. We stop when we hit a "missing" transition, i.e. a state/char combination for which there is no outgoing arrow, and rewind until we hit the last accepting state seen prior to the missing transition. (Note: in the example above, this is easy because everything other than the start state is an accepting state. That's not always the case.) This backing up requires that we put some characters back into the input stream --- that we pretend we hadn't read them after all. That is, in fact, what the putback function in the example is doing.

C++'s putback function is only guaranteed to work correctly for putting back a single character between reads. So, in general, we need to deal with this ourselves. We keep a buffer buff of unread characters, and when reading the next character, instead of doing

char c = cin.get();

... we check the buffer first:

if buff is empty, c = cin.get(); otherwise c is the next character from buff
Then, when we need to "putback" multiple characters, we stick them on the end of buff. Implementing this in a correct and robust way is a bit tricky. You need a circular, extensible buffer, which you should know about, but which is pretty easy to mess up.
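Here is a sketch of that idea, wrapping the buffer and the stream together. (The class name CharSource is made up, a plain vector used as a stack stands in for the circular buffer, and an istringstream stands in for cin so the example is self-contained.)

```cpp
#include <sstream>
#include <string>
#include <vector>

// A character source with unlimited putback: "buff" holds characters
// that were unread, and get() drains it before touching the stream.
class CharSource {
    std::istringstream in;
    std::vector<char> buff;        // characters that were "putback"
public:
    explicit CharSource(const std::string& s) : in(s) {}
    int get() {
        if (!buff.empty()) {       // check the buffer first
            char c = buff.back();
            buff.pop_back();
            return c;
        }
        return in.get();           // fall back to the real stream
    }
    void putback(char c) {         // works any number of times in a row
        buff.push_back(c);
    }
};
```

Note that since the buffer is a stack, characters must be putback in reverse order of how they were read, so that get() hands them back in the original order.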

Here's an example of a language construct that might require more than one "putback" operation: suppose that for some language "-" is a token, and "-->" is a token, but "--" is not. If we read "--@", what do we do?

Many languages get around this problem by defining their tokens so that every prefix of a valid token is itself a valid token. This actually guarantees you only ever need a single character's worth of "putback".

A multi-character putback example

Consider the FA below and how we would use it to tokenize: 234.foo93

After reading in 234, we'd be in state q1. Then we'd read in '.' and be in state q2. Then we'd read in 'f', and have no transition. So we'd back up to the last accepting state, which is q1, and putback both the 'f' we just read, and the '.' that had brought us to state q2 --- i.e. two characters of putback. The token we emit would be INT, since that's q1's label, and the associated input text would be 234.
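Under the assumption that the FA in the figure has an accepting INT state for digit strings and an accepting REAL state for digits '.' digits, the back-up-to-last-accept logic can be sketched like this. (The function nextNumber is made up for illustration; the "putback" is simulated by remembering the input position of the last accepting state and resetting to it.)

```cpp
#include <cctype>
#include <string>

enum TokKind { NONE, INT, REAL };

// What a lexing step hands back: the token kind, the matched text,
// and the position where the next token should start.
struct Lexed { TokKind kind; std::string text; size_t next; };

Lexed nextNumber(const std::string& s, size_t pos) {
    int state = 0;                 // q0: start state
    TokKind lastKind = NONE;       // label of last accepting state seen
    size_t lastEnd = pos;          // input position of that last accept
    size_t i = pos;
    while (i < s.size()) {
        char c = s[i];
        if (state == 0 && isdigit(c))      state = 1;  // q0 -> q1
        else if (state == 1 && isdigit(c)) state = 1;  // stay in q1
        else if (state == 1 && c == '.')   state = 2;  // q1 -> q2
        else if (state == 2 && isdigit(c)) state = 3;  // q2 -> q3
        else if (state == 3 && isdigit(c)) state = 3;  // stay in q3
        else break;                                    // missing transition
        ++i;
        if (state == 1) { lastKind = INT;  lastEnd = i; }  // q1 accepts INT
        if (state == 3) { lastKind = REAL; lastEnd = i; }  // q3 accepts REAL
    }
    // Rewind to the last accepting state: everything read past lastEnd
    // is effectively putback (for "234.foo93", the '.' and the 'f').
    return Lexed{lastKind, s.substr(pos, lastEnd - pos), lastEnd};
}
```

On the input 234.foo93 this walks q0, q1, q1, q1, q2, then hits 'f' with no transition, and rewinds to q1's position: the token is INT with text 234, and scanning resumes at the '.'.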

Handling lexical errors

Normally we like to handle errors in a compiler/interpreter in a reasonable way. What's reasonable? Well you should give an error message that helps the user track down the problem, and it's nice if you don't puke on the first error, but continue on and try to process the rest of the input. After all, if there are several errors it's nice to see messages for them all at once.

The general strategy for scanners is to throw away characters until you get something you're able to make a token out of. You should output an error message telling the user what got thrown away, of course, and if you're especially kind, you keep track of line numbers (and possibly the character position within a line) to help the user out. In our simple calculator example, each input is very small, usually just a single line, so just printing out the offensive characters along with an error message is probably sufficient. In fact, all we really need is to add an extra transition from state 0 back to itself for all "other" characters, with the associated action of printing an error message, and the effect will be what we want.


Christopher W Brown