stdin, the fundmental operation is something
like getNextChar(), i.e. a function that returns
to the caller the next character from the input stream. For
scanning, the fundamental operation is something like
getNextToken(), i.e. a function that returns the
next token from the input stream. However, what type of
object should "a token" actually be? Usually, you give a
constant int value to each token, and that id is
certainly returned.
So, for example, you might say that there is a token NUM (for
numbers) that has id 1, and a token OP (for arithmetic
operators) that has id 2. Given input 34 + 172,
the first call to get the next token would produce 1, the second
call would produce 2, and the third call would produce 1, and
you would know that you had a NUM then OP then NUM.
However, for at least some tokens you
need more than that id. For example, a NUM token
needs the actual charactaers that were lumped together as a
NUM, or at least the int or
double they define. For a string constant
expression, you may prefer not to have the literal text, but
rather the text after replacing multi-character escape sequences like
\n with their appropriate char
value. So usually, the scanner will have to return several
pieces of information from getNextToken().
bison to create a simple
infix calculator from a grammar for arithmetic expressions.
If so, you probably ignored scanning.
Now we'll do this the other way round: we'll ignore parsing and
focus on the scanning!
Below is a file for the standard unix parser-generator "bison".
We will get to bison later. For now let it suffice that bison
can take this file
(which, among other things, defines a grammer
for simple arithmetic expressions)
and produce a .cpp file for a program to parse strings in the
language. In this case, it does more, because it not only
parses but also evaluates the arithmetic expressions.
The way bison works, it relies on someone or something else
providing it with the definition
of the function yylex(), which is bison's name for
"getNextToken()". In otherwords, it needs to be provided with a
scanner.
%{
// NOTE: I'm not expecting you to understand this file yet!
#include <iostream>
#include <string>
#include <cstdlib> //-- I need this for atoi
using namespace std;
//-- Lexer prototype required by bison, aka getNextToken()
int yylex();
int yyerror(const char *p) { cerr << "Error!" << endl; }
%}
//-- GRAMMAR SYMBOL DECLARATIONS
%union {
int val;
char sym;
};
%token <val> NUM
%token <sym> OPA OPM LP RP STOP
%type <val> exp term sfactor factor res
//-- GRAMMAR RULES
%%
res: exp STOP { cout << $1 << endl; YYACCEPT; }
exp: exp OPA term { $$ = ($2 == '+' ? $1 + $3 : $1 - $3); }
| term { $$ = $1; }
term: term OPM factor { $$ = ($2 == '*' ? $1 * $3 : $1 / $3); }
| sfactor { $$ = $1; }
sfactor: OPA factor { $$ = ($1 == '+' ? $2 : -$2); }
| factor { $$ = $1; }
factor: NUM { $$ = $1; }
| LP exp RP { $$ = $2; }
%%
//-- FUNCTION DEFINITIONS
int main()
{
while(1) yyparse();
return 0;
}
So, we need to provide a yylex() function that recognizes the tokens NUM, OPA (i.e. + and -), OPM (i.e. * and /), STOP (i.e. ;), LP (i.e. "(") and RP (")"). Oh, and it should also treat whitespace appropriately. Instead of sitting at the keyboard and trying to hack something together, we willx construct a finite automaton that recognizes tokens and base our implementation on it. Note: the names NUM, OPA, OPM, STOP, LP, and RP are available to us as integer constants. Bison defines them automatically.
![]() |
int yylex()
{
bool found = false;
int state = 0;
string val = "";
while(!found)
{
char c = cin.get();
switch(state)
{
case 0:
switch(c) {
case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':
val += c; state = 1; break;
case '+': case '-': val += c; state = 2; break;
case '*': case '/': val += c; state = 3; break;
case ';': val += c; state = 4; break;
case '(': val += c; state = 5; break;
case ')': val += c; state = 6; break;
case ' ': case '\t': case '\n': break;
case EOF: exit(0); break;
default: found = true; }
break;
case 1:
switch(c) {
case '0':case '1':case '2':case '3':case '4':
case '5':case '6':case '7':case '8': case '9':
val += c; state = 1; break;
default:
cin.putback(c);
found = true; }
break;
case 2: case 3: case 4: case 5: case 6:
cin.putback(c);
found = true;
break;
}
}
switch(state) {
case 0: return 0; // ERROR
case 1: yylval.val = atoi(val.c_str()); return NUM;
case 2: yylval.sym = val[0]; return OPA;
case 3: yylval.sym = val[0]; return OPM;
case 4: return STOP;
case 5: return LP;
case 6: return RP; }
}
|
One thing that may require a bit of explanation: if you read a
character with cin.get(), you can "unread it"
with cin.put_back(). That's a kind of convenient
trick here.
However ... you can't "putback" more than one
character. In other words, you can't do two cin.put_back()s
without having done at least one cin.get() in between them.
Here's the complete program:
to view or
to download.
Some interesting questions to ask at this point:
The "maximal munch" rule requires a little bit of care.
From the implementation perspective we run the input
characters through a DFA, collecting the characters as they
are read in a buffer (the variable val in the
above). We don't stop and emit a token just because we hit
an accepting state. We stop when we hit a "missing"
transition, i.e. a state/char combination for which there is
no outgoing arrow, and rewind until we hit the last
accepting state seen prior to the missing transition.
(Note: in the example above, this is easy because everything
other than the start state is an accepting state. That's
not always the case.)
This backing up requires that we put some characters back
into the input stream --- that we pretend we hadn't read
them after all. That is, in fact, what the
putback function in the example is doing.
C++'s putback function is only guaranteed to
work correctly for putting-back a single charcter between
reads. So, in general, we need to deal with this
ourselves. We keep a buffer buff
of unread characters, and when reading the next
character, instead of doing
char c = cin.get();... we check the buffer first.
if buff empty, c = cin.get(), otherwise c is next char from buffThen, when we need to "putback" multiple characters, we stick them on the end of
buff. Implementing this in a
correct and robust way is a bit tricky. You need a circular,
extensible buffer, which you should know about, but which is
pretty easy to mess up.
Here's an example of a lanugage construct that might require
more than one "putback" operation:
suppose that for some language "-" is a token,
and "-->" is a token, but "--" is not.
If we read "--@", what do we do?
Many languages get around this problem by defining their tokens so that every prefix of a valid token is itself a valid token. This actually guarantees you only ever need a single character's worth of "putback".
234.foo93

The general strategy for scanners is to throw away characters until you get something you're able to make a token out of. You should output an error message telling the user what got thrown away, of course, and if you're especially kind, you keep track of line numbers (and possible character position within a line) to help the user out. In our simple calculator example, each input is very small, usually just a singe line, so just printing out the offensive characters along with an error message is probably sufficient. In fact, all we really need is to add an extra transition from state 0 back to itself for all "other" characters, with the associated action of printing an error message, and the effect will be what we want.