SI413: Specifying Syntax

Specifying programming languages: syntax vs. semantics

In defining or specifying a programming language, we generally distinguish between syntax and semantics. The syntax of a programming language describes which strings of of characters comprise a valid program. The semantics of a programming language describes what syntactically valid programs mean, what they do. In the larger world of linguistics, syntax is about the form of language, semantics about meaning. Linguistic giant Noam Chomsky gave this nice sentence:

Colorless green ideas sleep furiously.

... as an example of something that is syntactically correct, but without semantic value, i.e. meaningless even though it satisfies all our grammatical rules. It's a nice example to keep in mind, because it shows that semantics and syntax are really different.

In programming languages our expectations of what certain symbols should mean, like "+" or "if" may make us lose sight of the fact that both syntax and semantics really need to be defined to define a language. However, there are places where this becomes pretty clear. For example, in a C/C++ program the following is syntactically correct:

int x;
x = 2^3;

But what does it mean? I.e. what value does ther variable x end up with? Not eight, in fact, but one! You see, although we know it's syntactically correct, we don't know its semantics. Meaning must be defined as well as form. (BTW: "^" is bitwise xor in C/C++.) It's not just a question of spelling out what each keyword or operator symbol means, meaning of groups of symbols must also be defined. For example, what about the following code fragment:

void foo(int x, int y, int z)
{ 
  if (x < y < z)

In Java, this is actually syntactically incorrect. However, in C/C++ it is correct. But what does it mean? What are the semantics here? Just knowing what < means isn't enough to tell you, actually. You have to know <'s associativity, and you've got to know what type < returns ... and you even have to know about C/C++ type automatic type promotion rules are, and what true and false really are. And even then, it certainly doesn't mean what a newbie would expect. Syntax tells you whether or not x < y < z is valid in the language - and this depends on the language, as we've seen - but you need the semantics to know what the expression actually means.

The book looks first at specifying syntax, then later at specifying semantics. In fact, however, the delineation is not that clear cut. We act as if, roughly speaking, a grammar specifies the syntax of a language, and something else specifies the semantics. However, the grammar for a language does a bit of both, but certainly not all of either. You know this already from Theory of Computing, where we talked about ambiguity in grammars and how it was undesirable in applications. For example:

exp -> exp op exp | NUM
 op -> + | - | * | /

is a grammar that correctly defines the syntax of proper arithmetic expressions involving numbers, assuming NUM is properly defined. But it's ambiguous, and we went on to define a nice unambiguous grammar that met our expectations for associativity and precedence of arithmetic operators. But that's not neccessary if we're really only interested in specifying syntax. The grammar above tells you exactly which strings of symbols are valid arithmetic expressions ... associativity and precedence are issues of semantics, just as is the question of whether 5/2 gives 2.5 or 2.

In fact, it's not really possible in any practical sense to completely separate syntax and semantics, because the validity of an expression like a < b < c may depend (as it does in Java) on the meaning of the < operator, in terms of both types and associativity. None-the-less, it's useful to think of syntax and semantics as separate concerns, as the book does. We just tend to lump the semantic elements that are defined through things like the language's grammar as being part of the language's syntax.

Guess what? It's grammars and regular expressions!

We want to specify syntax so that programmers can write valid programs, and also so compiler-writers will produce compilers that can cope with the full range of valid programs people produce. In this sense, constructs that can provide both the language definition and the mechanisms for processing programs within compilers are golden. We already know two such constructs: regular expressions and context free grammars, and they are the foundation of the theory and practice of programming languages. Basically we use regular expressions to define the tokens of a language, which are sort of the atomic building blocks, and context free grammars to define how the tokens get combined to form valid programs. Traditionally, grammars for programming languages are given in EBNF --- Extended Backus-Nauer Form. This is just like the form we wrote our grammers in for Theory of Computing, except Kleene star and Kleene plus are allowed, along with ( )'s for grouping. This doesn't change the expressive power of the language, but it does make somethings a bit easier. For instance, we can express list of statments as:

stmt-list -> stmt+

instead of

stmt-list -> stmt | stmt-list stmt

With an unambiguous grammar that's designed to accomodate the semantics of your language, a parse tree for a program is a data-structure that represents the program in a very useful way. Thus, parsing is a huge step in compiling or interpreting a program. However, a true parse tree may have many nodes and edges that are artifacts of how the grammar encodes properties like precedence or deal with symbols like ( )'s or ;'s that are actually unneccesary once the parse is complete. Thus we prefer an abstract syntax tree to a true parse tree, which is kind of like a simplified parse tree. Here's a simple example. Consider the following grammar, which defines basic arithmetic expressions, complete with associativity, precedence, unary as well as binary "-", and ( )'s.

    exp -> exp OPA term | term
   term -> term OPM factor | sfactor  <-- NOTE: "sfactor" stands for "signed-factor", it handles unary "-"
sfactor -> OPA factor | factor
 factor -> NUM | LP exp RP

where OPA = +|-, OPM = *|/, LP = (, RP = ), and NUM = digit+.

Compare first the true parse tree, then an abstract syntax tree for the expression: -(131 + 14*872) - 3*(-8 + 4/122):

Hopefully this illustrates the difference between the parse tree, and an abstract syntax tree. So, our goal is to use regular expressions and grammars to define as much as possible of the syntax of our programming language, and use those to produce code to read in program text and produce abstract syntax trees.

Review Section 1.6: "An Overview of Compilation"

The process of compiling (or interpreting) a program normally progresses in phases. Section 1.6 of the book gives a nice overview of those phases, which are

lexical analysis (scanning)
parsing
semantic analysis / intermediate code generation
machine independent optimization [optional]
target code generation
machine-specific optimization [optional]

The basic picture you want to keep in mind is this:

	         scanning               parsing
character stream --------> token stream -------> parse tree -----> abstract syntax tree

Scanning transforms a stream of characters into a stream of tokens. Parsing transforms a stream of tokens into a parse tree. Although we keep this picture in mind, it turns out that in practice we usually convert the parse tree into an abstract syntax tree one the fly, i.e. as the parsing happens. In this case, we would never explicitly build the parse tree as a structure. None-the-less, it's there behind the scenes implicitly.

Reading

Homework

Specifying programming languages: syntax vs. semantics

Guess what? It's grammars and regular expressions!

Review Section 1.6: "An Overview of Compilation"