SI340: Parse Trees & Ambiguous Grammars

Technical definition of the language defined by a grammar

To give a technical definition of the language derived from some grammar, we need to define a single derivation step.

Definition: Let $G = (\Sigma,NT,R,S)$ be a context free grammar. For strings $u,v\in (NT\cup\Sigma)^*$, we say $u$ yields in one step in grammar $G$ $v$ (denoted $u \Rightarrow_G v$) if \[ u = yXz \text{ and } v = ywz \text{, where $y,z,w \in (NT\cup\Sigma)^*$ and $(X,w) \in R$} \] This is also sometimes expressed as $v$ can be derived in one step from $u$ in grammar $G$.

Just as with DFAs and NDFAs, it is helpful to define a special notation for derivable in "zero or more steps".

Definition: Let $G = (\Sigma,NT,R,S)$ be a context free grammar. For strings $u,v\in (NT\cup\Sigma)^*$, we say $u$ yields in zero or more steps in grammar $G$ $v$ (denoted $u \Rightarrow_G^* v$) if $u = v$ or if for some strings $w_1,w_2,\ldots,w_k$ we have: \[ u = w_1 \Rightarrow_G w_2 \Rightarrow_G \cdots \Rightarrow_G w_k = v \] This is also sometimes expressed as $v$ can be derived from $u$ in grammar $G$.

With these notions in hand, defining the language generated by a grammar is natural.

Definition: Let $G = (\Sigma,NT,R,S)$ be a context free grammar. The language generated by grammar $G$ (denoted $L(G)$) is $\{ w \in \Sigma^*\ |\ S \Rightarrow_G^* w\}$.

Note: The class of languages that are generated by context free grammars are called the Context Free Languages.

A quick grammar short-hand

It becomes tiresome to write grammars like

S → aSa
S → bSs
S → a
S → b
S → λ

in which many rules share the same left-hand side. So we adopt the short-hand convention that we can list many right-hand sides for the same left-hand side all on one line by separating them with the "|" character, like this:

S → aSa | bSs | a | b | λ

This is just a short-hand notation. The |'s are not terminal characters, in fact they're not really in the grammar at all. It's just a short-hand to save us bother and screen/paper space.

An Example & Parse Trees

We wrote a grammar for postfix expressions in the variable v. Here's a solution:

S → SSO | v
O → + | - | * | /

We saw that for a string like v v + v v - * there are lots of different derivations, all corresponding to the many choices of which particular non-terminal to substitute for at any step. Parse Trees allow us to represent the derivation of a string in a grammar without distinguish the order in which non-terminals are substituted. Thus, one and the same parse tree describes many derivations.

A Parse tree's structure

The parse tree for the previous example not only allowed us to examine a derivation without worrying about order, it also showed us the meaning of the postfix expression v v + v v - *, i.e. showed us how that expression was to be evaluated. This can be helpful if you're not familiar with postfix. So, for example, consider the expression 3 5 + 6 2 - *. If you annotate the v leaf nodes in the above parse tree with 3,5,6,2 going from left to right, you can follow the parse tree's structure and evaluate.

Exercise: Draw a parse tree for v v v v * + - using the above grammar and use it to evaluate 5 6 4 9 * + -.

Now, this parse tree tells you exactly how the expression is to be evaluated! One of the most important aspect of parse trees as far as programming languages are concerned is that their structure describes meaning.

Important: in computer science we often rely on the structure of a grammar's parse trees to provide meaning.

Ambiguity

In C++ we have expressions like cout << "the" << 3. Suppose we want to model such expressions with a grammar. Instead of worrying about concrete values, we'll just worry about some basic object types, like ostream, string, and int.

S → S << S | obj
obj → ostream | string | int

This grammar allows us to construct strings like ostream << string << int.

Exercise: Draw a parse tree for ostream << string << int.

Guess what: There's two valid parse trees for this string! Now that we're relying on parse trees to give meaning to strings, this is a real problem. We can't have two different meanings for a string. Then the meaining is abiguous, we won't know what to do! If a string has more than one parse tree in a given grammar, that grammar is called ambiguous. C++ uses the parse tree corresponding to (ostream << string) << int, a fact that is dictated by <<'s left-to-right associativity. We can rewrite the grammar in a way that encodes this associativity, and remove the ambiguity.

S → S << obj | obj
obj → ostream | string | int

It'd be nice if we could always do this, but it turns out that there are lagnuages out there that are inherently ambiguous, meaning that any grammar for the language will have strings with more than one parse tree. Inherent ambiguity would be quite a problem in a programming language!

Definition: A grammar is called ambiguous if there is a string for which there is more than one parse tree in the grammar.

Example with arithmetic expressions in infix

Most people don't like postfix (I don't know why!) and prefer our usual infix notation for arithmetic. So, let's support these knuckleheads and write a grammar for infix expressions on the values 2 and 3 using + and *.

S → S + S | S * S | 2 | 3

Now, let's write a parse tree for 2 + 3 * 3. Problem! There are two possible parse trees, and they lead to different answers when we evaluate the expression! Our grammar is ambiguous! That's a real problem, since this implies that the meaning of the expression is ambiguous!

We could try to write the grammar in a different way so that it generates the same strings, but without the ambiguity.

S → S + T | T
T → T * N | N
N → 2 | 3

This grammar is unambiguous (not that that's obvious) so there is one and only one parse tree for 2 + 3 * 3 and what's more, it matches our precedence rules! So that's good.

However, if you play around with this grammar, you'll find that there's no way to express that you'd like the addition "2 + 3" to happen before the multiplication "*3". In fact, to adequately express arithmetic in infix, you need parentheses. And that is one of the big reasons why postfix really is a superior language for an arithmetic calculator!

S → S + T | T
T → T * N | N
N → (S) | 2 | 3

Dangling Else

The "dangling else" problem is a classic example of ambigious grammar problems in real-world programming languages. Read about it in the book (p. 78), or read the Wikipedia article.