Class 30: A Pumping Lemma for CFGs

Read Section 3.6 of Theory of Computing, A Gentle Introduction. Notice that Lemma 3.6.2 is a little stronger than the Pumping Lemma we gave in class. We might call it "Version 2.0".
Then print out the homework and answer the questions on that paper.

Where we are
What we did the last two classes was this:
	  CFGs ----> PDAs  Note: this is what bison does!
I.e. we gave an algorithm that converts a CFG to a PDA accepting the same language, and we proved that the algorithm met its specification --- that the PDA it constructs really accepts the same language as generated by the input grammar. Does this mean CFGs and PDAs are equivalent in terms of the languages they define? No! To prove that we would also need an algorithm to convert a PDA to a CFG generating the same language. Such an algorithm exists! So, in fact, we have
	  CFGs <----> PDAs
and thus we have a proof that CFGs and PDAs are equivalent, so that any language that can be defined with one formalism can also be defined with the other. These languages, the CFG/PDA-definable languages, we call Context Free Languages.

We will not cover the algorithm for converting PDAs to CFGs. It would take at least another lecture, and time is limited. It's very important to know that the algorithm exists, since it finishes off the proof that CFGs and PDAs are equivalent, but the algorithm doesn't have the incredible practical value of the CFG-to-PDA conversion, which allows us to automatically convert a specification of a language (as a grammar) to a program that parses the language (as a PDA).
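To make the CFG-to-PDA direction a little more concrete, here is a minimal sketch in Python of simulating the PDA that the standard construction builds from a grammar. (The grammar encoding, the name `cfg_accepts`, and the breadth-first simulation are my own illustrative choices, not the book's exact construction; the grammar is the one we used in class.)

```python
from collections import deque

# The class grammar: T -> BQ | b, Q -> TA | a, A -> a, B -> b, C -> aa.
# Uppercase letters are non-terminals, lowercase letters are terminals.
RULES = {"T": ["BQ", "b"], "Q": ["TA", "a"], "A": ["a"], "B": ["b"], "C": ["aa"]}

def cfg_accepts(rules, start, s):
    """Breadth-first search over configurations (stack, input position) of the
    PDA built by the CFG-to-PDA construction: each move either expands the top
    non-terminal by some rule or matches the top terminal against the next
    input character.  Assumes no lambda-rules, so a stack longer than the
    unread input can never lead to acceptance and is pruned."""
    seen, queue = set(), deque([(start, 0)])
    while queue:
        stack, i = queue.popleft()
        if (stack, i) in seen or len(stack) > len(s) - i:
            continue
        seen.add((stack, i))
        if stack == "":
            if i == len(s):
                return True          # empty stack, all input consumed: accept
            continue
        top, rest = stack[0], stack[1:]
        if top.isupper():            # expand the non-terminal on top
            for rhs in rules.get(top, []):
                queue.append((rhs + rest, i))
        elif i < len(s) and s[i] == top:   # match the terminal on top
            queue.append((rest, i + 1))
    return False
```

For example, `cfg_accepts(RULES, "T", "ba")` follows the derivation T ⇒ BQ ⇒ bQ ⇒ ba, while strings outside the language exhaust the search and are rejected.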

A Pumping Lemma for CFGs
The below roughly follows the example we used in class. I apologize for the minor notational differences. One big difference: in class, the grammar we used was
T → BQ | b
Q → TA | a
A → a
B → b
C → aa
which changed the parse tree by having a non-empty z. If I have time I'll update the rest of these notes to reflect that change.

We considered the grammar $G$:

\begin{displaymath} \begin{array}{l}
S \longrightarrow TQ\\
T \longrightarrow \ldots\\
A \longrightarrow a\\
B \longrightarrow b\\
\end{array} \end{displaymath}

with start symbol $S$ and alphabet $\{a,b\}$. Consider the parse tree for the string $babaa$:

[Figure ParseTree0.eps: the parse tree for the string babaa.]

Notice the variable $Q$ that appears twice along one path from the root to a leaf node. (The two occurrences are circled.) We can cut out and splice in the subtrees rooted at $Q$ to form new trees that are still perfectly valid parse trees.

[Figure ParseTree1.eps: the new parse trees formed by splicing the subtrees rooted at Q.]

By constructing these new parse trees, we're really constructing new strings in the language! You'll notice that in our original parse tree, we broke $babaa$ up into $vwxy$, where $v = ba$, $w = b$, $x = a$, and $y = a$. What the new parse trees show is that not only is $vwxy \in L(G)$, but also $vx \in L(G)$ and $vw^2xy^2 \in L(G)$. Of course, we could keep splicing in more and more copies of the larger subtree in place of the smaller, and show that $vw^mxy^m \in L(G)$ for any $m \in \mathbb{N}$. We get a whole infinite family of strings in $L(G)$, specifically $ba(b^m)a(a^m) \in L(G)$ for every $m$.
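The cut-and-splice operation is easy to mimic in code: pumping just repeats $w$ and $y$ the same number of times. A small sketch (the helper name `pump` is my own), using the decomposition of $babaa$ from above:

```python
def pump(v, w, x, y, z, m):
    """Return v w^m x y^m z, the m-fold pumped version of u = vwxyz."""
    return v + w * m + x + y * m + z

# The decomposition of babaa used above: v=ba, w=b, x=a, y=a, z=lambda.
v, w, x, y, z = "ba", "b", "a", "a", ""
family = [pump(v, w, x, y, z, m) for m in range(4)]
# The strings ba b^m a a^m for m = 0, 1, 2, 3.
```

Here `pump(v, w, x, y, z, 1)` gives back the original string $babaa$, and other values of `m` give the rest of the infinite family.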

Towards a pumping lemma for CFGs
All of the previous discussion should look suspiciously like the pumping lemma for regular languages. If we were to write it up pumping-lemma style, we'd get something like this:
Let G = (Σ,NT,R,S) be a CFG and let u be a string in L(G). If a parse tree for u has repeated symbols along a path from the root to some leaf then there are strings v,w,x,y,z such that
  • u = vwxyz
  • at least one of w, y is not λ
  • vw^kxy^kz ∈ L(G) for all k ≥ 0.
Two parts of this statement are problematic. The second item is a problem because we have no way (right now!) to guarantee it, and if both w and y are λ, pumping doesn't yield new strings and the whole exercise is pointless. We'll deal with that problem later. The hypothesis is problematic because we actually have to have a parse tree for the string u before we can apply this pumping-lemma-like result. To be a good analogue of the pumping lemma for regular languages, we need that condition to be replaced with "u is long enough" ... whatever "long enough" is. So we need to find a condition on the length of u that would guarantee that a parse tree for u has repeated symbols along a path from the root to some leaf. Then we'd have ourselves a real pumping lemma for CFGs.

What property of a parse tree would guarantee the presence of repeated symbols along a path from the root to some leaf? If the depth of the tree is greater than |NT|, then, by the definition of "depth" for trees, there is a path from the root to a leaf of length at least |NT|+1. This path has |NT|+1 interior nodes, each labeled with a non-terminal. Since there are only |NT| non-terminals, some non-terminal appears at two nodes along the path (by the good old pigeonhole principle), and there we are. So if d stands for the depth of the tree, we need to find a bound on |u| that guarantees that
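The pigeonhole argument is also easy to see in code: represent a parse tree as nested `(symbol, children)` pairs and search root-to-leaf paths for a repeated non-terminal. A sketch (the tree encoding and the function name are my own illustrative choices):

```python
def repeated_on_path(tree, path=()):
    """Search a parse tree (node = (symbol, children), leaf = (terminal, []))
    for a root-to-leaf path on which some non-terminal label repeats.
    Returns the labels along that path, or None if no path has a repeat."""
    sym, children = tree
    if not children:            # leaf: labeled by a terminal, can't repeat
        return None
    if sym in path:             # this non-terminal already occurred above us
        return path + (sym,)
    for child in children:
        hit = repeated_on_path(child, path + (sym,))
        if hit is not None:
            return hit
    return None

# Parse tree of bba in the class grammar (T -> BQ | b, Q -> TA | a, ...):
# the non-terminal T repeats along the path T, Q, T.
tree = ("T", [("B", [("b", [])]),
              ("Q", [("T", [("b", [])]),
                     ("A", [("a", [])])])])
```

On this example `repeated_on_path(tree)` finds the repeated T on the path T, Q, T, exactly the repetition the pigeonhole principle promises once the tree is deep enough.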

|NT| + 1 ≤ d
It's not immediately clear how to tie the length of u to the depth of the parse tree, but it's pretty clear how to tie it to the number of leaves in the parse tree:
|u| ≤ # of leaves
Then the question becomes this: can we connect the depth and the number of leaves? In class we discussed this in some detail, and we determined that the number of leaves in a parse tree of depth d must be less than or equal to the number of leaves in a complete tree of depth d where each non-leaf has the most children the grammar allows, which is given by the length of the longest right-hand-side of any rule, i.e. max{ |t| | (A,t) ∈ R }. A complete tree with branching factor B and depth d has B^d leaves, so we get
# of leaves ≤ max{ |t| | (A,t) ∈ R }^d
Tying it all together we get
|u| ≤ # of leaves ≤ max{ |t| | (A,t) ∈ R }^d
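This chain of inequalities can be checked on a concrete tree. A small sketch (the tree encoding and helper names are my own), using a parse tree of bba in the class grammar, whose longest right-hand side has length B = 2:

```python
def depth(tree):
    """Depth of a parse tree: edges on the longest root-to-leaf path."""
    sym, kids = tree
    return 0 if not kids else 1 + max(depth(k) for k in kids)

def leaf_count(tree):
    """Number of leaves, i.e. the length |u| of the string the tree yields."""
    sym, kids = tree
    return 1 if not kids else sum(leaf_count(k) for k in kids)

# Parse tree of bba in the class grammar (T -> BQ | b, Q -> TA | a, ...).
tree = ("T", [("B", [("b", [])]),
              ("Q", [("T", [("b", [])]),
                     ("A", [("a", [])])])])
B = 2   # length of the longest right-hand side, max{ |t| | (A,t) in R }
# The chain |u| <= #leaves <= B^d holds here: 3 <= 3 <= 2**3.
```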
Now what bound on |u| would guarantee |NT| + 1 ≤ d? Clearly this does it:
max{ |t| | (A,t) ∈ R }^(|NT| + 1) ≤ |u| ≤ # of leaves ≤ max{ |t| | (A,t) ∈ R }^d
So, the bound on u we need is |u| ≥ max{ |t| | (A,t) ∈ R }^(|NT| + 1).
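For a specific grammar this constant is easy to compute. A sketch (the function name `pumping_constant` is my own; the grammar is the one from class):

```python
def pumping_constant(rules, num_nonterminals):
    """The bound derived above: (longest right-hand side) ** (|NT| + 1).
    Any string in L(G) at least this long forces a parse tree deep enough
    to repeat a non-terminal along some root-to-leaf path."""
    longest_rhs = max(len(rhs) for rhss in rules.values() for rhs in rhss)
    return longest_rhs ** (num_nonterminals + 1)

# The class grammar: T -> BQ | b, Q -> TA | a, A -> a, B -> b, C -> aa.
RULES = {"T": ["BQ", "b"], "Q": ["TA", "a"], "A": ["a"], "B": ["b"], "C": ["aa"]}
# Longest right-hand side is 2 and |NT| = 5, so the bound is 2**6 = 64.
```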

A pumping lemma for CFGs Version 1.0
The above discussion gives us this pumping lemma version 1.0 for Context free grammars:
Let G = (Σ,NT,R,S) be a CFG and let u be a non-empty string in L(G). If |u| ≥ max{ |t| | (A,t) ∈ R }^(|NT| + 1) then there are strings v,w,x,y,z such that
  • u = vwxyz
  • at least one of w, y is not λ
  • vw^kxy^kz ∈ L(G) for all k ≥ 0.

Coda: The only thing we haven't justified is how we know that w and y aren't both the empty string. The short answer is ... maybe they are. On the other hand, if both w and y were empty, we could simply replace the "big" subtree by the "little" subtree (see the figures above) and we'd have a parse tree for the same string u, but with fewer nodes. If you do this every time you find a repeated non-terminal symbol along a path from the root to a leaf for which both w and y are empty, the process has to stop eventually. After all, the string u stays the same, but the number of nodes in the parse tree decreases at each step, and the number of nodes can't go all the way down to zero while still producing a non-empty string. So, eventually you'll find repeated non-terminal symbols that yield a decomposition u = vwxyz in which w and y are not both empty.

Christopher W Brown
Last modified: Mon Nov 16 09:08:34 EST 2009