**Reading**-
Sections 3.6 of
*Theory of Computing, A Gentle Introduction*. Notice that Lemma 3.6.2 is a little stronger than the Pumping Lemma we gave in class. We might call it "Version 2.0". **Homework**- Then printout the homework and answer the questions on that paper.

**Where we are**-
What we did the last two classes was this:
`CFGs ----> PDAs Note: this is what bison does!`

I.e. we gave an*algorithm*that converts a CFG to a PDA accepting the same language, and we*proved*that the algorithm met its specification --- that the PDA it constructs really accepts the same language as generated by the input grammar. Does this mean CFGs and PDAs are equivalent in terms of the languages they define? No! To prove that we would also need an algorithm to convert a PDA to a CFG generating the same language. Such an algorithm exists! So, in fact, we haveCFGs <----> PDAs

and thus we have a proof that CFGs and PDAs are*equivalent*, so that any language that can be defined with one formalism can also be defined with the other. These languages, the CFG/PDA-definable languages, we call*Context Free Languages*.We will not cover the algorithm for converting PDAs to CFGs. It would take at least another lecture, and time is limited. It's very important to know that the algorithm exists, since it finishes off the proof that CFGs and PDAs are equivalent, but the algorithm doesn't have the incredible practical value of the CFG-to-PDA conversion, which allows us to automaticall convert a

*specification*of a language (as a grammar) to a*program*that parses the language (as a PDA). **A Pumping Lemma for CFGs**-
The below roughly follows the example we used in class. I
apologize for the minor notational differences. One big
difference, in class the grammar we used was
S → TQC

which changed the parse tree by having a non-empty

T → BQ | b

Q → TA | a

A → a

B → b

C → aa*z*. If I have time I'll update the rest of these notes to reflect that change.We considered the grammar :

with start symbol and alphabet . Consider the parse tree for the string :Notice the variable that appears twice along one path from root to a leaf node. (The two occurrences are circled.) We can cut out and splice in the subtrees rooted at to form new trees that are still perfectly valid subtrees.

By constructing these new parse trees, we're really constructing new strings in the language! You'll notice that in our original parse tree, we broke up into , where , , , and . What the new parse trees show is that not only is in , but also and . Of course, we could keep splicing in more and more copies of the larger subtree in place of the smaller, and show that for any . We get a whole infinite family of strings in , specifically .

**Towards a pumping lemma for CFGs**-
All of the previous discussion should look suspiciously like
the puming lemma for regular languages. If we were to write
it up pumping lemma style, we'd get something like this:
Let G = (Σ,NT,R,S) be a CFG and let

The two parts highlighted in red are problematic. The second item is a problem because we have no way (right now!) to guaruantee it, and if both*u*be a string in L(G). If a parse tree for*u*has repeated symbols along a path from the root to some leaf then there are strings*v,w,x,y,z*such that- u = vwxyz
- at least one of
*w, y*is not λ - vw
^{k}xy^{k}z in L(G) for all k ≥ 0.

*w*and*y*are λ pumping doesn't yield new strings and the whole exercise is pointless. We'll deal with that problem later. The first highlighted section is problematic we actually have to have a parse tree for string*u*before we can apply this puming lemma-like result. To be a good analogue of the pumping lemma for regular lanuges, we need the condition to be replaced with "*u*is long enough" ... whatever "long enough" is. So we need to find a condition on the length of*u*that would guarantee that a parse tree for*u*has repeated symbols along a path from the root to some leaf. Then we'd have ourselves a real pumping lemma for CFGs.What property of a parse tree would gaurantee the presence of repeated symbols along a path from the root to some leaf? If the depth of the tree is greater than |NT| then, by the definition of "depth" for trees, there is a path from root to a leaf of length at least |NT|+1. This path has |NT|+1 interior nodes, each labeled with a non-terminal. Since there are only |NT| non-terminals, some non-terminal appears at two nodes along the path (by the good old pigeonhole principle) and there we are. So if

*d*stands for the depth of the tree, we need to find a bound on |*u*| that guarantees that|NT| + 1 ≤

It's not immediately clear how to tie the length of*d**u*to the*depth*of the parse tree, but it's pretty clear how to tie it to the*number of leaves*in the parse tree:|

Then the question becomes this: can we connect the depth and the number of leaves? In class we discussed this in some detail, and we determined that the number of leaves in a parse tree of depth*u*| ≤ # of leaves*d*must be less than or equal to the number of leaves in a complete tree of depth*d*where each non-leaf has the most children the grammar allows, which is given by the length of the longest right-hand-side of any rule, i.e. max{ |*t*|**|***(A,t) ∈ R*}. A complete tree with branching factor*B*and depth*d*has*B*^{d}leaves, so we get# of leaves ≤ max{ |

Tying it all together we get*t*|**|***(A,t) ∈ R*}^{d}|

Now what bound on |*u*| ≤ # of leaves ≤ max{ |*t*|**|***(A,t) ∈ R*}^{d}*u*| would guarantee |NT| + 1 ≤*d*? Clearly this does it:max{ |

So, the bound on*t*|**|***(A,t) ∈ R*}^{|NT| + 1}≤ |*u*| ≤ # of leaves ≤ max{ |*t*|**|***(A,t) ∈ R*}^{d}*u*we need is |*u*| ≥ max{ |*t*|**|***(A,t) ∈ R*}^{|NT| + 1}. **A pumping lemma for CFGs Version 1.0**-
The above discussion gives us
this pumping lemma version 1.0 for Context free grammars:
Let $G = (\Sigma,NT,R,S)$ be a CFG and let $u$ be a non-empty string in $L(G)$. If $u \geq \text{max}\{ |t| \mid (A,t) \in R \}^{|NT|+1}$ then there are strings $v,w,x,y,z$ such that

- $u = vwxyz$
- at least one of
*w, y*is not λ - $vx^kxy^kz \in L(G)$ for all $k \geq 0$.

**Coda:**The only thing we haven't justified is how we know that*u*and*y*aren't both the empty string. The short answer is ... maybe they are. On the other hand, if both*u*and*y*were empty, we could simply replace the "big" subtree by the "little" subtree (see the figures above) and we'd have a parse tree for the same string*w*, but with fewer nodes. If you do this every time you find your repeated non-terminal symbol along a path from the root to a leaf that has both*u*and*y*empty, the process has to stop eventually. After all, the string*w*stays the same, but the number of nodes in the parse tree decreases at each step. Since the number of nodes can't go all the way down to zero and still produce a non-empty string! So, eventually you'll find repeated non-terminal symbols that yield the decomposition*w = uvxyz*where not both*u*and*y*are the empty string.

Christopher W Brown Last modified: Mon Nov 16 09:08:34 EST 2009