Problem: Longest Common Subsequence
Given: X and Y sequences of length m and n respectively
Produce: sequence Z that is a subsequence of both X and Y, and
has maximum length among all common subsequences
We'll talk about these as subsequences of strings (sequences of
characters), since the
visualizations are easier this way, but we accept arbitrary
objects, as long as there is an equality test.
So, for example,
LCS(holiday,echoed) = hodSo could we try to make a greedy algorithm for this? Well, there is a way to phrase solving the problem as a sequence of "moves". Here are the possible moves:
add&chop - add last element of X to solution, chop the last element off both sequences X-chop - chop the last element off X Y-chop - chop the last element off YTo get "hod" from our example, we might do
chop X, chop X, add-n-chop, chop X chop X chop Y, add-n-chop, chop Y, chop Y, add-n-chopThe problem is, however, how do we chose the right move? If the last two elements of X and Y are the same, the add-n-chop is clearly right. Otherwise, we certainly want either chop-X or chop-Y ... but which?
You can dream up some greedy strategies, but for every reasonable greedy choice you come up with, you will be able to find a counterexample. So, now what?
Algorithm: LLCS (Length of the Longest Common Subsequence)
Input: X,m - where X is a sequence of at least m elements
Y,n - where Y is a sequence of at least n elements
Output: k - the length of the longest common subsequence of X1,..,Xm and Y1,...,Yn
if m = 0 or n = 0
return 0
else if X[m-1] = Y[n-1]
return 1 + LLCS(X,m-1,Y,n-1)
else
return max( LLCS(X,m,Y,n-1) , LLCS(X,m-1),Y,n )
This works, as you can see if you try running LLCS.cpp.
However, you'll notice it's pretty slow. In fact, if you
instrument it to count the number of calls to LLCS, you'll find
that it gets out of hand.
$ ./go fourscoresevenyearsago hickorydickorydock The length of the longest common subsequence is 7 number of recursive calls is 46639471764That's 46 billion recursive calls to compute this! What's going on?
T[i][j] = LLCS(X,i,Y,j)If LLCS(X,i,Y,j) hasn't been computed yet, we'll set T[i][j] = -1 to flag that for us. This gives us something like: LLCSmemo.cpp. If we augment this to track the number of recursive calls, things look a lot better!
$ ./gom fourscoresevenyearsago hickorydickorydock The length of the longest common subsequence is 7 number of recursive calls is 419That's about 111 million times faster. I'd call that a god day's work!
Recall that we've phrased the finding of a common subsequence as a sequence of "moves" to make. The problem is, we don't know ahead of time what the best "move" to make is at any given step. However, we discover what at each point is the best "move" as LLCSmemo runs. So what if we augment it to remember which move worked best? In particular, lets create a second table, E, to store that information in. We'll set it up like this:
E[i][j] = 1 if add&chop is the optimum move for X1,...,Xi and Y1,...,Yj
= 2 if Y-chop is the optimum move for X1,...,Xi and Y1,...,Yj
= 3 if X-chop is the optimum move for X1,...,Xi and Y1,...,Yj
How does this help? Well, we can start over and construct the optimum solution by choosing moves, except that now we have the E-table that tells us, for every scenario, what the optimum move is.
An augmented version of the memoized algorithm that does all this is in lcslmemorecover.cpp.