Unit 5: Verification

This is the archived version of this course from the Spring 2013 semester. You might find something more recent by visitning my teaching page.

**Readings**:- MR, Sections 7.1, 7.2, 7.4, and 7.6

This unit is all about using randomization to *check* or *confirm* things. The basic principle is used for example by your bank to verify your identity on the phone. They have a great deal of information on you, but it would take a long time to go through every question to check your answer. Instead, they will just choose a random question or two, and if you get those correct, they believe you would probably get the others correct as well.

These techniques are also related to the idea of *sampling* from statistics. When some pollster wants to know how the "average American" feels on some current issue or politician, they don't knock on everyone's door and ask how they feel. Instead, they randomly choose a small portion of the population and ask them.

Now imagine that a pollster asked a question of some population subset and they all had exactly the same answer. At what point would it be reasonable to conclude that *everyone* in the country feels that same way? That's what the verification techniques of this unit are about.

Caution: Notes from here on are pretty thin. I will fill them in if I get time but I wanted to get the important stuff "out there" for you all.

Cryptographically secure protocols for authenticating someone often come down to a *challenge-response* set up. If Alice wants to make sure she is talking to Bob, she issues a randomly chosen "challenge" to Bob, that only he could answer. Bob then uses his secret information to compute the "response" to that challenge, which Alice can then confirm as the right answer, authenticating Bob.

The key principle here is that the challenge is chosen randomly, so that no one who observes this protocol can later impersonate Bob, unless the same random challenge is chosen again (which is very unlikely).

This is another kind of power of randomness: the power to check things and not be fooled. See, if there were no randomness, then the challenge would always be the same (or at least it might be predictable), so the authentication protocol would not be secure. Even outside the realm of security, we still want to verify things from time to time. Hence this unit.

Say we have two big integers \(x\) and \(y\), and you claim the answer is \(z\). How could I check your work?

The simplest way would be just to compute \(x\) times \(y\) myself, and compare the result to \(z\). But if these are really huge integers, that might take a while. Is there a simpler way?

Of course there is! The algorithm is as follows:

- Choose a random prime \(p\)
- Compute \(x\bmod p\), \(y\bmod p\), and then multiply these and take mod \(p\) once more to get \(xy\bmod p\).
- Compute \(z\bmod p\)
- Compare the two results. If they're the same, then the answer is
*probably*right. If the two results are different, then the answer is*definitely*wrong.

The question is: how large should the prime \(p\) be to guarantee a low probability of false positives? We saw that a false positive occurs only if the difference

\[d = xy - z\]

is *not* equal to zero, and the prime \(p\) divides that difference,

\[p | (xy-z)\]

So the question is, how many different primes are "unlucky" in that they divide \(xy-z\)? Since each prime is at least 2, any number that is divisible by \(k\) different primes must be at least \(2^k\) in value. Therefore the total number of primes that could divide \(xy-z\) is at most \(\lg(xy-z)\). If we choose \(p\) from a set that has at least twice this many prime numbers, then we'll be set.

In particular, this algorithm hits a false positive with probability at most \(1/2\) if \(p\) is chosen from the set of all primes less than

\[\max\left(100, 4\lg x\ln\lg x, 4\lg y\ln\lg y, 2\lg z\ln\lg z\]

(Yes, that is a \(\ln\) in there. You could get a much more precise formula here by using the so-called Prime Number Theorem.)

A polynomial (univariate, in \(x\)) is the sum of coefficients times powers of \(x\), like

\[f(x) = 4 + x - 2x^2 + 5x^3 + 2x^4 - x^5\]

Now we can consider the same problem as with integers, but with polynomials: how to check whether \(f(x) \cdot g(x) = h(x)\), given those three polynomials, without actually computing the full product of \(f(x)\cdot g(x)\)?

Here the verification is actually even simpler: just pick a random value \(c\) for \(x\) and carry out the math. Then you check if the numbers are the same. As in, you compute \(f(c)\cdot g(c)\), which is an integer, and then see if that's the same integer you get from \(h(c)\).

How could this fail? Again, it has to do with the difference, \(h(x) - f(x)g(x)\). This difference is a polynomial, and if we have a false positive at \(c\), then this is a non-zero polynomial that evaluates to zero at \(x=c\). Think back to calculus - this means that \(c\) is a *root* of the polynomial \(h-fg\). The number of distinct roots is bounded by the *degree* of the polynomial. This is the highest exponent of \(x\) that appears in the polynomial.

So the number of values for \(c\) that could produce a false positive is at most two times the degree of \(h-fg\), which is at most 2 times the degree of \(h\), 4 times the degree of \(f\), or 4 times the degree of \(g\).

There is still a problem: the computed values like \(f(5)\) or \(f(-3)\) or whatever your randomly chosen evaluation points are will be really really huge. How can we get around this? We compute modulo a prime number \(p\), of course!

There are two constraints on the size of \(p\). First, it must be larger than the largest coefficient in \(f\), \(g\), or \(h\). Second, it must be large enough so that there are enough random values \(c\) to choose to avoid failure. Using what we did above, this means that we have to have

\[p \ge \max\left(2\deg h, 4\deg f, 4\deg g\right)\]

So overall, what you do is:

- Pick a prime that is larger than the degrees as specified above, and is also larger than any of the coefficients. (Note: \(p\) does not have to be chosen randomly.)
- Pick a random "evaluation point" \(c\) from the set \(\{0,1,2,\ldots,p-1\}\)
- Compute \(f(c)\cdot g(c) \bmod p\) and check whether that is squal to \(h(c)\bmod p\).

If you do this right, the probability of a false positive is less than \(1/2\), like before. Repeating this process multiple times - even with the same \(p\) - makes the probability of failure get smaller and smaller.

You can use the integer or polynomial verification techniques to compute a "fingerprint" of an entire file. Really, this is just computing a hash function of a really big key - one that might be multiple gigabytes in length!

In class we talked about how this is useful for example in checking the integrity of downloaded files or for saving space in filesystem backups.

To do a whole-file fingerprint, you first look at the file as raw data - a series of bytes, for example. Then you can treat each of these bytes as coefficients of a polynomial. Then you can use the polynomial verification technique above to compute a small "hash" of the entire file. Do this a couple times, and you have yourself a fingerprint.

The key here is that the size of the fingerprint can be *much, much* smaller (specifically, logarithmically smaller) than the size of the original file, and still guarantee a low probability that any other file would ever have that same fingerprint.

The key here is that, although the parameters (like \(p\) and the various \(c\) values) would be chosen randomly for a verification algorithm, the must stay the same in order to compute *consistent* fingerprints! So these will be chosen randomly for the fingerprinter (maybe), but after they are chosen they will be hard-coded into the program, never to change again.

Here's a reminder of how the Karp-Rabin algorithm works:

We want to find the first occurrence of a substring (or "needle") \(s\) of length \(m\) in a text \(T\) ("haystack") of length \(n\). The general plan is to:

- Choose a prime \(p\) to define a hash function \(h(x) = x\bmod p\)
- Write the substring \(s\) and text \(T\) as strings of bits (0 and 1) so that we can treat them as integers
- Compute \(h(s)\)
- Compute each \(h(x_i)\) for the every length-\(m\) substring \(x_i\) in the text \(T\).
- When there is a potential match, i.e., \(h(x) = h(x_i)\), then you have to actually compare the entire string \(s\) to \(x_i\) to see if it's really a match.

The trick is to compute the \(h(x_i)\) using a "rolling hash function" so that you use the value of the previous \(h(x_{i-1})\) to compute each \(h(i)\) in *constant time*. That makes the cost of this whole algorithm \(O(m)\) - fast!

Here's how the rolling hash function works. In what I have below, the `bi`

values are bits, 0's and 1's.

```
If you have x0 = b0 b1 b2 b3 b4...b[k-1]
and you have x1 = b1 b2 b3 b4 ... b[k-1] bk
Then we know that
x1 - 2*x0 = bk - 2^k * b0
Therefore
x1 = 2*x0 - 2^k * b0 + bk
Taking this mod p, we get
h(x1) = x1 mod p = 2*(x0 mod p) - 2^k * b0 + bk mod p
If you set c=(2^k) mod p, then
h(x1) = (2*h(x0) - c*b0 + bk) mod p.
That's the formula.
```