# Unit 1: Discrete Math

**Readings**:

- Graham, Knuth, and Patashnik. *Concrete Mathematics*, Chapter 8.
- Jean Gallier, *An Introduction to Discrete Probability*.


After some introductory remarks, we'll go into some of the underlying math on probability and information theory that will get used over and over throughout the class. We'll just cover the basics here; expect more math as we go along and need to use it!

#### Beginning of Class 1

# 1 Introduction

In this class, we are going to learn about some of the ways random numbers get used in computing. To paraphrase the textbook, programs that make use of random numbers are often *simpler* or *faster* than their non-random counterparts (or both!). We will start by examining what random numbers are and how computer programs can obtain them. Then the bulk of the course will be spent looking at some well-known problems that can be solved more quickly or more simply using randomized algorithms. Finally, we'll look at things from a theoretical standpoint and ask whether there are some larger implications of all this.

But first, let's look at an example that illustrates the kind of way randomness can be used effectively.

## 1.1 Airplane terrorists

Consider the following scenario: An airplane carries 200 passengers. Due to strictly-enforced regulations, each passenger can carry on at most 3 ounces of liquid. But with 60 ounces of liquid, a bomb could be constructed to crash the plane. Therefore, if 20 or more passengers all manage to get 3 ounces of bomb-making liquid on the plane, there is a big problem. Fortunately, the vast majority of flights have no terrorists whatsoever: on at least 999 out of every 1000 flights, not a single passenger carries any bomb-making liquid.

To be clear, the input to any algorithm is going to be a list of 200 passengers, in some arbitrary order. All the algorithm can do is screen a given passenger for bomb-making liquid, and get a yes/no answer.

We want to develop effective ways of detecting and ultimately avoiding flights that are hijacked as described above. There is a way to test if any *single* passenger has bomb-making liquid, but it is somewhat disruptive and so we'd like to perform as few tests as possible.

Therefore we have two (possibly competing) goals:

- Allow only non-hijacked planes to fly
- Screen the fewest number of passengers for bomb-making liquid

The first goal can be classified as *correctness*, and the second as *efficiency*. In designing typical algorithms, we view the correctness requirement as absolute and work hard to optimize the efficiency. A theme we will repeatedly see with randomized algorithms is that they allow **trade-offs between correctness and efficiency**. Often, the loss of a correctness guarantee (a small chance that the algorithm gives the wrong answer) is more than offset by the gain in efficiency.

#### Beginning of Class 2

# 2 Discrete Probability

In order to understand randomness, we need to understand probability. We'll be covering some math concepts as needed throughout the class, but just to get started we need to cover some basics. You are strongly encouraged to read the entire chapter "An introduction to discrete probability", linked in the readings above.

## 2.1 Experiments and Outcomes

An *experiment* is anything for which the outcome may not be certain. This could be something like rolling a die, flipping a coin, dealing a hand of cards in poker, drawing winning lottery numbers, or running a randomized algorithm.

More formally, an experiment is defined as a set of *outcomes*. (This set is usually written as \(\Omega\) and is also called the *sample space*.) Very importantly, the set of outcomes **must** satisfy two properties:

- They must be **exhaustive**, meaning that they contain every possibility.
- They must be **mutually exclusive**, meaning that it's impossible to simultaneously have two different outcomes.

For example:

- Flipping a coin: There are two outcomes, heads or tails.
- Rolling a six-sided die: There are 6 outcomes, one for each side of the die.
- Dealing a hand in poker: There are \(\binom{52}{5} = 2598960\) ways to choose 5 cards out of 52.
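The poker count above is easy to sanity-check in code; this is just a one-line verification sketch using Python's standard library:

```python
from math import comb

# Number of distinct 5-card poker hands: choose 5 of 52, order irrelevant
hands = comb(52, 5)
print(hands)  # 2598960
```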

When we are talking about a complicated process (such as a sophisticated randomized algorithm), the outcomes are always simple; there just might be way too many of them to count. And the outcomes depend on the random choices only, not on whatever we're trying to accomplish. For example, in poker, you're trying to win (perhaps by getting a good hand). Or maybe you're trying to lose. Or maybe you're playing cribbage instead. None of that changes the set of outcomes for a five-card hand that is dealt randomly from a 52-card deck.

## 2.2 Probability

Every outcome in the sample space has some known *probability* that it will occur. This probability is a number between 0 and 1, where 0 means "definitely won't happen" and 1 means "definitely will happen." (Note that 1 = 100%, but in hardcore math mode we just stick with the decimals thank you very much.)

This set of probability assignments is called a *probability distribution*, and formally it's a function \(\Pr\) that maps from the sample space \(\Omega\) to real numbers in the range \([0,1]\).

For some outcome \(\omega\), we write \(\Pr[\omega]\) as the probability that \(\omega\) will happen. Because the set of outcomes has to be exhaustive and mutually exclusive, it's guaranteed that the sum of probabilities of all outcomes must equal 1; that is:

\[\sum_{\omega \in \Omega} \Pr[\omega] = 1\]

The probability distribution of the outcomes is usually known in advance. And frequently, the probability of every outcome is the same. This is called the *uniform distribution*. For example:

- Flipping a coin: \(\Pr[\text{heads}] = \Pr[\text{tails}] = \tfrac{1}{2}\)
- Rolling a six-sided die: The probability of rolling a 2 is \(\tfrac{1}{6}\).

## 2.3 Combining Experiments

Most complex situations, such as playing a complete game or running an algorithm, don't involve just a single randomized experiment but many of them. Fortunately, combining multiple *independent* experiments is really easy.

Say I have experiments \(Exp_1\) and \(Exp_2\) with sample spaces \(\Omega_1\) and \(\Omega_2\). If these experiments are conducted *independently* from each other (so that the outcome of one doesn't in any way influence the outcome of the other), then they are combined as follows:

- The combined sample space is the set of pairs of outcomes from the two experiments, written formally as a cross-product \(\Omega_1 \times \Omega_2\).
- The probability of any pair of outcomes is the
**product of the original probabilities**.

For example, say I roll a 6-sided die with one hand and flip a coin with the other hand. This combined experiment now has 12 outcomes:

\[(1,H), (2,H), (3,H), (4,H), (5,H), (6,H), (1,T), (2,T), (3,T), (4,T), (5,T), (6,T)\]

For example, the probability of rolling a 3 and getting tails is

\(\Pr[(3,T)] = \Pr[3] \cdot \Pr[T] = \tfrac{1}{6} \cdot \tfrac{1}{2} = \tfrac{1}{12}.\)
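The cross-product construction can be spelled out directly; here is a small sketch that builds the 12-outcome combined sample space and multiplies the probabilities:

```python
from itertools import product
from fractions import Fraction

# Combined experiment: (die roll, coin flip) pairs, via a cross-product
die = range(1, 7)
coin = ["H", "T"]
omega = list(product(die, coin))
print(len(omega))  # 12

# Independence: the probability of a pair is the product of the individual probabilities
pr = Fraction(1, 6) * Fraction(1, 2)
print(pr)  # 1/12
```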

## 2.4 Events

So far, none of this should be really surprising or even interesting. That's because we usually don't care about the outcomes - their probabilities are already known to start out with, and are usually uniformly distributed so the probabilities are all the same. Boring.

What we *really* care about is some combination of outcomes that has some meaning relative to whatever game or algorithm or whatever else we're trying to accomplish. For example, in poker, you would really like to get a straight flush. This doesn't represent a single outcome, but multiple possible outcomes, because there are multiple different hands that would give you a straight flush.

An *event* in probability is a combination (subset) of outcomes, that has some meaning. Each outcome itself is called a *simple event*. The probability of an event is the sum of the probabilities of the outcomes that make it up.

Back to poker, there are 40 possible hands (outcomes) that result in a straight flush (the low card can be A up to 10, in any of 4 suits). So the straight flush "event" consists of these 40 hands. And the probability of getting a straight flush is

\[\frac{40}{\binom{52}{5}} = \frac{40}{2598960} \approx 0.000015391.\]

For another example, consider rolling two six-sided dice (independently). We know there are \(6\times 6 = 36\) equally-likely outcomes. What's the chance that the sum of the two dice equals 10? Well, there are 3 outcomes that make this happen: (4,6), (5,5), and (6,4). So the probability of getting a sum of 10 is \(\tfrac{3}{36} = \tfrac{1}{12}\).
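An event really is just a subset of the sample space, and for small spaces we can enumerate it; here is a quick sketch for the two-dice example:

```python
from itertools import product

# All 36 equally-likely outcomes of rolling two six-sided dice
outcomes = list(product(range(1, 7), repeat=2))

# The event "the two dice sum to 10" is a subset of the outcomes
event = [(a, b) for (a, b) in outcomes if a + b == 10]
print(len(event), len(outcomes))  # 3 36
```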

## 2.5 Random variables

We've already talked about one function on the outcomes of an experiment, which is the probability measure \(\Pr\) that maps each outcome to some number between 0 and 1.

Sometimes we also want to associate some *numerical value* with each outcome, which has nothing to do with its probability. For example, each possible lottery ticket might be associated to the amount of money that ticket will win. The probabilities of every possible ticket are the same, but the winnings are definitely not!

For largely historical reasons, these functions are not called (or written as) functions, but instead are called *random variables*. They are usually written with a capital letter like \(X\) or \(Y\).

For example, if we go back to the experiment of rolling two six-sided dice, define the random variable \(X\) to be the sum of the two dice. We just figured out the probability that this equals 10, which can now be written conveniently as

\[\Pr[X = 10] = \tfrac{1}{12}.\]

The random variable notation is a little strange to get used to, but it also allows us to write down more complicated events easily. For example, consider the event that the two dice add up to *at least* 10. There are actually 6 outcomes that make this happen: the three ways to make 10, plus the two ways (5,6) and (6,5) to make 11, and the single way (6,6) to make 12. Therefore

\[\Pr[X \ge 10] = \tfrac{6}{36}.\]

In this way, random variables are nice not only as a way of assigning a value to each outcome, but also as a way of describing more complicated events. You just have to remember one thing (repeat after me): **A random variable is a function, not a variable**.
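To drive home that a random variable is a function, here is a sketch that defines \(X\) literally as a Python function on outcomes and recomputes \(\Pr[X \ge 10]\):

```python
from itertools import product
from fractions import Fraction

# A random variable is a function on outcomes:
# here X maps a pair of dice to their sum
def X(outcome):
    return outcome[0] + outcome[1]

outcomes = list(product(range(1, 7), repeat=2))
favorable = [w for w in outcomes if X(w) >= 10]
print(Fraction(len(favorable), len(outcomes)))  # 1/6, i.e. 6/36
```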

Once we have a random variable, it's natural to ask about some *statistical measures* on that random variable, such as mean, mode, variance (standard deviation), and so on. The mean, or average, is the most important one, and it has a special name: the *expected value*.

Formally, the expected value of a random variable \(X\), written as \({\mathop{\mathbb{E}}}[X]\), is a *weighted average* formed by summing the value of \(X\) for each outcome, times the probability of each outcome. Here's a complete definition:

**Definition: Expected value.** Suppose an experiment has \(n\) outcomes, each of which occurs with probability \(p_i\). And suppose that a random variable \(X\) associates with each outcome some value \(v_i\). Then the *expected value* of \(X\) is the sum of the value of each outcome, times the probability of that outcome:

\[{\mathop{\mathbb{E}}}[X] = \sum_{i=1}^n p_i\ v_i\]

Note we can also write this sum in another way, where the index \(i\) now ranges over all possible values of \(X\):

\[{\mathop{\mathbb{E}}}[X] = \sum_i i \cdot \Pr[X = i]\]
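The definition translates directly into code; this sketch computes the expected sum of two dice as the weighted sum over all 36 outcomes:

```python
from itertools import product
from fractions import Fraction

# Uniform distribution: each of the 36 outcomes has probability 1/36
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))

# E[X] as the weighted sum of p_i * v_i, with X = sum of the two dice
expectation = sum(p * (a + b) for (a, b) in outcomes)
print(expectation)  # 7
```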

#### Beginning of Class 3

## 2.6 More distributions

We already said that the probability distribution of the set of outcomes is usually (but not always) *uniform* - meaning every outcome is equally likely. A random variable also defines a probability distribution, if we consider the probability that the random variable takes on each possible value. And these distributions are usually *not* uniform.

For example, consider the experiment of flipping three coins, and the random variable \(X\) defined as the number of heads among the three coin flips. For each of the \(2^3 = 8\) outcomes, we can easily see the value of this random variable:

Outcome | \(X\)
---|---
HHH | 3
HHT | 2
HTH | 2
HTT | 1
THH | 2
THT | 1
TTH | 1
TTT | 0

The distribution of the outcomes is uniform - each occurs with probability \(\tfrac{1}{8}\). But the distribution of \(X\) is seen by flipping this table around, adding together the rows with the same value:

\(X\) | Num. of Outcomes | \(\Pr[X=i]\)
---|---|---
0 | 1 | 1/8
1 | 3 | 3/8
2 | 3 | 3/8
3 | 1 | 1/8

In other words, you're much more likely to get 1 head or 2 heads than you are to get 0 or 3.
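The whole table can be generated by enumeration; here is a short sketch that counts heads across all \(2^3 = 8\) outcomes and tallies the distribution of \(X\):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Enumerate all 8 outcomes of three coin flips; X counts the heads
flips = list(product("HT", repeat=3))
counts = Counter(seq.count("H") for seq in flips)

for heads in sorted(counts):
    print(heads, Fraction(counts[heads], len(flips)))
# 0 1/8
# 1 3/8
# 2 3/8
# 3 1/8
```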

This distribution actually has a name - it's called a *binomial distribution* with parameters \(p=.5\) and \(n=3\). You can find much more about the binomial distribution on Wikipedia.

Other distributions that might come up in this class, along with my pithy description, are:

- Uniform distribution (rolling a fair \(n\)-sided die)
- Bernoulli distribution (flipping a weighted coin)
- Binomial distribution (flipping a weighted coin \(n\) times and counting how many heads)
- Geometric distribution (flipping a weighted coin until the first time it comes up heads, and counting how many flips that took)
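To make the last two descriptions concrete, here is a sketch of samplers for the Bernoulli and geometric distributions (the function names `bernoulli` and `geometric` are just for illustration):

```python
import random

# bernoulli(p): one flip of a coin that lands heads with probability p
def bernoulli(p):
    return 1 if random.random() < p else 0

# geometric(p): number of flips of that coin until the first heads
def geometric(p):
    flips = 1
    while random.random() >= p:
        flips += 1
    return flips

# The average of many geometric(p) samples should be near 1/p
sample = [geometric(0.5) for _ in range(10000)]
print(sum(sample) / len(sample))  # near 2
```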

## 2.7 Calculating probabilities and expectations

There are going to be many, many times when we want to compute the probability of some event, or the expected value of some random variable. And of course these skills are useful in a variety of real-life situations, not just in analyzing randomized algorithms.

There are basically three paths to take to perform one of these calculations:

- List out every one of the outcomes, and then add up the probabilities (or the weighted sum, for an expectation). Since we always know the probability of the outcomes - they're usually all equally likely - this is mostly a matter of *counting*.
- Estimate the probability using some approximations or *bounds*. We will see some of these as we go along in this class, and if you look up the distributions mentioned above you can find links to some popular bounds or estimates for those distributions.
- Approximate by actually running an experiment. This involves programming of course, and is a really excellent way to check your work from one of the first two methods.

Methods 1 and 3 are immediately available to you, and based on the definitions you should *technically* be able to calculate any probability or expectation that we could come up with. Of course the problem will often be that there are far too many possibilities to list them all out manually. Throughout the course, we will learn some counting techniques and math tricks to make method 1 easier, and we'll also see in many cases that just having an estimate or a bound (method 2) is good enough.
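Method 3 is worth seeing once in code; this sketch estimates the two-dice probability from earlier by simulation (the trial count is arbitrary, and more trials give a tighter estimate):

```python
import random

# Monte Carlo estimate of Pr[two dice sum to 10]
def estimate(trials=100_000):
    hits = sum(1 for _ in range(trials)
               if random.randint(1, 6) + random.randint(1, 6) == 10)
    return hits / trials

print(estimate())  # close to 1/12, about 0.083
```

Comparing the printed estimate against the exact answer \(\tfrac{1}{12}\) is exactly the kind of check-your-work step method 3 is good for.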

# 3 Back to algorithms

Now that we understand a little bit about probability, let's go back to the original example and see how probabilistic concepts show up in designing and analyzing algorithms.

## 3.1 Deterministic algorithm

A *deterministic algorithm* is the kind of algorithm you are used to: it takes some input, does some computation that depends only on that input, and then produces some output. The important thing here is that the actions taken by the algorithm depend only on whatever the input is. So if you run the algorithm twice on the same input, the algorithm will do exactly the same thing.

In the case of the "airplane terrorists", we came up with the following deterministic strategy:

- Screen the first 181 passengers
- If any of the first 181 has bomb liquid, screen all 200
- If at least 20 passengers have bomb liquid, don't let the plane fly.

This algorithm is *correct* because it's guaranteed not to let a hijacked plane fly. If the plane has at least 20 terrorists, we will definitely screen at least one of them on step 1, and then the entire plane will be checked and ultimately grounded. Conversely, if the plane does *not* have at least 20 terrorists, there is no way that it will be grounded by our algorithm.

A standard, *worst-case analysis* of this algorithm is pretty straightforward: the worst case is that we have to screen all 200 passengers. So by this analysis our algorithm is no better than just screening every passenger, every time. That seems wrong though, because we know that most flights won't have any terrorists at all.

So instead, we will do a *probabilistic analysis* of the number of passengers screened. Rather than just focusing on the worst-case cost, we will instead focus on the "expected" cost, which corresponds roughly to the average cost over many, many flights.

Using our terminology from probability theory here, the *experiment* in this case will be the choice of which passengers (if any) have bomb-making liquid. And the random variable \(X\) is defined as the number of passengers screened by the algorithm.

In our case, the two possible events are "there is a hijacker within the first 181 passengers" or "there isn't a hijacker in the first 181 passengers", as this is what determines the total number of passengers screened. From the problem definition, the probability that there is no hijacker on the plane at all - and so certainly not one within the first 181 passengers - is at least 999/1000. Therefore the expected cost is at most:

\[{\mathop{\mathbb{E}}}[\text{passengers screened}] \le \frac{999}{1000}\cdot 181 + \frac{1}{1000}\cdot 200 = 181.019\]

In other words, the number of passengers screened *on average* is less than 182.
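The bound above can be double-checked with exact arithmetic; here is a one-line sketch using Python's `fractions` module to avoid floating-point fuzz:

```python
from fractions import Fraction

# Expected-cost bound for the deterministic algorithm, in exact arithmetic
expected = Fraction(999, 1000) * 181 + Fraction(1, 1000) * 200
print(float(expected))  # 181.019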

## 3.2 Randomized algorithm

The basic idea of a randomized algorithm is pretty obvious here: we will *randomly* select which passengers to screen. In class we looked at a variety of ways to do this random selection. As is often the case (in math, in life, in computer science, ...) the best approach is the simplest one:

- Select \(n\) passengers at random
- Screen these \(n\) passengers for bomb-making liquid
- If any of the \(n\) has bomb liquid, screen all 200 passengers
- If at least 20 passengers have bomb liquid, don't let the plane fly.

Here \(n\) is some *constant* that is selected beforehand. The big question is: how large should \(n\) be? If it's small, then the algorithm will be rather efficient (not many passengers screened), but it will also not be very correct (some planes will blow up). On the other hand, if \(n\) is very large, the algorithm will be more correct but less efficient.

One extreme is the equivalent of the deterministic algorithm: let \(n=181\). Then there is no chance of failure, so the algorithm is perfectly correct, but not very efficient. The other extreme would be to set \(n=0\) and not screen any passengers. Then the algorithm is quite efficient, but there is a one-in-a-thousand chance of failure.
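We can quantify this trade-off under one simplifying assumption: a hijacked flight carries exactly 20 hijackers (the minimum needed). Then a random sample of \(n\) passengers misses every hijacker exactly when all \(n\) are drawn from the 180 non-hijackers, giving a miss probability of \(\binom{180}{n}/\binom{200}{n}\). This sketch tabulates that quantity (the helper name `miss_probability` is just for illustration):

```python
from math import comb

# Assumption: a hijacked flight has exactly 20 hijackers among 200 passengers.
# Sampling n passengers fails to catch anyone only when all n come from
# the 180 non-hijackers.
def miss_probability(n, passengers=200, hijackers=20):
    return comb(passengers - hijackers, n) / comb(passengers, n)

for n in (10, 50, 100):
    print(n, miss_probability(n))
```

Multiplying by the (at most) 1/1000 chance that the flight is hijacked at all gives the overall failure probability for each choice of \(n\), under this assumption.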

What we see is something very typical in randomized algorithms: a trade-off between correctness and efficiency. It is easy to view this as a shortcoming of randomized algorithms, since the algorithm will fail with some (small) probability. But in fact it is a *feature* that we don't get with deterministic algorithms. In Problem 3, you showed that the deterministic algorithm is completely uncompromising: if we improve the efficiency by even the smallest amount, the probability of failure immediately jumps to the maximum. On the other hand, Problem 2 showed that, by only scanning about one-fourth of the passengers, the probability of failure is less than one in a million.

The general theme here is called the **principle of many witnesses** and we will see it crop up repeatedly in this class. In this scenario, we are taking advantage of the fact that many hijackers must collude in order to blow up the plane: the 20 (or more) hijackers are the "many witnesses". Deterministically, it can be difficult to find a single witness, but with randomization, it becomes much easier.

The other principle at work here is called **foiling an adversary**. The consistency of deterministic algorithms is actually their weakness: they can be fooled by someone who knows the algorithm! On the other hand, if the strategy is random, then while it can be beaten *some of the time*, there is no way to beat it *all of the time*. This is a powerful idea behind many randomized algorithms that we will see.