Randomness and Probability in Data Structures

What is randomness? Fundamentally, it is about unpredictability.
In many ways, unpredictability comes down to a lack of knowledge, so in a sense randomness is a property of us, not of the outside world.
With regard to data structures, several important structures exploit randomness, and other important structures are probabilistic: they return the right answer only with a certain probability.
How random numbers are generated is not at all obvious. Even with physical systems, it can be tricky to get right.
Exercises:
1. Using two six-sided dice, describe an algorithm to generate integers in the range \([1,12]\) with a uniform distribution.
2. Using two six-sided dice, describe an algorithm to generate integers in the range \([1,10]\) with a uniform distribution.
Random Number Generation
Pseudo-random and true random numbers.

Pseudo-random: Numbers that appear random but are generated by a deterministic process. They really are a sequence, like the Fibonacci numbers. In particular, we define the sequence as a series of states, where each state is a function of the previous one: \(X_{n+1} = f(X_n)\)
The actual state is unknown to the user of the generator.
Instead, the user gets a number that is a function of the state: \(R_n = g(X_n)\), where \(g\) is, ideally, not reversible.
Pseudo-random numbers are generated using an algorithm and an initial seed value.
The seed tells the algorithm where to start in the sequence.
Given the same seed and algorithm, the sequence of numbers will always be the same. This is useful for debugging and repeatability. When we don't want those things, we seed with the time.
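As a small illustration of seeding, the sketch below uses Python's standard `random.Random` class. The helper name `roll_dice` and the seed value 42 are made up for this example.

```python
import random
import time

def roll_dice(seed, count=5):
    """Return `count` die rolls from a generator started at `seed`."""
    rng = random.Random(seed)          # private generator, independent of the global one
    return [rng.randint(1, 6) for _ in range(count)]

# Same seed, same algorithm: the two sequences are identical.
first_run = roll_dice(42)
second_run = roll_dice(42)

# Seeding with the clock gives a different starting point on each run.
clock_run = roll_dice(time.time_ns())
```

Fixing the seed is what makes a failing randomized test reproducible. Seeding from the clock gives up that repeatability.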
Linear Congruential Generator (LCG):
- One of the oldest and simplest methods.
- Formula: \( X_{n+1} = (aX_n + c) \mod m \) where:
  - \( X_n \) is the current number.
  - \( X_{n+1} \) is the next number.
  - \(X_0\) is the seed.
  - \( a \), \( c \), and \( m \) are constants.
- The choice of parameters \( a \), \( c \), and \( m \) determines the quality and period of the generator.
- Advantages: Fast and requires minimal memory.
- Disadvantages: Can have short periods and poor randomness quality with bad parameter choices.
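A minimal sketch of the LCG formula above, with a helper that measures the period. The tiny modulus \(m = 16\) and the two parameter sets are chosen only so the whole cycle is easy to see. They are illustrative values, not recommended constants.

```python
from itertools import islice

def lcg(seed, a, c, m):
    """Yield X_1, X_2, ... where X_{n+1} = (a * X_n + c) mod m and X_0 = seed."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

def period(seed, a, c, m):
    """Length of the cycle the generator eventually falls into."""
    seen = {}                      # state -> step at which it first appeared
    x, step = seed, 0
    while x not in seen:
        seen[x] = step
        x = (a * x + c) % m
        step += 1
    return step - seen[x]

first_four = list(islice(lcg(0, a=5, c=3, m=16), 4))
good_period = period(0, a=5, c=3, m=16)   # visits every value below m
bad_period = period(0, a=7, c=3, m=16)    # falls into a much shorter cycle
```

With \(a = 5\) the generator visits all 16 states before repeating. With \(a = 7\) and the same \(c\) and \(m\), it cycles through only 4 of them.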
Mersenne Twister:
- A Mersenne prime is a prime number of the form \(2^p-1\), where \(p\) is prime.
- The Mersenne Twister is based on the prime number \(2^{19937}-1\). It is designed so that the period is that number. That is far larger than the period of our other methods.
- It doesn't use arithmetic, but works on the actual bits: shifting, ANDing, and XORing.
- It also maintains a much more complicated state, consisting of 624 different 32-bit numbers.
- But other than that, it works similarly to other RNG algorithms. When you want a random number, you apply a function to the state to get a new state, and then apply a function to the new state to get the output.
- To generate a new state you loop through the state numbers, take the most significant bit of each number and concatenate it with the lower 31 bits of the next:
```
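    # MT is the list of n 32-bit state words
    n, m = 624, 397            # state size and twist offset
    upperMask = 0x80000000     # most significant bit of a 32-bit word
    lowerMask = 0x7FFFFFFF     # the remaining lower 31 bits
    for i in range(n):
      # top bit of this word joined with the low 31 bits of the next (wrapping)
      x = (MT[i] & upperMask) + (MT[(i + 1) % n] & lowerMask)
      xA = x >> 1
      if x % 2 != 0:           # lowest bit set: mix in the twist constant
        xA = xA ^ 0x9908B0DF
      MT[i] = MT[(i + m) % n] ^ xA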
```
  This is called the twist.
- If \(x\) is the next number taken from the state, we can generate a random number \(z\) by:
  \[ \begin{aligned} y &\leftarrow x \oplus (x \gg 11)\\ y &\leftarrow y \oplus ((y \ll 7) \wedge \text{9D2C5680}_{16})\\ y &\leftarrow y \oplus ((y \ll 15) \wedge \text{EFC60000}_{16})\\ z &\leftarrow y \oplus (y \gg 18) \end{aligned} \]
  This is known as the temper.
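Putting the twist and the temper together, a complete 32-bit generator might look like the sketch below. The class name `MT19937`, the method names, and the default seed are choices made for this example. The seeding recurrence in `__init__` is not covered in these notes; it is the one I recall from the reference implementation, so treat it as an assumption.

```python
class MT19937:
    """Sketch of the 32-bit Mersenne Twister."""

    N, M = 624, 397            # state size and twist offset
    UPPER_MASK = 0x80000000    # most significant bit
    LOWER_MASK = 0x7FFFFFFF    # lower 31 bits
    MATRIX_A = 0x9908B0DF      # constant XORed in during the twist

    def __init__(self, seed=5489):
        # Assumed seeding step: spread the seed across all N state words.
        self.mt = [seed & 0xFFFFFFFF]
        for i in range(1, self.N):
            prev = self.mt[i - 1]
            self.mt.append((1812433253 * (prev ^ (prev >> 30)) + i) & 0xFFFFFFFF)
        self.index = self.N    # forces a twist before the first output

    def twist(self):
        """Regenerate all N state words in place."""
        for i in range(self.N):
            x = (self.mt[i] & self.UPPER_MASK) | (self.mt[(i + 1) % self.N] & self.LOWER_MASK)
            x_a = x >> 1
            if x & 1:
                x_a ^= self.MATRIX_A
            self.mt[i] = self.mt[(i + self.M) % self.N] ^ x_a
        self.index = 0

    def next_u32(self):
        """Return the next tempered 32-bit output."""
        if self.index >= self.N:
            self.twist()
        y = self.mt[self.index]
        self.index += 1
        # The temper: scramble the state word before handing it out.
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        y ^= y >> 18
        return y & 0xFFFFFFFF

rng = MT19937(5489)
sample = [rng.next_u32() for _ in range(3)]
```

The state is refreshed only once every 624 outputs. In between, each call just tempers the next stored word.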

Limitations and potential pitfalls in random number generation

Periodicity: Pseudo-random number generators eventually repeat their sequence after a certain period. If you're not careful, this period can be short.
Initialization: If the seed is predictable, then the output is predictable. If you seed multiple generators with the current time at the same moment, they will all produce the same sequence.
Non-uniformity: You have to be careful to make sure all outputs are equally likely.
Cryptography: You need special generators with special properties to be cryptographically secure.
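One common way to lose uniformity is to squeeze a generator's range into a smaller one with `%`. The sketch below assumes a source that produces 8 equally likely values (3 random bits) and a target of 6 outcomes. The function names are hypothetical, and rejection sampling is shown as one standard remedy, not the only one.

```python
import random
from collections import Counter

def modulo_counts(source_size, target_size):
    """How many source values land on each output when reduced with %."""
    return Counter(v % target_size for v in range(source_size))

def uniform_below(target_size, source_size, rng):
    """Rejection sampling: redraw values from the incomplete last block."""
    limit = source_size - (source_size % target_size)   # largest multiple of target_size
    while True:
        v = rng.randrange(source_size)
        if v < limit:
            return v % target_size

# Outputs 0 and 1 each have two source values behind them; 2..5 have only one.
counts = modulo_counts(8, 6)

rng = random.Random(1)
draws = [uniform_below(6, 8, rng) for _ in range(1000)]
```

The biased version makes 0 and 1 twice as likely as the other outputs. The rejection version pays for uniformity with an occasional extra draw.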

Bloom Filters

Bloom filters are a data structure for determining whether something is a member of a set. For example, with a large database on disk, when a user makes a query, instead of going straight to the disk (slow!) we could check the filter first, and only go to disk if the filter says the item might be there.
They work similarly to hash tables.
They don't explicitly use random numbers, but they are "random" in the sense that you'll only probably get the right answer: a "no" is always correct, but a "yes" may be a false positive.
Initializing:
- Create an array of \(m\) bits, all initialized to 0.
- Choose \(k\) different hash functions, each mapping an item to a position in \([0, m-1]\).
Insertion:
- When inserting/adding something to the set, hash it with all \( k \) hash functions.
- Set the bits at all \( k \) resulting positions to 1.
Membership:
- To check if something is in a Bloom filter, perform the same process: hash the item with all \( k\) hash functions and check the bits.
- If any bit is 0, then the item is definitely not a member of the set.
- If all the bits are 1, then the item probably is in the set.
Deletion:
- You cannot delete something from a standard Bloom filter, because clearing a bit could also remove other items that hash to it. Counting Bloom filters do support deletion.
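The operations above can be sketched in a few lines of Python. This is one possible implementation, not a reference one: the class and method names are made up, the \(k\) hash functions are simulated by prefixing a counter to the item before hashing it with SHA-256, and a `bytearray` with one byte per bit stands in for a real bit array to keep the code short.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                      # number of bits
        self.k = k                      # number of hash functions
        self.bits = bytearray(m)        # all zeros; one byte per "bit" for simplicity

    def _positions(self, item):
        """Yield the k bit positions for an item."""
        data = str(item).encode("utf-8")
        for i in range(self.k):
            # Assumption: salting one strong hash with i acts like k different hashes.
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        """False means definitely absent; True means probably present."""
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1000, k=3)
bf.add("apple")
found = bf.might_contain("apple")       # an inserted item is never reported absent
```

Note that there is no `remove` method, for the reason given under Deletion.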
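False positive rate:
Assuming each hash function picks a position uniformly at random, the probability that a particular bit is set to 1 by a particular hash function during an insert is \( \frac{1}{m}\),
so the probability that a particular bit is not set to 1 by a particular hash function during the insertion of an element is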
\( p = 1 - \frac{1}{m} \)
The probability that a particular bit is not set to 1 by any of the hash functions during the insertion of an element:
\( p^k \)
Probability that a particular bit is still 0 after inserting \( n \) elements:
\( (p^k)^n = p^{kn} \)
So the probability that it is 1 is
\( q = 1 - p^{kn} \).
Probability of a false positive (all hash functions map to bits that are already set to 1):
\( \text{False Positive Rate} = q^k \)
\( \text{False Positive Rate} = (1 - (1 - \frac{1}{m})^{kn})^k \)
Increasing \( m\) always decreases the false positive rate. Increasing \( k\) helps only up to a point: more hash functions mean more bits must match, but they also fill the array faster, and for fixed \( m\) and \( n\) the rate is lowest near \( k = \frac{m}{n}\ln 2 \). Of course there are always trade-offs: increasing \( k\) increases time, and increasing \( m\) increases space. Often we estimate the number of things we think we will store, decide how big we want to make the table, select our acceptable false positive rate, and then select a \( k\) that gives us a false positive rate under the target.
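The selection procedure can be tried out numerically. The sketch below evaluates the false positive formula directly and searches for the smallest \(k\) that meets a target. The function names, the search limit `max_k`, and the example sizes (\(m = 10000\) bits, \(n = 1000\) items, 1% target) are all assumptions made for illustration.

```python
def false_positive_rate(m, n, k):
    """(1 - (1 - 1/m)^(k*n))^k for m bits, n inserted items, k hash functions."""
    p = 1 - 1 / m                       # chance one hash misses a given bit
    q = 1 - p ** (k * n)                # chance a given bit is 1 after n inserts
    return q ** k

def choose_k(m, n, target, max_k=64):
    """Smallest k whose false positive rate is at or below target, or None."""
    for k in range(1, max_k + 1):
        if false_positive_rate(m, n, k) <= target:
            return k
    return None

m, n = 10_000, 1_000
rates = {k: false_positive_rate(m, n, k) for k in (1, 7, 30)}
k_for_one_percent = choose_k(m, n, target=0.01)
```

With these sizes the rate drops as \(k\) goes from 1 to 7 and rises again by \(k = 30\), so adding hash functions stops helping once the array starts to fill up.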
Counting Bloom filters - allow removal, but take up more space.

Oh, by the way:

from bitarray import bitarray
bit_array = bitarray('01101')
