Randomness and Probability in Data Structures
- What is randomness? We usually define it in terms of
unpredictability
- With regard to data structures, there are several
important structures that exploit randomness, and
alsoimportant data structures that are probabilistic in that
they return the right answer only within a certain probability.
- What is Randomness? Funamentally, it's about unpredicability.
- In many ways then, unpredicability is about the lack of knowledge,
so in a sense, randomness is a property of us not the outside world.
- How random numbers are generated is not at all obvious. Even with
physical systems, it can be tricky to get right.
Excercises:
- Using two six-sided dice, describe an algorithm to
generate integers in the range \([1,12]\) with a uniform distribution.
- Using two six-sided dice, describe an algorithm to
generate integers in the range \([1,10]\) with a uniform
distribution.
Random Number Generation
- Pseudo-random and true random numbers.
- Pseudo-random: Numbers that appear random but are generated by
a deterministic process. The really are a sequence, like
Fibonacci nuymbers. In particular, we define the sequence as
series of states, where each state is a function of the
previous: \(X_{n+1} = f(X_n)\)
- The actual state is unknown to the user of the generator.
- Instead, the user gets a number that is a function of the
state: \(R_n = g(X_n)\) such that \(g(X)\) is not reversible.
- Generated using algorithms and initial seed values.
- The seed tells the algorithm where to start in the sequence
- Given the same seed and algorithm, the sequence of numbers
will always be the same. Useful for debugging and
repeatability. When we don't want those things, we see with the time.
- Linear Congruential Generator (LCG):
- Limitations and potential pitfalls in random number generation
- Periodicity: Pseudo-random number generators eventually
repeat their sequence after a certain period. If you're not
careful, this period can be short.'
- Initialization: If the seed is predictable, then the
generation is predictable. If you seed multiple generators with
the time at the same time, they will copy each other.
- Non-uniformity: You have to be careful to make sure all
outputs are equally likely.
- Cryptography. You need special generators with special
properties to be cryptographically secure.
Bloom Filters
- Bloom Filters are a data structure for determining if
something is a mbmber of a set. For example, with a large database
on disk, when a user makes a query, instead of going straight to
the disk (slow!) we could check the filter first, and only go to
disk if we know the item is there.
- They work similarly to hash-tables
- They don't explicitly use random numbers, but they are "random"
in the sense that you'll only probably get the right answer.
- Initializing:
- Array of \(m\) bits, all initialized to 0.
- Generate \(k\) different hash functions.
- Insertion
- When inserting/adding something to the set, hash it with
all \( k \) hash functions
- set all of those bits to 1.
- Membership
- To check if something is in a Bloom filter, perform the
same process - hash the item with all \( k\) hashes and check
the bits.
- If any bit is 0, then the item is not a member of the set.
- If all the bits are 1, then the item probably is in
the set.
- Deletion - You cannot delete something from a Bloom filter,
though counting filters can.
- False positive rate:
Probability that a bit is set to 1 during an insert is \( \frac{1}{m}\)
so, the probability that a particular bit is not set to 1 by a
particular hash function during the insertion of an element is
\( p = 1 - \frac{1}{m} \)
The probability that a particular bit is not set to 1 by any of the hash functions during the insertion of an element:
\( p^k \)
Probability that a particular bit is still 0 after inserting \( n \) elements:
\( (p^k)^n = p^{kn} \)
So the probability that it is 1 is
\( q = 1 - p^{kn} \).
Probability of a false positive (all hash functions map to bits that are already set to 1):
\( \text{False Positive Rate} = q^k \)
\( \text{False Positive Rate} = (1 - (1 - \frac{1}{m})^{kn})^k \)
- Obviously, increasing \( m\) and \( k\) will decrease the false
positive rate. Of course there are always trade-offs, increasing \(
k\) increases time, and increasing \( m\) increases space. Often we
estimate the number of things we think we will store,
decide how big we want to make the table, then select our acceptable
false positive rate, and then select a k that gives us a false
positive rate under the target.
- Counting Bloom filters - allow removal, but take up more space.
Oh, by the way:
from bitarray import bitarray
bit_array = bitarray('01101')