Bloom Filters, Revisited

Lab Ground Rules

Collaboration Policy

At most one partner. You can work on the lab with at most one partner. Do not help a student who is not your partner. Do not ask for help a student who is not your partner.
You can even help debug your partner's code. In this course, we allow the partners to co-work even in the level of code debugging.
Always state your partner in lab reports. Otherwise, your may endanger yourself into honor issues.

Deadline Policy

Next Friday night. Unless explicitly stated, lab deadlines will be a week from tonight (2359).
Bonus on early submissions.
- If you submit your lab work by Wednesday night (by 2359), you will get 5% bonus.
Late submission penalty.
- You can submit late by Sunday night (by 2359) with 10% penalty.

Writing a Class in Python

You can write a class using a keyword class. See the following example. Python uses the keyword self to refer to the object itself.

# ex.py 
class Toy:
  def set(self, a):
    self.data = a

You can create a member variable anytime!

In Python, we know a variable can be created anytime. The same principle applies for class member variables as well --- they can be created at any time! In the above example, when the function set is executed, a class member variable data will be created!

We can check this phenomenon by inspecting the following code. Please read the inline comments carefully.

>>> from ex import *      # import ex.py 
>>> x = Toy()             # x points to a newly created Toy object
>>> x.data                # x.data is not yet created !!!!!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Toy' object has no attribute 'data'
>>> x.set(10)             # now, x.data is created!!!!!
>>> x.data
10

Constructor

Dynamic variable creation may not always be cool. You may want to create all class member variables when an instance is created. You can do this in the constructor member function __init__(self)

# ex2.py
class Toy:
  def __init__(self):
    self.data = 0

  def set(self, a):
    self.data = a

Now, no error takes place in the following code:

>>> from ex2 import *
>>> x = Toy()
>>> x.data
0
>>> x.set(10)
>>> x.data
10

Inheritance

A class can be inherited. See the following code for syntax.

# fruit.py
class Fruit:
  def __init__(self):
    self.fruitType = 'fruit'    

  def isA(self):
    return self.fruitType

class Berry(Fruit):	####### Berry inherits Fruit
  def __init__(self):
    self.fruitType = 'berry'

  def print_type(self):
    print(self.fruitType)

The following code tests the above code.

>>> from fruit import *
>>> f = Fruit()
>>> f.isA()
'fruit'
>>> b = Berry()
>>> b.isA()
'berry'
>>> b.print_type()
berry
>>> f.print_type()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Fruit' object has no attribute 'print_type'

[10pts] Part 1: Setting BF Parameters

The false positive rate p of a BF is computed as follows: \[ p = (1 - e^{- \frac{kn} m})^k \] In the above:

n : the maximum number of items that a BF would contain.
m : the number of bits in the bit array B.
k : the number of hash functions associated with the BF.
p : the false positive ratio.

Fix k = 8 and m = 16n

To make our lives simpler in this lab, we simply fix \(k=8\) and \(m = 16n\). Then, we have \[ p = (1 - e^{-kn/m})^k = (1 - e^{-1/2})^{8} \approx 0.0006\] We will be happy with 0.06% false positive rate. Note that the formula m = 16n impies that the BF will spend 16 bits (i.e., 2 bytes) for each inserted item.

Your task

Write your code bf.py so that the above code works correctly. In particular,

Write a class with name BF.
Write a constructor of class BF that sets parameters correctly. The input to the constructor is \(n\), that is, the maximum number of items that the filter would contain.
The bit array B should be represented with a Python bytearray object. Note a cell of a bytearray object is an 8-bit number (0 - 255). So, storing 80 bits needs 10 bytes.


>>> from bf import *
>>> filter = BF(5)
>>> filter.n
5
>>> filter.m
80
>>> filter.k
8
>>> type(filter.B)
<class 'bytearray'>
>>> len(filter.B)
10
>>> filter.B
bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')


>>> from bf import *
>>> filter = BF(13)
>>> filter.n
13
>>> filter.m
208
>>> filter.k
8
>>> filter.B
bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

[30pts] Part 2: Bit Manipulations

We will extend what we did in the the first homework to work for any arbitrary bytearray object. In particular, write global functions appropriately in bf.py so that the your code works as follows.

Inspect the sample run carefully so you understand what you should do.

Tip: Each cell of a bytearray object is a 8-bit number (0 - 255).

The functions set_one_at(), set_zero_at(), get_at() must be implemented to take O(1) time.

Think about accessing the right element. Use the right index in the bytearray and read/write only that element.


>>> import bf                       ## you need these three lines 
>>> from bf import *                ## for constant reloading
>>> from importlib import reload    

>>> B = bytearray([0]*2)
>>> bprint(B)                       ## write function bprint 
0000000000000000

>>> reload(bf)                      ## whenever you modify your bf.py 
>>> from bf import *                ## you can reload the new code 
>>> hprint(B)                       ## write function hprint  
00 00

>>> reload(bf)
>>> from bf import *
>>> set_one_at(B, 0)                ## write function set_one_at
>>> bprint(B)
1000000000000000
>>> set_one_at(B, 14)
>>> bprint(B)
1000000000000010

>>> reload(bf)
>>> from bf import *
>>> set_zero_at(B, 0)               ## write funtion set_zero_at
>>> bprint(B)
0000000000000010
>>> hprint(B)
00 02


>>> reload(bf)
>>> from bf import *                ## write function get_at that returns an integer
>>> t = get_at(B, 14)
>>> t
1
>>> type(t)
<class 'int'>
>>> get_at(B, 13)
0

[25pts] Part 3: Hashes

Use SHA256

We will use a cryptographic hash function sha256 to compute the hashes.

A BF needs multiple hashes.

Note we need to have k = 8 hash values. Since cryptographic hash is a little bit expensive operation, we will just make a single call. Fortuntely, there are many random bits we can use, so we will partition 256 bits into 8 parts (i.e., each with 32 bits).

In particular,

Given a data item, compute a sha256 hash.
Split the 256 bits into 8 chunks so that each chunk is 4 bytes (i.e., 32 bits) long.
Convert each 4-byte chunk into a number.
Usually, the converted numbers are too big to be used as an index for the BF B. It should be between 0 and m-1. Well, take mod m to make it fit.

Your Task: Converting a 4-byte object into a number: big-endian

Given four bytes [a, b, c, d], write a function that computes \[ a \cdot 256^3 + b \cdot 256^2 + c \cdot 256^1 + d \cdot 1\] This way of conversion is said to be big-endian. As a side note, the opposite way (little-endian) also exists (i.e, a + b*255 + c*255^2 + d*255^3). Write a conv function in bf.py that works as follows:


>>> from bf import *
>>> conv(bytearray([0,0,0,7]))
7
>>> conv(bytearray([0,0,1,0]))
256
>>> conv(bytearray([0,1,0,0]))
65536
>>> conv(bytearray([0,1,0]))
too small input!
>>> conv(b"1234")
825373492

Your Task: Hash functions in BF

Now, we are ready to write the hash functions in BF. As we mentioned before, we first compute sha256 hash of a given data item, then covert each 4-byte chunk into an integer.


>>> filter = BF(5)
>>> filter.hashes(b"hello")
[754077114, 1605411598, 652753706, 3317293726, 454434396, 531055198, 1929655138, 2475399204]

Note each number is too big, since B contains 80 bits (i.e., m = 80). So we take mod m. Finally,


>>> from bf import *
>>> filter = BF(5)
>>> filter.hashes(b"hello")
[74, 78, 26, 46, 76, 78, 18, 4]

[10pts] Part 4: Finishing Up

Finally, we are ready to write two BF functions add and check!


>>> from bf import *
>>> filter = BF(5)
>>> filter.add(b"hello")
>>> bprint(filter.B)
00001000000000000010000000100000000000000000001000000000000000000000000000101010

>>> filter.check(b"hello")
True

>>> filter.check(b"Hello")
False

>>> filter.check(b"world")
False

>>> filter.add(b"world")
>>> bprint(filter.B)
00001000000000000010001000100001001000000000001100000000100000000000000100101010
>>> hprint(filter.B)
08 00 22 21 20 03 00 80 01 2a

>>> filter.check(b"world")
True
>>> filter.check(b"Hello")
False

[10pts] Part 5: Blacklisting

Let's see if your code works well by testing the blacklist scenario given in IT430.

The file containing the suspicious domains: suspicious.txt.
The file containing the observed domains: observed.txt.

Check out the following code. We hope the code test_bf.py makes sense.


# test_bf.py

from bf import *

blacklist = open("suspicious.txt").readlines()

# number of items in the blacklist
n = len(blacklist)

#======================================================================
# Creating a BF from the blacklist 
# prepare a BF
filter = BF(n)

for url in blacklist:
  url = url.strip()    # remove white space
  if not url or url[0] == '#':    # we need to ignore empty strings or comments
    continue 
  url =  url.encode() # Call encode() to convert string into bytes  
  filter.add(url)  

# Let's check how big is the size of BF B
print("BF size is", len(filter.B), "bytes!")


#======================================================================
# Inspect the observed domains based on BF
observed = open("observed.txt").readlines()
for url in observed:
  url = url.strip()
  if not url or url[0] == '#': 
    continue 
  url = url.encode() 
  if filter.check( url ):  # Now call check function
      print( url )

Your code should have the following results:

$ wc suspicious.txt
 1870  1940 50164 suspicious.txt
$ python3 test_bf.py
BF size is 3740 bytes!
b'gsebqsi.ru'
b'pagaldaily.com'
b'4kqd3hmqgptupi3p.nameuser.site'
b'4kqd3hmqgptupi3p.chargecar.vip'
b'celticno1.dish'
b'cerberhhyed5frqa.fkri48.win'
b'4kqd3hmqgptupi3p.k7oud1.top'
b'teamoxsiempre.android'

It's quite amazing that you need only less than 4KB to store 50K data.

Accuray and Space Efficiency of BF

You can find the actual matches using the following command:

sort suspicious.txt > sorted_sus.txt
sort observed.txt > sorted_obs.txt
comm -12 sorted_sus.txt sorted_obs.txt

The first sort comannd will sort the lines in suspicoius.txt and store them in sorted_sus.txt (using the redirection).
Likewise, you will have a sorted observation file from the second command.
The comm command will compare the two files.
- -1 option will supress printing the lines only appearing in sorted_sus.txt.
- -2 option will supress printing the lines only appearing in sorted_obs.txt.
- Therefore, comm will print out only the lines appearing in both files.

The output will be as follows:

#
#
#
#
#
#
#
#
4kqd3hmqgptupi3p.chargecar.vip
4kqd3hmqgptupi3p.k7oud1.top
4kqd3hmqgptupi3p.nameuser.site
cerberhhyed5frqa.fkri48.win
gsebqsi.ru
pagaldaily.com

There are roughly 3700 domains in the observed list. We have two false positive cases: celticno1.dish and teamoxsiempre.android in the BF output. Note that 2/3700 is roughly 0.00054, which is quite consistent with the false positive parameter.

[15pts] Writing a Lab Report

Please explain and show your work. Write a lab report by using the provided template (check the lab ground rules). The writing quality of the lab report matters.

Submit bf.py as shown in the submit command below.
For each part, explain your ideas and work in the lab report.

~/bin/submit -c=IT432 -p=lab01 bf.py lab01_report.docx