- Due before 23:59 Wednesday, October 5
- Submit using the submit program
301 lab 06.
Communities in America (unfortunately) have a long history of banning
controversial books. For example, this tarball
contains the full text of several often-banned books which now lie in the
public domain. (Note: If you download this file and then run
tar xzvf books.tgz, the book contents will be extracted
into a directory called "books".)
Books still get banned, but our online lives have caused our public speech to be both much more plentiful, and much more closely watched. For example, here is a list of words and phrases, revealed by a Freedom of Information Act request in 2011, which was used by the Department of Homeland Security to flag suspicious online text. I wonder which of our frequently-banned books most use words and phrases from this list?
This lab proceeds in multiple parts, and the first parts are by far the most important and the things you absolutely have to complete. Make sure you follow the specifications so your programs work correctly. You should turn in a solution for every portion you complete. In other words, if you complete part 4, I should get from you a bst.py, a one_word.py, a phrases.py, AND a search_all.py.
We're going to need a way of identifying if a phrase from our book appears
on the list. We could load up a list with the phrases, and then use
if bookphrase in badWordList:, but that would be \(O(n)\) for each
search - it will take too long with so many words in the list and in each book!
Instead, let's take advantage of the fact that strings are comparable
(meaning < and > works, based on alphabetical order), in order to
arrange our phrases into a Binary Search Tree.
bst.py, which contains a
Nodeclass and a
TreeSetclass. The Node class should just contain a constructor, which accepts only the data for the Node. The TreeSet class should have:
- a constructor (which takes no arguments other than
- a method
insert(self,phrase), which adds
phraseto the tree, and
- a method called
__contains__(self,phrase), which returns true or false, based on if
phraseappears in the tree.
If you need helper methods to make these work, that is fine, but these should be the only public methods.
Some of you probably recognize that the double-underscores means
__contains__ method has a special purpose, and you'd be right.
in operator. By this, I mean that if you have an object
myTreeSet, and you call
someString in myTreeSet, it will
call your __contains__ method just like you wrote
We'll see more operator overloading in the next lab.
Create a program called
which imports your classes from
bst.py, and fills a tree with each
line from a file called
badwords.txt (you should probably download the one linked above!).
Many of these lines are
multiple words long; that is fine, add the whole phrase as a single data point
to the tree.
Your program should then take a file name from the command-line arguments, and search for each individual word from the book in the tree of bad words; if it appears, it should print the word. At the end, it should print the number of words that appeared.
$ python3 one_word.py books/huckleberryFinn.txt gang pork exercise sick shooting (...lots more...) who facility help Count: 362
Doing this will require that you strip all punctuation from your words from the books, and make them lower-case. Google can probably help you figure out how to do that!
(In case it's helpful, here is a list of all the punctuation characters that
appear in the sample book files:
phrases.pywhich does all of part 2, as well as searching for phrases. The longest phrases in the list of bad words are five words long.
$ python3 phrases.py books/ulysses.txt shooting china who swine who (...more...) foot and mouth (...more...) secret service (...more...) who facility help Count: 1289
The rest of this lab is optional for your enrichment and enjoyment. Be sure to work on this only after getting everything else perfect. You should still submit this work, but make sure it doesn't break any of the parts you already had working correctly!
Create a program called
which displays the number of offending phrases in each of several command-line
arguments. The filenames on the command-line should be ordered according to the
"offense ratio", which is the number of bad phrases divided by the total number of words
in the book. List the most offensive book first.
$ python3 rank.py books/*.txt books/uncleTomsCabin.txt: 936/191115 books/ulysses.txt: 1289/272816 books/throughTheLookingGlass.txt: 104/33924 books/huckleberryFinn.txt: 366/120070 books/originOfSpecies.txt: 265/160071