Lab 10: Authorship Detection

You're a historian who uncovered old transcripts from an unknown author. You want to align these texts with known authors and see if you can predict how likely they are to have come from them. This is actually a real task that social scientists investigate. Some of Shakespeare's works are questioned as coming from him or not. In American Revolutionary times, people frequently published under pseudonyms to hide their identities, and we can use automated techniques to show similarities between known authors and their unattributed pieces of work. You can imagine modern day usefulness too, such as aligning anonymous social media posts with real people.

In this lab, you will process snippets from real novels with real authors, and then you will see if you can write a program to predict who wrote other unattributed snippets. For the purposes of this lab, we know who wrote all the snippets, and you will simply test against the correct answers.

* Directory Setup

Install the NLTK package to help with text processing. Add Pandas for good measure:
```
conda install nltk pandas
```

Run this command from the terminal to download and extract the lab's data files:

wget https://www.usna.edu/Users/cs/nchamber/courses/sd211/lab/l10/lab10files.tgz; tar xvf lab10files.tgz

Open short-snippets-data.tsv and understand the format. There are 3 lines with 3 passages.

Lab Overview

Our approach is to count words in each piece of text, and then compare the counts against an author's known text counts. We'll find the best matching vector of word counts and declare that to be the original author.

Shakespeare: "I do hate proud men as I do hate the engend'ring of toads."
Dickens: "The candle was burning low in the socket as he rose to his feet."
Unkown: "He was, altogether, as roystering and swaggering a young gentleman as ever stood four feet six."

Who wrote the last one? If you look at just word overlap ... we see more blue words matching Dickens ... which in this case is correct! In today's lab, you will write a program to do this word comparison automatically.

The key to your approach will be counting words! You can view each piece of text as a vector of word counts. See the counts below:

	I	do	hate	proud	men	as	the	engend'ring	of	toads	candle	was	burning	he	...	four	feet	six
Shakespeare	2	2	2	1	1	1	1	1	1	1	0	0	0	0	...	0	0	0
Dickens	0	0	0	0	0	1	1	0	0	0	1	1	1	1	...	0	1	0
Unknown	0	0	0	0	0	2	0	0	0	0	0	1	0	1	...	1	1	1

You will hold these counts in a Dictionary, of course. Each piece of text will have its own Dictionary of word counts. The keys are the words, and the values are the counts of each word. We'll use a metric to compare vectors for similarity, thus matching authors with their text.

Part 1: Read and Count (50%)

Create a file part1.py

The first step is to read passages from a file, and store each passage's word counts in an NLTK dictionary object called FreqDist.

Write a program that reads a file of text snippets (and their authors), and converts each snippet to a Dictionary (FreqDist) of word counts. We've counted words before, but today you get to use the NLTK library to clean up your text. NLTK has a function called word_tokenize(str) that handles all the splitting and punctuation for you. Here is an example of its use:

import nltk
word_list = nltk.word_tokenize("I just can't even, for real.")
print(tokens)   # Prints:  ['I', 'just', 'ca', "n't", 'even', ',', 'for', 'real', '.']

NLTK also counts words for you!

# Counts the words in your list into a FreqDist object -- which is a type of Dictionary
freqdist_counts = nltk.FreqDist(word_list)

# Prints them out in a dict-readable fashion.
print( dict(freqdist_counts) )

One thing the above is missing is ignoring stop words. We don't want 'the' and 'a' to mix up our author comparisons because of their high frequency. The NLTK library helps us out because it comes with a handy-dandy "stop words list" for English. You can access it like this:

from nltk.corpus import stopwords
nltk.download(['stopwords','punkt'])
      
# List of words which you should NOT count
stops = set(stopwords.words('english'))

You will need this stop word list in your solutions.

Requirements:

Write a function called count_words(text) that takes one string argument, and returns a FreqDist object of the word counts in the argument without any stop words.
Lowercase your strings before counting.
Use Pandas to read the TSV file. Instead of read_csv(str), use read_csv(str, sep='\t') for the TSV format.
Write a program that reads the snippet file, calls your count_words(str) on each row's text, and then prints the author/counts to the terminal for each row in the snippet file. You can hard-code 'short-snippets-data.tsv' as your file for this Part 1.

Your output when running your program on short-snippets-data.tsv should look like this but your words may be in a different order when printed.

Part 2: User Program to Match Author to User (75%)

Create a new program part2.py and import your part1 count_words function. Do not copy the count_words() function definition into this part2 program. You must import it. You might find you need to make a change to how you wrote part1.py.

Write a program that matches a user's input text to the most similar author! The user will type in a sentence, and you'll tell them which author is closest in word choice. Just to motivate you up front, here is what this part's output should look like when finished:

python3 part2.py
Passage: The night is dark and full of terrors.
Most similar: SHAKESPEARE Than death and honour. Let's to supper, come, And drown consideration. Exeunt ACT_4|SC_3 SCENE III. Alexandria. Before CLEOPATRA's palace Enter a company of soldiers FIRST SOLDIER. Brother, good night. To-morrow is the day. SECOND SOLDIER. It will determine one way. Fare you well. Heard you of nothing strange about the streets? FIRST SOLDIER. Nothing. What news? SECOND SOLDIER. Belike 'tis but a rumour. Good night to you. FIRST SOLDIER. Well, sir, good night. [They meet other soldiers] SECOND SOLDIER. Soldiers, have careful watch. FIRST SOLDIER. And you. Good night, good night. [The two companies separate and place themselves in every corner of the stage] SECOND SOLDIER. Here we. And if to-morrow Our navy thrive, I have an absolute hope Our landmen will stand up. THIRD SOLDIER. 'Tis a brave army, And full of purpose. [Music of the hautboys is under the stage] SECOND SOLDIER. Peace, what noise? THIRD SOLDIER. List, list! SECOND SOLDIER. Hark! THIRD SOLDIER. Music i' th' air. FOURTH SOLDIER. Under the earth. THIRD SOLDIER. It signs well, does it not? FOURTH SOLDIER. No. THIRD SOLDIER. Peace, I say! What should this mean? SECOND SOLDIER. 'Tis the god Hercules, whom Antony lov'd, Now leaves him. THIRD SOLDIER. Walk; let's see if other watchmen Do hear what we do. SECOND SOLDIER. How now, masters! SOLDIERS. [Speaking together] How now! How now! Do you hear this? FIRST SOLDIER. Ay; is't not strange? THIRD SOLDIER. Do you hear, masters? Do you hear? FIRST SOLDIER. Follow the noise so far as we have quarter; Let's see how it will give off. SOLDIERS. Content. 'Tis strange. Exeunt

python3 part2.py
Passage: It is not down on any map; true places never are.
Most similar: SHAW CATHERINE (relenting). Ah! (Stretches her hand affectionately across the table to squeeze his.) PETKOFF. And how have you been, my dear? CATHERINE. Oh, my usual sore throats, that's all. PETKOFF (with conviction). That comes from washing your neck every day. I've often told you so. CATHERINE. Nonsense, Paul! PETKOFF (over his coffee and cigaret). I don't believe in going too far with these modern customs. All this washing can't be good for the health: it's not natural. There was an Englishman at Phillipopolis who used to wet himself all over with cold water every morning when he got up. Disgusting! It all comes from the English: their climate makes them so dirty that they have to be perpetually washing themselves. Look at my father: he never had a bath in his life; and he lived to be ninety-eight, the healthiest man in Bulgaria. I don't mind a good wash once a week to keep up my position; but once a day is carrying the thing to a ridiculous extreme. CATHERINE. You are a barbarian at heart still, Paul. I hope you behaved yourself before all those Russian officers. PETKOFF. I did my best. I took care to let them know that we had a library. CATHERINE. Ah; but you didn't tell them that we have an electric bell in it? I have had one put up. PETKOFF. What's an electric bell? CATHERINE. You touch a button; something tinkles in the kitchen; and then Nicola comes up. PETKOFF. Why not shout for him? CATHERINE. Civilized people never shout for their servants. I've learnt that while you were away.

python3 part2.py
Passage: Some hae meat and canna eat, -- And some wad eat that want it; But we hae meat, and we can eat, Sae let the Lord be thankit.
Most similar: CONRAD chance--barring, of course, the killing him there and then, which wasn't so good, on account of unavoidable noise. But his soul was mad. Being alone in the wilderness, it had looked within itself, and, by heavens! I tell you, it had gone mad. I had--for my sins, I suppose--to go through the ordeal of looking into it myself. No eloquence could have been so withering to one's belief in mankind as his final burst of sincerity. He struggled with himself, too. I saw it,--I heard it. I saw the inconceivable mystery of a soul that knew no restraint, no faith, and no fear, yet struggling blindly with itself. I kept my head pretty well; but when I had him at last stretched on the couch, I wiped my forehead, while my legs shook under me as though I had carried half a ton on my back down that hill. And yet I had only supported him, his bony arm clasped round my neck--and he was not much heavier than a child. "When next day we left at noon, the crowd, of whose presence behind the curtain of trees I had been acutely conscious all the time, flowed out of the woods again, filled the clearing, covered the slope with a mass of naked, breathing, quivering, bronze bodies. I steamed up a bit, then swung down-stream, and two thousand eyes followed the evolutions of the splashing, thumping, fierce river-demon beating the water with its terrible tail and breathing black smoke into the air. In front of the first rank, along the river, three men, plastered with bright red earth from head to foot, strutted to and fro restlessly. When we came abreast again, they faced the river, stamped their feet, nodded their horned heads, swayed their scarlet bodies; they shook towards the fierce river-demon a bunch of black feathers, a mangy skin with a pendent tail--something that looked like a dried gourd; they shouted periodically together strings of amazing words that resembled no sounds of human language; and the deep murmurs of the crowd, interrupted suddenly, were like the response of some satanic litany.

In order to do this, change your part2 program so that instead of printing each author+dictionary, you save your authors and word counts in two lists:

# NOTE: these two lists will be the same length, right?
# ...an author name for each passage, and a dictionary for each passage.
authors = [ 'DICKENS', 'DICKENS', 'AUSTEN', 'AUSTEN', 'HAWTHORNE', ... ]
counts = [ FreqDist, FreqDist, FreqDist, FreqDist, FreqDist, ... ]

You'll read a sentence from the user, convert that to its own dictionary of counts, and then loop over all your observed passages to see which earlier text is most similar. Sound good? Read this again slowly if not.

In order to do this, we just need a mechanism to compare two Dictionaries, right? That's how we'll compare the user's text to an author's text. As we said above, you can think of these dictionaries as vectors where the cells are counts of words in English. Here is the Dickens text with the Unknown text again:

	I	do	hate	proud	men	as	the	engend'ring	of	toads	candle	was	burning	he	...	four	feet	six
Dickens	0	0	0	0	0	1	1	0	0	0	1	1	1	1	...	0	1	0
Unknown	0	0	0	0	0	2	0	0	0	0	0	1	0	1	...	1	1	1

You can see they both contain the words 'as', 'was', 'he', and 'feet'. Would you conclude that these two vectors are similar? How similar? How do we decide in a quantifiable way? We need a type of similarity metric that calculates the distance/similarity between two vectors. There are several options for this, such as Euclidean distance (dist in the image at right) or computing the cosine of the angle between them (cos in the image).

Cosine distance is commonly used because it normalizes the lengths of the vectors. The smaller the angle, the more word overlap between the texts. Cosine distance ranges from 1 (perfect word overlap) to 0 (no overlap). We are not requiring you to write this function, so instead we wrote authorlab.py for you. Copy that library to a file named authorlab.py. You just call it with two counts:

import authorlab

d1 = FreqDist(some_text)
d2 = FreqDist(another_text)
simscore = authorlab.cosine_sim(d1,d2)

If you have two FreqDist objects of word counts, you just call the cosine_sim() function as above, and it will give you the similarity score between the two.

Your task in this part is to read all texts from the file, save their FreqDist counts in a list, and then ask the user for a sentence. Find the best matching author to the user's sentence.

Requirements:

Use train-snippets-data.tsv now instead of short-snippets-data.tsv
Save all lines in the input file as Dictionaries (make a List of Dictionaries)
Ask the user for a sentence.
Find the best match against the user's input. Print it out with the author and the best passage that matched.

Part 3: Bulk Search Best Matching Authors (95%)

Copy part2.py to part3.py

Your last program will now do full text matching, not from a user's small input. Instead you will read UNKNOWN texts from a file, and print out the best matching author. It's similar to Part 2, but you'll read a file of unknown texts instead of reading one sentence from user input.

Requirements:

Load train-snippets-data.tsv as before as your main texts.
Read a filename from the user
Read each line from the file, and find the best matching author text from train-snippets-data.tsv
Print an author for each line of the test file.

Required Output: (except for the '# wrong')

Test Filename: ten-snippets-data.tsv
File contains 10 lines.
HAWTHORNE           # wrong (Eliot)
JAMES
SHAKESPEARE
HAWTHORNE           # wrong (Dickens)
AUSTEN              # wrong (Eliot)
AUSTEN
JAMES               # wrong (Twain)
CONRAD
CONRAD              # wrong (Austen)
CONRAD              # wrong (Hardy)

This rudimentary approach got 40% correct if you matched our output. That might seem poor, but there are 10 authors so random guessing is only 10%. You'll learn more advanced approaches in your future DS classes with machine learning and NLP.

Part 4: To Stop or Not To Stop? (100%)

Copy part3.py to part4.py

Your word counting uses a stop word list to remove high frequency common words. This is a common practice across many text processing tasks. However, should we do it for authorship detection when the authors span different time periods? Perhaps high frequency words change, and they would be a good indicator of authorship after all.

Change your part4.py so that it doesn't remove stop words. The count_words() function should count everything. How do the results change? You should have the same output format as Part 3, but the predicted authors may change. IMPORTANT: since we're importing count_words, you can't just remove how count_words() works because other programs depend on its current behavior. What are our options?

Create a new count function in part4.py
Change count_words() in part1.py to have a default parameter that allows you to toggle stop words on and off, but allows your other parts to work unchanged.

What to turn in

Visit the submit website and upload your FOUR programs.

submit -c=sd211 -p=lab10 part1.py part2.py part3.py part4.py