Project 1: Tweet Search
Automated ways of searching text are now used across the Web, within institutions, and across critical services like the healthcare industry. This project will have you write a basic search tool to filter large amounts of textual data in a fraction of a second. It would take a human days to do the same thing by hand.
We are providing you with almost 200 thousand actual public tweets from Twitter. Your program will read them all and provide a search interface. Dr. Chambers has a continual feed from Twitter that downloads millions of tweets every day, so this is just a fraction of a fraction of the data that we have here in the CS department. We use it to do fundamental research in artificial intelligence and information extraction, such as finding correlations with presidential approval polls (some students in the classes ahead of you have conducted such research). This short project will let you have some fun poking around the data.
Due date: Feb 14, 1700, submit with the normal submit script.
Honor: see course policy. You may not discuss any aspect of this project with any other person, nor help anyone else with it.
Step 1 (50 pts)
Input to your project's code will come from a single text file. The file contains one tweet per line. Each line contains a single tweet's 3 data fields (tweet, username, date) separated by tab characters. You will write code for this Step 1 that reads this file, creates Tweet objects, and stores them in an array. Here is an example of a line in the file:
omg i can't believe you said that!!! #whoknows stargazerz 2013-03-20
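As a minimal sketch of how one line could be broken into its three fields, the snippet below splits on the tab character. The names SplitLineDemo and splitFields are made up for illustration, and the code assumes every line is well formed (exactly two tabs).

```java
import java.util.Arrays;

class SplitLineDemo {
    // Splits one input line into its tab-separated fields:
    // [0] tweet text, [1] username, [2] date.
    // Assumes the line is well formed (exactly two tab characters).
    static String[] splitFields(String line) {
        return line.split("\t");
    }

    public static void main(String[] args) {
        String line = "omg i can't believe you said that!!! #whoknows\tstargazerz\t2013-03-20";
        String[] fields = splitFields(line);
        System.out.println(Arrays.toString(fields));
    }
}
```

Spaces inside the tweet text are left alone because only the tab character acts as a separator.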
Create a Tweet class.
You must write a class definition for a Tweet object that stores 5 pieces of data:
- String: the text of the tweet
- String: the username of the person who wrote it
- int: the year the tweet was sent
- int: the month the tweet was sent
- int: the day of the month the tweet was sent
All tweet variables must be private to avoid other classes tampering with them by accident. Since the variables are private, you'll need to write some get methods to return their values to whoever will want to read them: getText(), getYear(), getMonth(), etc.
Make a constructor that uses this exact prototype definition:
public Tweet(String newtext, String newuser, String newdate);
Remember that the constructor's sole job is to initialize the 5 variables in the Tweet object. You will use these three parameters to fill in all 5 member variables.
You will need to split that String date into three int values. How do we do that in Java? You already know about the .split() method to break it up into 3 parts. This will result in a String for the year like "2013". To convert this to an int, Java provides a static method that you can call:
Integer.parseInt("2013") ==> 2013
Finally, create a member method in Tweet: String toString(). This method returns a String that represents the Tweet's data. The method does not print anything; it simply builds and returns a String. What should the string look like? Follow this format (use a tab character to separate the 3 output fields):
tweet text goes first [usernameInBrackets] 1/30/2013
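Putting the pieces of this section together, a Tweet class might look roughly like the sketch below. It is one possible layout, not the required solution: the field names, getUser(), getDay(), and the TweetDemo class are assumptions, and the date is assumed to arrive as year-month-day with no stray whitespace.

```java
class Tweet {
    private String text;
    private String user;
    private int year;
    private int month;
    private int day;

    // newdate is assumed to look like "2013-03-20" (year-month-day).
    public Tweet(String newtext, String newuser, String newdate) {
        text = newtext;
        user = newuser;
        String[] parts = newdate.split("-");
        year = Integer.parseInt(parts[0]);
        month = Integer.parseInt(parts[1]);  // "03" parses to 3
        day = Integer.parseInt(parts[2]);
    }

    public String getText() { return text; }
    public String getUser() { return user; }
    public int getYear() { return year; }
    public int getMonth() { return month; }
    public int getDay() { return day; }

    // Builds "text<TAB>[user]<TAB>month/day/year" and returns it; nothing is printed.
    public String toString() {
        return text + "\t[" + user + "]\t" + month + "/" + day + "/" + year;
    }
}

class TweetDemo {
    public static void main(String[] args) {
        Tweet t = new Tweet("tweet text goes first", "usernameInBrackets", "2013-01-30");
        System.out.println(t);  // println calls toString() for us
    }
}
```

Because the date is stored as ints, leading zeros in the input ("01") disappear in the output ("1").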
Create your main program.
Create Search.java with a main() method. Your main method will get the file path from the command line, and then call another method readFile(path). You must write readFile: make it a static method inside Search.java that reads the text file of tweets, line by line: Tweet[] readFile(String path).
It must construct and return an array of Tweets.
To use an array, you need to initialize it to a certain size!
We will hardcode the size for this step: size 33 (this is how many tweets are in the file sometweets.txt).
Then write code to open the file, read its lines, and fill the Tweet array. How do we read files in Java? See the below HOWTO box in yellow.
In main(), call readFile(path), save the returned Tweet array. Print the size of the array, and then write a for loop to print all the tweets. Make sure this works before moving on!
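One way to read a file line by line in Java is with a BufferedReader, sketched below together with a main() that prints the array. Treat it as a starting point under some assumptions: the SIZE constant and the usage message are made up for illustration, every line is assumed to have all three fields, and the compact Tweet class here only stands in for your own.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Compact stand-in for your own Tweet class.
class Tweet {
    private String text;
    private String user;
    private int year;
    private int month;
    private int day;

    public Tweet(String newtext, String newuser, String newdate) {
        text = newtext;
        user = newuser;
        String[] parts = newdate.split("-");
        year = Integer.parseInt(parts[0]);
        month = Integer.parseInt(parts[1]);
        day = Integer.parseInt(parts[2]);
    }

    public String toString() {
        return text + "\t[" + user + "]\t" + month + "/" + day + "/" + year;
    }
}

class Search {
    static final int SIZE = 33;  // hardcoded for this step (hypothetical constant name)

    // Reads up to SIZE lines from the file and builds one Tweet per line.
    static Tweet[] readFile(String path) throws IOException {
        Tweet[] tweets = new Tweet[SIZE];
        int count = 0;
        // try-with-resources closes the file even if reading fails
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while (count < tweets.length && (line = in.readLine()) != null) {
                String[] fields = line.split("\t");
                tweets[count] = new Tweet(fields[0], fields[1], fields[2]);
                count++;
            }
        }
        return tweets;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java Search <file>");
            return;
        }
        Tweet[] tweets = readFile(args[0]);
        System.out.println("Array size: " + tweets.length);
        for (int i = 0; i < tweets.length; i++) {
            System.out.println(tweets[i]);
        }
    }
}
```

If the file has fewer lines than SIZE, the remaining array slots stay null, which is one reason a fixed-size array is awkward here.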
Make sure your running program looks exactly like the following. Note that the date format is different from the input. The year is listed last, and all fields are separated by slashes:
> java Search sometweets.txt
Array size: 33
i kicked daniels knee [st0rmcl0aks] 8/11/2013
rt @jvson_: i think it's horrible that people feel embarrassed to take btec now because of how it's mocked on here [deebaybiie] 8/11/2013
@ally_b237 asehh..your bus has not yet touch down?? [wiz_ked] 8/12/2013
...
...
rt @coralynencs: @alexxmathias mdddddddddddddr [alexxmathias] 8/12/2013
STOP: save your progress!
Create a subdirectory called step1/ inside your current Project1/ directory. Copy all .java files into it, and nothing else. You will do this at each step.
Step 2 (20 pts): Use a Queue
The previous step reads the file into an array of known size. This is not ideal because we want to be able to read files of arbitrary size. It is also not ideal because arrays are difficult to resize and expensive to duplicate. This step replaces the Tweet array with a Queue.
Reuse Your Old Queue and Node.
Now that you have a Tweet object to easily store tweets, you will use a Queue to store those Tweet objects. In fact, you should reuse your code from the past lab! Previously our Queue used Nodes, and each Node contained a single String. The only thing changing here is that the data is now a Tweet instead of a String.
- Modify Node.java to store a Tweet object.
- Move your static pop/push/peek methods to be member methods in Queue. Queue should still operate over Node objects! Change those methods to work with Tweet objects instead of String objects. Both pop/peek should return a Tweet. None of these should be static methods.
After you've updated your Queue to work with Tweets, you will want to add a few other methods that will be useful in your main program below:
- Create a member method in Queue: void printAll(). This should loop over the entire queue and call Tweet's toString() method above to print the tweets out. Don't alter the queue itself! Just print every tweet in order!
- Create a member method in Queue: int currentSize(). This should return how many Nodes are currently in the Queue. It is up to you to decide how best to implement this.
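To make the changes above concrete, here is a rough sketch of a linked Queue of Tweets with member methods. Your lab code may be organized differently; the field names front, back, and size, and the choice to return null from an empty queue, are assumptions made for this example. The compact Tweet class only stands in for your own.

```java
// Compact stand-in for the Step 1 Tweet class.
class Tweet {
    private String text;
    private String user;
    private int year;
    private int month;
    private int day;

    public Tweet(String newtext, String newuser, String newdate) {
        text = newtext;
        user = newuser;
        String[] parts = newdate.split("-");
        year = Integer.parseInt(parts[0]);
        month = Integer.parseInt(parts[1]);
        day = Integer.parseInt(parts[2]);
    }

    public String toString() {
        return text + "\t[" + user + "]\t" + month + "/" + day + "/" + year;
    }
}

class Node {
    Tweet data;  // was a String in the lab version
    Node next;
    Node(Tweet data) { this.data = data; }
}

class Queue {
    private Node front;  // next node to be popped
    private Node back;   // most recently pushed node
    private int size;    // kept up to date by push and pop

    // Adds a tweet at the back of the queue.
    public void push(Tweet t) {
        Node n = new Node(t);
        if (back == null) {
            front = n;
        } else {
            back.next = n;
        }
        back = n;
        size++;
    }

    // Removes and returns the front tweet, or null if the queue is empty.
    public Tweet pop() {
        if (front == null) return null;
        Tweet t = front.data;
        front = front.next;
        if (front == null) back = null;
        size--;
        return t;
    }

    // Returns the front tweet without removing it, or null if empty.
    public Tweet peek() {
        return front == null ? null : front.data;
    }

    // Prints every tweet in order using a local cursor; the queue is unchanged.
    public void printAll() {
        for (Node cur = front; cur != null; cur = cur.next) {
            System.out.println(cur.data);
        }
    }

    public int currentSize() { return size; }
}

class QueueDemo {
    public static void main(String[] args) {
        Queue q = new Queue();
        q.push(new Tweet("i kicked daniels knee", "st0rmcl0aks", "2013-08-11"));
        q.push(new Tweet("a dream is a wish your heart makes.", "emilyy_gant", "2013-08-12"));
        System.out.println("Queue size: " + q.currentSize());
        q.printAll();
    }
}
```

Keeping a size counter makes currentSize() constant time; counting nodes in a loop on each call would also work.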
Update Search.java with the Queue.
Now update main() and readFile() to build a Queue of Tweets instead of filling an array of Tweets.
> java Search sometweets.txt
Queue size: 33
i kicked daniels knee [st0rmcl0aks] 8/11/2013
rt @jvson_: i think it's horrible that people feel embarrassed to take btec now because of how it's mocked on here [deebaybiie] 8/11/2013
@ally_b237 asehh..your bus has not yet touch down?? [wiz_ked] 8/12/2013
...
...
rt @coralynencs: @alexxmathias mdddddddddddddr [alexxmathias] 8/12/2013
STOP: save your progress!
Create a subdirectory called step2/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.
Step 3 (15 pts): Keyword Filter
Now we add search functionality! The user will enter search words, and you'll create a new queue with only matching tweets. You will change main() to allow for user input, as in previous labs and homeworks. You will prompt the user with a question mark "? ", and the user will type single search query words.
When the user enters a search word, you must use your current queue to create a brand new queue. This new queue will contain only those tweets that contain the search keyword. The trick here is that you are not modifying the original queue of tweets. You are creating a new one with its own Node objects. However, these new Node objects will point to the same Tweet objects in the original queue! You will "share" them in memory. See the picture on the right.
Allow for one user special input, "!dump". If the user types this, then you must print out your entire current queue to the screen. Print all the tweets.
To achieve this end, implement the following:
- Write a member method in Tweet: boolean containsKeyword(String keyword).
This returns true if the Tweet's text contains the given word. You may find the String method indexOf(String) helpful. See the javadocs for more.
Also: you want to compare your keyword to a lowercased version of the tweet's text. Use the String's .toLowerCase() method before comparing.
- Write a member method in Queue: Queue filterForKeyword(String keyword)
This should create a new queue with all matching tweets, and return the new queue.
IMPORTANT: your original Queue object should not be changed!
Think about how to traverse the queue without moving the front/back pointers.
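The two methods above could be sketched as follows. This is only one way to do it: it assumes the keyword is already typed in lowercase, and the compact Tweet, Node, and Queue classes are stand-ins for your own. The key idea is that the loop walks the list with a local variable, so front and back never move, and each match gets a new Node that points at the same Tweet object.

```java
// Compact stand-in for the Step 1 Tweet class, plus containsKeyword.
class Tweet {
    private String text;
    private String user;
    private int year;
    private int month;
    private int day;

    public Tweet(String newtext, String newuser, String newdate) {
        text = newtext;
        user = newuser;
        String[] parts = newdate.split("-");
        year = Integer.parseInt(parts[0]);
        month = Integer.parseInt(parts[1]);
        day = Integer.parseInt(parts[2]);
    }

    // True if the lowercased tweet text contains the keyword.
    // The keyword itself is assumed to be lowercase already.
    public boolean containsKeyword(String keyword) {
        return text.toLowerCase().indexOf(keyword) >= 0;
    }
}

class Node {
    Tweet data;
    Node next;
    Node(Tweet data) { this.data = data; }
}

class Queue {
    private Node front;
    private Node back;
    private int size;

    public void push(Tweet t) {
        Node n = new Node(t);
        if (back == null) {
            front = n;
        } else {
            back.next = n;
        }
        back = n;
        size++;
    }

    public Tweet peek() { return front == null ? null : front.data; }

    public int currentSize() { return size; }

    // Builds a new queue of matching tweets; this queue is not modified.
    public Queue filterForKeyword(String keyword) {
        Queue result = new Queue();
        for (Node cur = front; cur != null; cur = cur.next) {
            if (cur.data.containsKeyword(keyword)) {
                result.push(cur.data);  // new Node, same shared Tweet object
            }
        }
        return result;
    }
}

class FilterDemo {
    public static void main(String[] args) {
        Queue all = new Queue();
        all.push(new Tweet("i kicked daniels knee", "st0rmcl0aks", "2013-08-11"));
        all.push(new Tweet("A dream is a wish YOUR heart makes.", "emilyy_gant", "2013-08-12"));
        all.push(new Tweet("who do you have", "ashleyyymariek", "2013-08-12"));
        Queue hits = all.filterForKeyword("you");
        System.out.println("Queue size: " + hits.currentSize());
        System.out.println("Original size: " + all.currentSize());
    }
}
```

In the demo the second tweet matches "you" only because its text is lowercased before the comparison.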
Alter main() to prompt and allow for user input as described above. The end result should follow this output exactly:
> java Search sometweets.txt
Queue size: 33
? you
Queue size: 7
? !dump
@ally_b237 asehh..your bus has not yet touch down?? [wiz_ked] 8/12/2013
@_xratedxbeauty lol how far is you? [ayoo_imbadx] 8/12/2013
a dream is a wish your heart makes. [emilyy_gant] 8/12/2013
rt @hayescrazed_xo: @hayniacs2327 you're welcome. you're welcome. you're welcome. [hayniacs2327] 8/12/2013
@blackieechannn who do you have [ashleyyymariek] 8/12/2013
you can only know the time you go to bed, but you can never know the time you sleep.. rt if u agree" [tolu1786] 8/12/2013
@stephanieirvine remember that time i converted you to fnl? [wingster55] 8/12/2013
Queue size: 7
? dream
Queue size: 1
? !dump
a dream is a wish your heart makes. [emilyy_gant] 8/12/2013
Queue size: 1
?
Now you're ready to try the big boy file:
> java Search alltweets.txt
Queue size: 188671
? happy
Queue size: 1647
? everyone
Queue size: 19
? tired
Queue size: 1
? !dump
RT @enahzxo_: I'm so tired of trying to make everyone else happy. [_TimiciaAri] 8/16/2013
Queue size: 1
?
And one more for fun:
> java Search alltweets.txt
Queue size: 188671
? navy
Queue size: 42
? rihanna
Queue size: 10
? cheer
Queue size: 1
? !dump
RT @chaneIrihanna: cheer up navy #mtvhottest Rihanna [Fentyisdahottes] 8/14/2013
Queue size: 1
?
STOP: save your progress!
Create a subdirectory called step3/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.
Step 4 (5 pts): Non-Keyword Filter
This step allows users to enter negated keywords. This means that they can enter a word, and you remove every tweet that contains it. The end result is a list of tweets, none of which include the keyword.
Change your program to allow for this second type of query. The user input will be preceded by a minus sign, such as "-sad". You will then create a new queue from the current queue, but this time only keep those tweets that do not have the given word (e.g., tweets without "sad" in them).
- Write a member method in Queue: Queue filterForNotKeyword(String keyword)
- Change main() to allow for keywords that start with a minus sign: "-happy"
> java Search tweets.txt
Queue size: 33
? the
Queue size: 8
? -a
Queue size: 1
? !dump
under the influence of music. [zeandercarter] 8/12/2013
Queue size: 1
?
Here is one on the big file:
> java Search alltweets.txt
Queue size: 188671
? happy
Queue size: 1647
? world
Queue size: 31
? -birthday
Queue size: 26
? :)
Queue size: 3
? !dump
Good morning :)"@ratihibrahim: Good morning world, good morning good people, good morning happy sunday...." [murty_pane] 8/17/2013
RT @tcookin: Happy World Photography Day guys! Keep travelling and keep clicking! :) [vikrant7985] 8/19/2013
@Real_Liam_Payne If you'restill online and you read this please follow me?:)x You would make me SOhappy<3ILY so muchx You guys are myworldxX [MelissaTweets13] 8/20/2013
Queue size: 3
STOP: save your progress!
Create a subdirectory called step4/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.
Step 5 (5 pts): Date Filter
Now we will add a filter based on the day that the tweet was tweeted. The user can type in a date, and you must then create a new queue from the current queue that only keeps tweets that occurred on the given day. Days will be entered with a plus sign (+) in front of them, in the format +year-month-day, for example +2013-8-11.
To accomplish this, write a member method in Queue: Queue filterForDate(String date)
This filter should behave like the previous steps, but this time it only keeps the Tweets that occurred on the given day. You'll obviously have to split the String date up in your method, and compare it to the Tweet object's int fields.
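A date filter along these lines might look like the sketch below. It assumes main() has already stripped the leading plus sign, so the method receives something like "2013-8-11", and it uses getYear(), getMonth(), and a getDay() getter whose name is a guess. The compact classes are stand-ins for your own.

```java
// Compact stand-in for the Step 1 Tweet class.
class Tweet {
    private String text;
    private String user;
    private int year;
    private int month;
    private int day;

    public Tweet(String newtext, String newuser, String newdate) {
        text = newtext;
        user = newuser;
        String[] parts = newdate.split("-");
        year = Integer.parseInt(parts[0]);
        month = Integer.parseInt(parts[1]);
        day = Integer.parseInt(parts[2]);
    }

    public int getYear() { return year; }
    public int getMonth() { return month; }
    public int getDay() { return day; }
}

class Node {
    Tweet data;
    Node next;
    Node(Tweet data) { this.data = data; }
}

class Queue {
    private Node front;
    private Node back;
    private int size;

    public void push(Tweet t) {
        Node n = new Node(t);
        if (back == null) {
            front = n;
        } else {
            back.next = n;
        }
        back = n;
        size++;
    }

    public int currentSize() { return size; }

    // date is assumed to look like "2013-8-11" (no leading plus sign).
    public Queue filterForDate(String date) {
        String[] parts = date.split("-");
        int y = Integer.parseInt(parts[0]);
        int m = Integer.parseInt(parts[1]);
        int d = Integer.parseInt(parts[2]);

        Queue result = new Queue();
        for (Node cur = front; cur != null; cur = cur.next) {
            Tweet t = cur.data;
            // Compare ints, not Strings, so "8" and "08" mean the same month.
            if (t.getYear() == y && t.getMonth() == m && t.getDay() == d) {
                result.push(t);
            }
        }
        return result;
    }
}

class DateFilterDemo {
    public static void main(String[] args) {
        Queue all = new Queue();
        all.push(new Tweet("i kicked daniels knee", "st0rmcl0aks", "2013-08-11"));
        all.push(new Tweet("who do you have", "ashleyyymariek", "2013-08-12"));
        String input = "+2013-8-11";
        Queue hits = all.filterForDate(input.substring(1));  // drop the '+'
        System.out.println("Queue size: " + hits.currentSize());
    }
}
```

Comparing the parsed ints is what lets the query "2013-8-11" match a tweet whose file date was written "2013-08-11".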
> java Search sometweets.txt
Queue size: 33
? +2013-8-11
Queue size: 2
?
> java Search alltweets.txt
Queue size: 188671
? omg
Queue size: 1089
? +2013-8-17
Queue size: 102
?
STOP: save your progress!
Create a subdirectory called step5/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.
Step 6 (5 pts): Reset
Finally, add !reset and !quit options. The !quit option terminates the program. The !reset option lets the user start back at the original queue. Ignore everything that has been searched so far, and begin over again. Do NOT re-read the file from disk. You should always keep the original queue around, and none of your methods should have modified it if you did the above steps correctly.
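The reset idea can be sketched with two references: one that always points at the original queue, and one that points at whatever queue the latest filter produced. Everything below is illustrative only. ResetDemo, handle(), and size() are made-up names, only the keyword filter is wired in, and the Tweet stand-in takes a single String instead of the three your real constructor requires.

```java
import java.util.Scanner;

// Text-only stand-in; the real Tweet constructor takes three Strings.
class Tweet {
    private String text;
    public Tweet(String newtext) { text = newtext; }
    public boolean containsKeyword(String keyword) {
        return text.toLowerCase().indexOf(keyword) >= 0;
    }
}

class Node {
    Tweet data;
    Node next;
    Node(Tweet data) { this.data = data; }
}

class Queue {
    private Node front;
    private Node back;
    private int size;

    public void push(Tweet t) {
        Node n = new Node(t);
        if (back == null) {
            front = n;
        } else {
            back.next = n;
        }
        back = n;
        size++;
    }

    public int currentSize() { return size; }

    public Queue filterForKeyword(String keyword) {
        Queue result = new Queue();
        for (Node cur = front; cur != null; cur = cur.next) {
            if (cur.data.containsKeyword(keyword)) result.push(cur.data);
        }
        return result;
    }
}

class ResetDemo {
    private final Queue original;  // built once from the file, never modified
    private Queue current;         // result of the most recent filter

    ResetDemo(Queue original) {
        this.original = original;
        this.current = original;
    }

    // Handles one line of user input; returns false when the user quits.
    boolean handle(String input) {
        if (input.equals("!quit")) {
            System.out.println("Goodbye!");
            return false;
        }
        if (input.equals("!reset")) {
            current = original;  // no file re-read, just point back at the start
        } else {
            current = current.filterForKeyword(input);
        }
        System.out.println("Queue size: " + current.currentSize());
        return true;
    }

    int size() { return current.currentSize(); }

    public static void main(String[] args) {
        Queue q = new Queue();
        q.push(new Tweet("go army"));
        q.push(new Tweet("go navy"));
        q.push(new Tweet("beat army"));
        ResetDemo demo = new ResetDemo(q);
        System.out.println("Queue size: " + q.currentSize());
        Scanner in = new Scanner(System.in);
        System.out.print("? ");
        while (in.hasNextLine() && demo.handle(in.nextLine().trim())) {
            System.out.print("? ");
        }
    }
}
```

This only works because the filter methods build new queues; if any of them popped from the original, !reset would have nothing to go back to.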
> java Search alltweets.txt
Queue size: 188671
? army
Queue size: 103
? !reset
Queue size: 188671
? navy
Queue size: 42
? !quit
Goodbye!
STOP: save your progress!
Create a subdirectory called step6/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.
STOP: you're finished!
Have you commented your code? Does every method have comments before it? Is your code consistently and uniformly indented? We will look in your latest stepN/ directory, so you only have to make sure that code is up to standards.
Name your main directory Project1/ if you haven't already. Keep all of your subdirectories stepN in place. Please move the tweet files out of your Project1/ directory and do not submit them.
Finally, submit your project using the standard submit instructions.