Project 1: Tweet Search

Automated ways of searching text are now used across the Web, within institutions, and across critical services like the healthcare industry. This project will have you write a basic search tool to filter large amounts of textual data in a fraction of a second. It would take a human days to do the same thing by hand.

We are providing you with almost 200 thousand actual public tweets from Twitter. Your program will read them all and provide a search interface. Dr. Chambers has a continual feed from Twitter that downloads millions of tweets every day. This is just a fraction of a fraction of the data that we have here in the CS department. We use it to do fundamental research in artificial intelligence and information extraction, such as finding correlations with presidential approval polls (some older students have conducted such research). This short project will let you have some fun poking around the data.

Due date: Feb 14, 1700, submit with the normal submit script.

Honor: see course policy. You may not discuss any aspect of this project with any other person, nor give or receive help on it.


Disclaimer
You may not distribute this project's data to anyone beyond the USNA. Our agreement with Twitter prevents copying and distributing.
This is raw, real world data. The standard disclaimer applies as it does whenever we step out onto the Web. You may come across offensive material. Please behave like mature adults and future officers, as appropriate.


Step 1 (50 pts)

Input Data

Input to your project's code will come from a single text file. The file contains one tweet per line. Each line contains a single tweet's 3 data fields (tweet, username, date) separated by tab characters. You will write code for this Step 1 that reads this file, creates Tweet objects, and stores them in an array. Here is an example of a line in the file:

omg i can't believe you said that!!! #whoknows       stargazerz      2013-03-20

Do this now: create a Project1/ directory for your code.
Download the two tweet data files into your new directory: alltweets.txt (all tweets, 15 MB) and sometweets.txt (some tweets, 2 KB).
Peek inside and take a look at the data.

Create a Tweet class.

You must write a class definition for a Tweet object that stores 5 pieces of data:

  1. String: the text of the tweet
  2. String: the username of the person who wrote it
  3. int: the year the tweet was sent
  4. int: the month the tweet was sent
  5. int: the day of the month the tweet was sent

All tweet variables must be private to avoid other classes tampering with them by accident. Since the variables are private, you'll need to write some get methods to return their values to whoever will want to read them: getText(), getYear(), getMonth(), etc.

Make a constructor that uses this exact prototype definition:

public Tweet(String newtext, String newuser, String newdate);

Remember that the constructor's sole job is to initialize the 5 variables in the Tweet object. You will use these three parameters to fill in all 5 member variables. You will need to split that String date into three int values. How do we do that in Java? You already know about the .split() method to break it up into 3 parts. This will result in a String for the year like "2013". To convert this to an int, Java provides a static method that you can call:
Integer.parseInt("2013") ==> 2013
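
As a rough illustration of the split-and-parse idea, the sketch below breaks a date string such as "2013-03-20" on the hyphen and converts each piece with Integer.parseInt. The names DateParts and parse are made up for this example; in your project this logic would sit inside the Tweet constructor and fill the three int member variables.

```java
// Hypothetical helper for illustration only; not part of the required design.
class DateParts {
    // Splits "year-month-day" and returns {year, month, day} as ints.
    static int[] parse(String date) {
        String[] pieces = date.trim().split("-");   // e.g. {"2013", "03", "20"}
        int year  = Integer.parseInt(pieces[0]);
        int month = Integer.parseInt(pieces[1]);    // "03" parses to 3
        int day   = Integer.parseInt(pieces[2]);
        return new int[] { year, month, day };
    }

    public static void main(String[] args) {
        int[] ymd = parse("2013-03-20");
        System.out.println(ymd[0] + " " + ymd[1] + " " + ymd[2]); // prints "2013 3 20"
    }
}
```

Note that parseInt drops the leading zero, which is why the month above comes back as 3 rather than "03".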

Finally, create a member method in Tweet: String toString(). This method returns a String that represents the Tweet's data. The method does not print anything; it simply builds and returns a String. What should the string look like? Follow this format (use a tab character to separate the 3 output fields):

tweet text goes first     [usernameInBrackets]    1/30/2013
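
One way the format above could be assembled is sketched here. TweetFormat and format are hypothetical names chosen so the example stands alone; in your project the same string-building would live in Tweet's toString() and read the object's private fields instead of taking parameters.

```java
// Hypothetical stand-alone version of the formatting step, for illustration only.
class TweetFormat {
    // Builds "text<TAB>[user]<TAB>month/day/year" and returns it without printing.
    static String format(String text, String user, int year, int month, int day) {
        return text + "\t[" + user + "]\t" + month + "/" + day + "/" + year;
    }

    public static void main(String[] args) {
        System.out.println(format("tweet text goes first", "usernameInBrackets", 2013, 1, 30));
    }
}
```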

Create your main program.

Create Search.java with a main() method. Your main method will get the file path from the command line, and then call another method readFile(path). You must write readFile: make it a static method inside Search.java that reads the text file of tweets, line by line: Tweet[] readFile(String path). It must construct and return an array of Tweets. To use an array, you need to initialize it to a certain size! We will hardcode the size for this step: size 33 (this is how many tweets are in the file sometweets.txt).
Then write code to open the file, read its lines, and fill the Tweet[] array. How do we read files in Java? See the HOWTO section below, "Read Files in Java".

In main(), call readFile(path), save the returned Tweet[] array. Print the size of the array, and then write a for loop to print all the tweets. Make sure this works before moving on!

Expected Output

Make sure your running program looks exactly like the following. Note that the date format is different from the input: the year is listed last, and the parts of the date are separated by slashes.

> java Search sometweets.txt
Array size: 33
i kicked daniels knee	[st0rmcl0aks]	8/11/2013
rt @jvson_: i think it's horrible that people feel embarrassed to take btec now because of how it's mocked on here	[deebaybiie]	8/11/2013
@ally_b237 asehh..your bus has not yet touch down??	[wiz_ked]	8/12/2013
...
...
rt @coralynencs: @alexxmathias mdddddddddddddr	[alexxmathias]	8/12/2013

STOP: save your progress!

Create a subdirectory called step1/ inside your current Project1/ directory. Copy all .java files into it, and nothing else. You will do this at each step.

HOWTO: Read Files in Java

Files are easy with the familiar Scanner class. Instead of creating a Scanner for System.in like you've been doing, we create a Scanner for a File. You'll need to import two classes for this:

import java.util.Scanner;
import java.io.File;

Then create a Scanner that wraps the File:

Scanner in = new Scanner(new File("../path/to/tweet-file.txt"), "utf-8");

You now use the Scanner like normal. You can also make use of in.hasNextLine() which tells you if there are any more lines in the file remaining. This can control a loop that reads a new line each time around the loop.

There is one catch. Files might not exist, or have private permissions, so code will break if it tries to read from the file. Java throws what is called an "exception" when such run-time errors occur. Your code needs to "catch" the exception. You need to wrap your code with a try/catch block like so:

try {
    Scanner in = new Scanner(new File(filename), "utf-8");
    // your code to read lines one at a time
    // ...
    // finished reading all lines
    in.close();
} catch (Exception ex) {
    // the file is missing or unreadable: report it and exit with a non-zero status
    ex.printStackTrace();
    System.exit(1);
}
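
To show how the loop in the middle of that try block might look, here is a minimal sketch that reads lines with hasNextLine() and splits each one on the tab character. The class and method names (LineReaderSketch, readFields) are invented for this example, and it collects rows in an ArrayList only to stay short; the project itself asks for a Tweet[] array in Step 1 and your own Queue in Step 2.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical example class; an ArrayList stands in for the project's array/Queue.
class LineReaderSketch {
    // Reads every remaining line and splits it into its tab-separated fields.
    static List<String[]> readFields(Scanner in) {
        List<String[]> rows = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (line.isEmpty()) continue;      // skip blank lines
            rows.add(line.split("\t"));        // {text, username, date}
        }
        return rows;
    }

    public static void main(String[] args) {
        try {
            Scanner in = new Scanner(new File(args[0]), "utf-8");
            List<String[]> rows = readFields(in);
            in.close();
            System.out.println("Lines read: " + rows.size());
        } catch (Exception ex) {
            ex.printStackTrace();
            System.exit(1);
        }
    }
}
```

Passing the Scanner into the method keeps the file handling in one place and lets you try the parsing on a short String before pointing it at a real file.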

Step 2 (20 pts): Use a Queue

The previous step reads the file into an array of known size. This is not ideal because we want to be able to read arbitrary sizes of files. It is also not ideal because arrays are difficult to resize, and expensive to duplicate. This step replaces the Tweet array, and uses a Queue instead.

Reuse Your Old Queue and Node.

Now that you have a Tweet class to store tweets, you will use a Queue to store those Tweet objects. In fact, you should reuse your past lab's code! Previously our Queue used Nodes, and each Node contained a single String. The only change here is that the data is no longer a String; it is now a Tweet.

Use your past lab as a starting point, or feel free to use our solution: Lab02.java Queue.java Node.java

  1. Modify Node.java to store a Tweet object.
  2. Move your static pop/push/peek methods to be member methods in Queue. Queue should still operate over Node objects! Change those methods to work with Tweet objects instead of String objects. Both pop/peek should return a Tweet. None of these should be static methods.

After you've updated your Queue to work with Tweets, you will want to add a few other methods that will be useful in your main program below:

  1. Create a member method in Queue: void printAll(). This should loop over the entire queue and call Tweet's toString() method above to print the tweets out. Don't alter the queue itself! Just print every tweet in order!
  2. Create a member method in Queue: int currentSize(). This should return how many Nodes are currently in the Queue. It is up to you to decide how best to implement this.
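
The changes listed above might end up looking roughly like the following linked queue. This is a sketch under assumptions, not the lab solution: MiniTweet is a stand-in so the example compiles by itself, the class names TweetNode and TweetQueueSketch are invented, and keeping a running counter is just one of several reasonable ways to implement currentSize().

```java
// Minimal stand-in for the real Tweet class so this example compiles on its own.
class MiniTweet {
    private final String text;
    MiniTweet(String text) { this.text = text; }
    public String toString() { return text; }
}

// One link in the chain: holds a tweet and a reference to the next node.
class TweetNode {
    MiniTweet data;
    TweetNode next;
    TweetNode(MiniTweet data) { this.data = data; }
}

// Hypothetical queue sketch; your lab's Queue may be organized differently.
class TweetQueueSketch {
    private TweetNode head;   // front of the queue (next to be popped)
    private TweetNode tail;   // back of the queue (most recently pushed)
    private int size;         // updated on every push and pop

    // Adds a tweet at the back.
    void push(MiniTweet t) {
        TweetNode n = new TweetNode(t);
        if (tail == null) { head = n; } else { tail.next = n; }
        tail = n;
        size++;
    }

    // Removes and returns the front tweet, or null if the queue is empty.
    MiniTweet pop() {
        if (head == null) return null;
        MiniTweet t = head.data;
        head = head.next;
        if (head == null) tail = null;
        size--;
        return t;
    }

    // Returns the front tweet without removing it, or null if empty.
    MiniTweet peek() { return head == null ? null : head.data; }

    int currentSize() { return size; }

    // Walks the chain with a temporary reference; the queue is not modified.
    void printAll() {
        for (TweetNode cur = head; cur != null; cur = cur.next) {
            System.out.println(cur.data.toString());
        }
    }
}
```

printAll() uses its own cursor variable instead of popping, which is what leaves the queue intact afterwards.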

Update Search.java with the Queue.

Now update main() and readFile() to build a Queue of Tweets instead of filling an array of Tweets.

Expected Output.

> java Search sometweets.txt
Queue size: 33
i kicked daniels knee	[st0rmcl0aks]	8/11/2013
rt @jvson_: i think it's horrible that people feel embarrassed to take btec now because of how it's mocked on here	[deebaybiie]	8/11/2013
@ally_b237 asehh..your bus has not yet touch down??	[wiz_ked]	8/12/2013
...
...
rt @coralynencs: @alexxmathias mdddddddddddddr	[alexxmathias]	8/12/2013

STOP: save your progress!

Create a subdirectory called step2/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.


Step 3 (15 pts): Keyword Filter

Now we add search functionality! The user will enter search words, and you'll create a new queue with only matching tweets. You will change main() to allow for user input, as in previous labs and homeworks. You will prompt the user with a question mark "? ", and the user will type single search query words.

When the user enters a search word, you must use your current queue to create a brand new queue. This new queue will contain only those tweets that contain the search keyword. The trick here is that you are not modifying the original queue of tweets. You are creating a new one with its own Node objects. However, these new Node objects will point to the same Tweet objects in the original queue! You will "share" them in memory. See the picture on the right.
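
The sharing idea can be sketched as follows. Everything here is illustrative: the nested Item, Node, and Queue classes are stand-ins for your own Tweet, Node, and Queue, and the method name filterForKeyword is an assumption made by analogy with the filter methods named in later steps. The match is a case-insensitive substring test, which is what the sample sessions in this handout appear to do; treat that as an inference and check it against the expected output.

```java
// Hypothetical, self-contained sketch of building a filtered queue that shares tweets.
class FilterSketch {
    // Stand-in for Tweet.
    static class Item {
        final String text;
        Item(String text) { this.text = text; }
    }

    static class Node {
        final Item data;
        Node next;
        Node(Item data) { this.data = data; }
    }

    static class Queue {
        Node head, tail;
        int size;

        void push(Item t) {
            Node n = new Node(t);
            if (tail == null) { head = n; } else { tail.next = n; }
            tail = n;
            size++;
        }

        // Returns a NEW queue; this queue and its nodes are left untouched.
        Queue filterForKeyword(String keyword) {
            Queue result = new Queue();
            String needle = keyword.toLowerCase();
            for (Node cur = head; cur != null; cur = cur.next) {
                if (cur.data.text.toLowerCase().contains(needle)) {
                    result.push(cur.data);   // new Node, same Item object
                }
            }
            return result;
        }
    }
}
```

Because push() wraps the existing object in a fresh Node, the two queues have separate chains but point at the same tweets, so filtering 188,671 tweets copies no tweet text.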

Allow for one special user input, "!dump". If the user types this, then you must print out your entire current queue to the screen. Print all the tweets.

To achieve this end, implement the following:

Alter main() to prompt and allow for user input as described above. The end result should follow this output exactly:

> java Search sometweets.txt
Queue size: 33
? you
Queue size: 7
? !dump
@ally_b237 asehh..your bus has not yet touch down??	[wiz_ked]	8/12/2013
@_xratedxbeauty lol how far is you?	[ayoo_imbadx]	8/12/2013
a dream is a wish your heart makes.	[emilyy_gant]	8/12/2013
rt @hayescrazed_xo: @hayniacs2327 you're welcome. you're welcome. you're welcome.	[hayniacs2327]	8/12/2013
@blackieechannn who do you have	[ashleyyymariek]	8/12/2013
you can only know the time you go to bed, but you can never know the time you sleep.. rt if u agree"	[tolu1786]	8/12/2013
@stephanieirvine remember that time i converted you to fnl?	[wingster55]	8/12/2013
Queue size: 7
? dream
Queue size: 1
? !dump
a dream is a wish your heart makes.	[emilyy_gant]	8/12/2013
Queue size: 1
? 

Now you're ready to try the big file:

> java Search alltweets.txt
Queue size: 188671
? happy
Queue size: 1647
? everyone
Queue size: 19
? tired
Queue size: 1
? !dump
RT @enahzxo_: I'm so tired of trying to make everyone else happy.	[_TimiciaAri]	8/16/2013
Queue size: 1
? 

And one more for fun:

> java Search alltweets.txt
Queue size: 188671
? navy
Queue size: 42
? rihanna
Queue size: 10
? cheer
Queue size: 1
? !dump
RT @chaneIrihanna: cheer up navy #mtvhottest Rihanna	[Fentyisdahottes]	8/14/2013
Queue size: 1
?

STOP: save your progress!

Create a subdirectory called step3/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.


Step 4 (5 pts): Non-Keyword Filter

This step allows users to enter negated keywords. This means that they can enter a word such that you remove all tweets that actually contain the keyword. The end result is a list of tweets where none include the keyword.

Change your program to allow for this second type of query. The user input will be preceded by a minus sign, such as "-sad". You will then create a new queue from the current queue, but this time only keep those tweets that do not have the given word (e.g., tweets without "sad" in them).

  1. Write a member method in Queue: Queue filterForNotKeyword(String keyword)
  2. Change main() to allow for keywords that start with a minus sign: "-happy"

> java Search sometweets.txt
Queue size: 33
? the
Queue size: 8
? -a
Queue size: 1
? !dump
under the influence of music.	[zeandercarter]	8/12/2013
Queue size: 1
? 

Here is one on the big file:

> java Search alltweets.txt
Queue size: 188671
? happy
Queue size: 1647
? world
Queue size: 31
? -birthday
Queue size: 26
? :)
Queue size: 3
? !dump
Good morning :)"@ratihibrahim: Good morning world, good morning good people, good morning happy sunday...."	[murty_pane]	8/17/2013
RT @tcookin: Happy World Photography Day guys! Keep travelling and keep clicking! :)	[vikrant7985]	8/19/2013
@Real_Liam_Payne If you'restill online and you read this please follow me?:)x You would make me SOhappy<3ILY so muchx You guys are myworldxX	[MelissaTweets13]	8/20/2013
Queue size: 3

STOP: save your progress!

Create a subdirectory called step4/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.

HOWTO: Grab substrings of a String

The Java String class includes a substring method that lets you grab just a portion of the String. Character positions are counted from 0. If you want the substring that starts at index n and runs to the end of the string, try this:

String mystring = "hello";
String substr = mystring.substring(1);
// substr is now "ello"

You can also use two arguments to grab the characters from index n up to, but not including, index m:

String substr = mystring.substring(n,m);
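
A short demonstration of both forms, plus the use this project has for them: peeling the leading "-" or "+" off a query. SubstringDemo and dropPrefix are hypothetical names for this example only.

```java
// Hypothetical demo class; only String.substring itself is standard Java.
class SubstringDemo {
    // Drops the first character, e.g. "-sad" becomes "sad".
    // Callers should check that the query is non-empty first:
    // "".substring(1) throws StringIndexOutOfBoundsException.
    static String dropPrefix(String query) {
        return query.substring(1);
    }

    public static void main(String[] args) {
        String s = "hello";
        System.out.println(s.substring(1));      // prints "ello"
        System.out.println(s.substring(1, 3));   // prints "el" (indexes 1 and 2)
        System.out.println(dropPrefix("-sad"));  // prints "sad"
    }
}
```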

Step 5 (5 pts): Date Filter

Now we will add a filter based on the day that the tweet was tweeted. The user can type in a date, and you must then create a new queue from the current queue that only keeps tweets that occurred on the given day. Days will be entered with a plus sign (+) in front of them. The format will be:

+year-month-day
For example:
+2014-1-28

To accomplish this, write a member method in Queue: Queue filterForDate(String date)
This filter should behave like the previous steps, but this time it only keeps the Tweets that occurred on the given day. You'll obviously have to split the String date up in your method, and compare it to the Tweet object's int fields.
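
The comparison at the heart of this filter might look like the sketch below. DateMatchSketch and sameDay are invented names, and the example assumes main() has already removed the leading "+" before the date string reaches it. In your Queue method you would apply the same test to each Tweet using its getYear(), getMonth(), and day getter.

```java
// Hypothetical helper showing the date comparison only, not the full filter.
class DateMatchSketch {
    // Returns true when "year-month-day" names the same day as the three int fields.
    static boolean sameDay(String date, int year, int month, int day) {
        String[] p = date.split("-");
        if (p.length != 3) return false;          // not in year-month-day form
        try {
            return Integer.parseInt(p[0]) == year
                && Integer.parseInt(p[1]) == month
                && Integer.parseInt(p[2]) == day;
        } catch (NumberFormatException ex) {
            return false;                         // a piece was not a number
        }
    }
}
```

Comparing ints instead of Strings means that "2013-8-11" and "2013-08-11" are treated as the same day.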

> java Search sometweets.txt
Queue size: 33
? +2013-8-11
Queue size: 2
?
> java Search alltweets.txt
Queue size: 188671
? omg
Queue size: 1089
? +2013-8-17
Queue size: 102
?

STOP: save your progress!

Create a subdirectory called step5/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.


Step 6 (5 pts): Reset

Finally, add !reset and !quit options. The !quit option terminates the program. The !reset option lets the user start back at the original queue. Ignore everything that has been searched so far, and begin over again. Do NOT re-read the file from disk. You should always keep the original queue around, and none of your methods should have modified it if you did the above steps correctly.

> java Search alltweets.txt
Queue size: 188671
? army
Queue size: 103
? !reset
Queue size: 188671
? navy
Queue size: 42
? !quit
Goodbye!
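
The key to !reset is holding two references: one to the original queue, which never changes, and one to the current view, which each filter replaces. The sketch below illustrates that pattern along with the command dispatch. It is not the required design: SearchLoopSketch, filter, and run are hypothetical names, a List of Strings stands in for your Queue of Tweets, the date filter is left out for brevity, and output goes to a PrintStream parameter so the loop can be exercised without a keyboard.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch of the prompt loop; a List<String> stands in for the Queue.
class SearchLoopSketch {
    // Stand-in for the Queue filter methods: always builds a NEW list.
    static List<String> filter(List<String> src, String word, boolean keepMatches) {
        List<String> out = new ArrayList<>();
        String needle = word.toLowerCase();
        for (String t : src) {
            if (t.toLowerCase().contains(needle) == keepMatches) out.add(t);
        }
        return out;
    }

    static void run(List<String> original, Scanner in, PrintStream out) {
        List<String> current = original;          // start by viewing everything
        out.print("Queue size: " + current.size() + "\n");
        while (true) {
            out.print("? ");
            if (!in.hasNextLine()) break;         // input ended without !quit
            String q = in.nextLine().trim();
            if (q.isEmpty()) continue;            // nothing typed: prompt again
            if (q.equals("!quit")) {
                out.print("Goodbye!\n");
                break;
            } else if (q.equals("!reset")) {
                current = original;               // no re-read from disk needed
            } else if (q.equals("!dump")) {
                for (String t : current) out.print(t + "\n");
            } else if (q.startsWith("-") && q.length() > 1) {
                current = filter(current, q.substring(1), false);
            } else {
                current = filter(current, q, true);
            }
            out.print("Queue size: " + current.size() + "\n");
        }
        out.flush();
    }

    public static void main(String[] args) {
        List<String> demo = new ArrayList<>();
        demo.add("go navy");
        demo.add("army strong");
        run(demo, new Scanner(System.in), System.out);
    }
}
```

Since every filter returns a new collection and never edits its source, resetting is a single assignment back to the original reference.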

STOP: save your progress!

Create a subdirectory called step6/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.

STOP: you're finished!

Have you commented your code? Does every method have comments before it? Is your code consistently and uniformly indented? We will look in your latest stepN/ directory, so you only have to make sure that code is up to standards.

Name your main directory Project1/ if you haven't already. Keep all of your subdirectories stepN in place. Please move the tweet files out of your Project1/ directory and do not submit them.

Finally, submit your project using the standard submit instructions.