SI485i, Fall 2012
You will come up with a current NLP challenge, and you will attempt to solve that challenge. Solving the problem is not a requirement, but effort and correct application of NLP techniques are. Even if you think your idea is crazy, please talk to me about it. You'd be surprised what kind of data we can get our hands on.
Oct 9th: By COB, email your instructor the following: team members, a project title, a one-paragraph description, and where your data might come from.
Dec 3rd: Class presentation: 10 minutes
Dec 7th: Project writeup due by COB, code submitted electronically by COB.
You may work on a problem related to your capstone or an assignment in another class. However, this project must be a unique task that would not have been otherwise accomplished in that other class. It must be in addition to outside obligations. You cannot double count your work across courses. The honor concept applies: you may not take credit for a project that is also submitted elsewhere (even if it is your own!).
Your presentation must meet all of the following requirements:
Your final paper will be a technical writing piece. You must have the following sections, at a minimum:
You can of course have other sections.
Your final paper should be of sufficient length. Two pages is not sufficient. If you adequately describe your algorithm, all of its parameters, how you set those parameters, what features you used, and why you used those features, then you will have no trouble filling at least 4 pages (12 point font, single spaced).
Word of advice: include the variants of your algorithm that failed. Explain what you tried, say it didn't work, and briefly explain why you think it didn't work.
Progress Reports: 16%
Project code, algorithm, experiments, etc.: 69%
(NOTE: an unclear writeup will hurt your project's core grade. The 69% portion depends largely on the quality and clarity of the writeup, which is itself worth 10%.)
Here is a sample of ideas and projects that have been used in similar NLP courses. You should feel free to come up with your own. Use these as a helpful idea generator.
Build a system that can have a conversation with you. The user types messages, and your system replies based on the user's text. There are many possible approaches here. For example, you could use a large Twitter corpus and reply with the response to whichever past message is most similar to the user's text.
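As one possible starting point, here is a minimal sketch of the similarity idea: score the user's message against stored messages by word overlap (Jaccard similarity) and return the reply attached to the closest one. The `PAIRS` list, the function names, and the choice of Jaccard similarity are all illustrative assumptions, not a required design; a real system would draw its message/reply pairs from a corpus and likely use a stronger similarity measure.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def best_reply(user_message, pairs):
    """Return the reply whose stored message is most similar to user_message."""
    user_tokens = tokenize(user_message)
    best = max(pairs, key=lambda pair: jaccard(user_tokens, tokenize(pair[0])))
    return best[1]

# Hypothetical (message, reply) pairs standing in for a real conversation corpus.
PAIRS = [
    ("what time is the game tonight", "Kickoff is at seven."),
    ("anyone want to grab lunch", "Sure, where are you thinking?"),
    ("this homework is taking forever", "Same here, I am only halfway done."),
]

if __name__ == "__main__":
    print(best_reply("Want to grab some lunch?", PAIRS))
```

Word overlap ignores word order and meaning, so treat this as a baseline to beat rather than a finished approach.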
Choose a type of text that typically contains useful information, but only in free-form written language rather than structured fields. For instance, classifieds on craigslist. Build a system that extracts relevant information for products being offered for sale, such as the price, make, model, etc. Your system will read the sentences and extract the key pieces of information automatically.
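To make the extraction task concrete, here is a rough rule-based sketch that pulls a price and a model year out of an ad with regular expressions. The patterns, the field names, and the sample ad are assumptions made for illustration. Fields such as make and model are harder and would probably need a lexicon or a learned tagger rather than hand-written patterns.

```python
import re

# A dollar sign followed by digits, with optional thousands separators and cents.
PRICE_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?")
# A plausible four-digit model year between 1950 and 2029.
YEAR_RE = re.compile(r"\b(19[5-9]\d|20[0-2]\d)\b")

def extract_fields(ad_text):
    """Return the first price and year found in the ad, or None for each if absent."""
    price_match = PRICE_RE.search(ad_text)
    year_match = YEAR_RE.search(ad_text)
    price = int(price_match.group(1).replace(",", "")) if price_match else None
    year = int(year_match.group(1)) if year_match else None
    return {"price": price, "year": year}

if __name__ == "__main__":
    # Hypothetical ad text, not taken from a real listing.
    print(extract_fields("2004 Honda Civic, runs great, asking $4,500 OBO"))
```

Hand-written patterns like these make a reasonable baseline against which to compare a statistical extractor.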
Lingo Definition Generator
Like urbandictionary, but automated. Given a new slang phrase, what does it mean? Use a large corpus of text to find occurrences of the slang, and cluster words/phrases in similar contexts to create a definition.
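A first step toward this might be simply collecting the words that appear near the slang term across a corpus. The sketch below counts neighbors within a fixed window; the window size, the tiny stopword list, and the example sentences are assumptions, and the clustering step that would turn these counts into a definition is left out.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "that", "it", "to", "of"}

def context_counts(target, sentences, window=2):
    """Count non-stopword tokens within `window` positions of each use of target."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token == target:
                start = max(0, i - window)
                neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
                counts.update(t for t in neighbors if t not in STOPWORDS)
    return counts

if __name__ == "__main__":
    # Hypothetical sentences standing in for a large corpus.
    corpus = [
        "That concert was totally lit last night",
        "The party was lit and loud",
    ]
    print(context_counts("lit", corpus).most_common())
```

Comparing these context counts against those of ordinary words is one way to find near-synonyms that could seed a definition.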
Movie Review Prediction
Use online movie reviews to predict reviews of new movies.
Predictions from Twitter
People have tried to predict situations like flu outbreaks, protests, and elections from Twitter data. Come up with a different type of situation to predict, and have at it! We have one year of Twitter data at the USNA, so the sky's the limit.
Quote Clustering in News
Using news articles that discuss similar quotes, pull out the most talked-about topics and quotes through clustering techniques.
Choose a domain of interest and apply sentiment analysis to it in a way that builds on what we've done in class, specific to your domain.
Use a corpus of actual song lyrics and automatically generate new songs, perhaps given an initial sentence from the user. Make it rhyme.
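One simple way to start is a bigram (Markov chain) model: record which words follow each word in the lyrics, then walk those transitions from the user's starting word. This sketch assumes whitespace tokenization and uses placeholder lines in place of a real lyrics corpus. It does not attempt rhyme, which would need something extra such as a pronunciation dictionary.

```python
import random
from collections import defaultdict

def build_bigram_model(lines):
    """Map each word to the list of words observed immediately after it."""
    model = defaultdict(list)
    for line in lines:
        words = line.lower().split()
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model

def generate_line(model, start_word, max_words=8, rng=None):
    """Generate up to max_words words by sampling observed followers."""
    rng = rng or random.Random()
    words = [start_word]
    while len(words) < max_words:
        followers = model.get(words[-1])
        if not followers:
            break  # No observed continuation for this word.
        words.append(rng.choice(followers))
    return " ".join(words)

if __name__ == "__main__":
    # Placeholder lines, not real song lyrics.
    lyrics = [
        "the night is long and the road is cold",
        "the road is calling me home tonight",
    ]
    model = build_bigram_model(lyrics)
    print(generate_line(model, "the", rng=random.Random(0)))
```

Because followers are stored with repetition, frequent continuations are sampled more often, which is what makes the output resemble the training lyrics.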
Summarize Restaurant Reviews
Take a list of reviews about a restaurant, and generate a single English summary for that restaurant. Use Yelp.com or some other website for the data.
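As a baseline, an extractive summarizer can pick the sentences whose words are most common across all the reviews. Note that this swaps in sentence selection for true summary generation; it is a sketch of one simple technique, with an assumed stopword list and made-up reviews, and not the expected final system.

```python
import re
from collections import Counter

# A small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "it", "to", "of", "but", "i", "we"}

def content_words(text):
    """Return the lowercase non-stopword tokens in the text."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(reviews, num_sentences=1):
    """Return the sentences with the highest average corpus word frequency."""
    sentences = []
    for review in reviews:
        parts = re.split(r"(?<=[.!?])\s+", review)
        sentences.extend(p.strip() for p in parts if p.strip())
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    return sorted(sentences, key=score, reverse=True)[:num_sentences]

if __name__ == "__main__":
    # Made-up reviews for illustration.
    reviews = [
        "The pizza was great. Parking was hard.",
        "Great pizza and friendly staff.",
        "We waited an hour.",
    ]
    print(summarize(reviews))
```

Averaging by sentence length keeps long sentences from winning just because they contain more words.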
Ideas from Stanford's 2011 course
Ideas from Stanford's 2010 course
NOTE: the honor concept applies. You can adopt their topics, but not take their actual implementation ideas off-the-shelf and simply reproduce their system. You must make it your own.