SI485i, Fall 2012

NLP Course Projects: Requirements


You will come up with a current NLP challenge, and you will attempt to solve that challenge. Solving the problem is not a requirement, but effort and correct application of NLP techniques are. Even if you think your idea is crazy, please talk to me about it. You'd be surprised what kind of data we can get our hands on.

Important Dates

Oct 9th: By COB, email your instructor the following: team members, a project title, a one-paragraph description, and where your data might come from.

Dec 3rd: Class presentation: 10 minutes

Dec 7th: Project writeup due by COB, code submitted electronically by COB.

Overlap with Capstones/Classes

You may work on a problem related to your capstone or an assignment in another class. However, this project must be a unique task that would not have been otherwise accomplished in that other class. It must be in addition to outside obligations. You cannot double count your work across courses. The honor concept applies: you may not take credit for a project that is also submitted elsewhere (even if it is your own!).


  1. Data Collection: Your project must involve a significant dataset of text (a corpus). Significant means the size of your corpus must be large enough that a single human could not solve the task in a short period of time. You may use someone else's corpus, or you can create your own. Corpus creation is often as interesting as the task itself, so you will get credit for creating new datasets if that is the route you choose. There are also many freely available corpora online.
  2. Algorithm Complexity: This is the meat of the project. The algorithm and the amount of effort put into it will determine the largest chunk of your grade. Everyone will have a different system, so there are no fixed standards, except that you will probably have (1) a learning component, (2) a text processing component, and (3) an evaluation component. The best projects will have multiple learning components, interacting in different ways on different language phenomena.
  3. Error Analysis: Creating a working project is not enough. You will also look at your output, and analyze when your system is wrong, and when it is right. You will submit a writeup with examples of correct/incorrect test cases, as well as reasons for their occurrence.
  4. Brief Presentation: An in-class presentation of results. See below for requirements.
  5. Technical Writeup: A multi-page writeup describing the task, the challenge, why it is interesting, the dataset, your algorithm, and final analysis. See below for requirements.
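
As one illustration of the evaluation and error analysis requirements above, the following Python sketch scores a system's predictions against gold labels and separates the test cases into correct and incorrect sets. The function name `analyze_errors`, the label strings, and the toy sentences are all hypothetical; your own project would substitute its own task, data, and output format.

```python
from collections import Counter


def analyze_errors(examples, gold, predicted):
    """Split test cases into correct and incorrect, and count confusion pairs.

    examples, gold, and predicted are parallel lists (a hypothetical format).
    """
    if not (len(examples) == len(gold) == len(predicted)):
        raise ValueError("examples, gold, and predicted must have equal length")

    correct, incorrect = [], []
    confusions = Counter()  # (gold label, predicted label) -> count
    for text, g, p in zip(examples, gold, predicted):
        if g == p:
            correct.append((text, g))
        else:
            incorrect.append((text, g, p))
            confusions[(g, p)] += 1

    accuracy = len(correct) / len(examples) if examples else 0.0
    return accuracy, correct, incorrect, confusions


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    texts = ["great movie", "terrible plot", "not bad at all", "boring"]
    gold = ["pos", "neg", "pos", "neg"]
    pred = ["pos", "neg", "neg", "neg"]
    acc, right, wrong, conf = analyze_errors(texts, gold, pred)
    print(f"accuracy = {acc:.2f}")
    for text, g, p in wrong:
        print(f"WRONG: {text!r} gold={g} predicted={p}")
```

The list of incorrect cases is the starting point for the writeup: for each one, you still have to explain why the system got it wrong.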

In-Class Presentation

Your presentation must include and follow all of the following requirements

Technical Writeup

Your final paper will be a technical writing piece. You must have the following sections, at a minimum:

You can of course have other sections.

Your final paper should be of sufficient length. Two pages is not sufficient. If you adequately describe your algorithm, all the parameters in it, how you set your parameters, what features you used, why you used those features, etc. ... then you will have no trouble filling at least 4 pages (12 point font, single spaced).

Word of advice: include the variants of your algorithm that failed. Explain what you tried, say it didn't work, and give a word about why you think it didn't work.


Progress Reports: 16%
Writeup: 10%
Presentations: 5%
Project code, algorithm, experiments, etc.: 69%

(NOTE: if your writeup is not clear, it will hurt your project's core grade. The 69% of the grade is largely dependent on the quality/clarity of the 10% writeup.)

Project Ideas

Here is a sample of ideas and projects that have been used in similar NLP courses. You should feel free to come up with your own. Use these as a helpful idea generator.

Build a system that can have a conversation with you. The user types messages, and your system replies based on the user's text. There are many possible approaches; for example, you could use a large Twitter corpus and select replies by language similarity.
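
One possible reading of the language similarity approach is retrieval: find the stored message most similar to what the user typed, and answer with the reply that followed it. The Python sketch below uses bag-of-words cosine similarity for this. The function names and the idea of storing the corpus as (prompt, reply) pairs are assumptions made for illustration, not a required design.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_reply(message, pairs):
    """Return the reply whose paired prompt is most similar to the message.

    pairs is a list of (prompt, reply) tuples, e.g. tweets and their replies
    (a hypothetical corpus format).
    """
    query = bag_of_words(message)
    _, reply = max(pairs, key=lambda pr: cosine(query, bag_of_words(pr[0])))
    return reply


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    corpus = [
        ("how is the weather today", "Sunny, last I checked."),
        ("what are you eating for lunch", "Probably a sandwich."),
        ("did you watch the game", "Yes, what a finish!"),
    ]
    print(best_reply("What's the weather like?", corpus))
```

Raw word counts are a weak similarity measure, so a real system would likely want better weighting of rare words and a learned component on top.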

Information Extraction
Choose a type of text format that typically contains useful information, but in written language. For instance, classifieds on craigslist. Build a system that extracts relevant information for products being offered for sale, such as the price, make, model, etc. Your system will read the sentences and extract the key pieces of information automatically.
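
A simple baseline for this kind of extraction is pattern matching. The Python sketch below pulls a price and a model year out of ad text with regular expressions. The patterns, the US-dollar assumption, the 1950 to 2029 year range, and the `extract_listing` name are all illustrative choices. Extracting the make and model would need more than this, such as a list of known names or a learned tagger.

```python
import re

# Matches prices such as "$4,500", "$ 120.50", or "$900" (US-dollar ads assumed).
PRICE_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{2}))?")
# Matches a four-digit year between 1950 and 2029 (an assumed range).
YEAR_RE = re.compile(r"\b(19[5-9]\d|20[0-2]\d)\b")


def extract_listing(text):
    """Pull a price and a model year out of free-text ad copy, if present."""
    record = {"price": None, "year": None}
    price = PRICE_RE.search(text)
    if price:
        dollars = price.group(1).replace(",", "")
        cents = price.group(2) or "00"
        record["price"] = float(f"{dollars}.{cents}")
    year = YEAR_RE.search(text)
    if year:
        record["year"] = int(year.group(1))
    return record


if __name__ == "__main__":
    # Toy ad, invented for illustration only.
    print(extract_listing("Selling my 2004 Honda Civic, runs great, $4,500 OBO"))
```

Patterns like these fail in predictable ways. For example, a price written as "$1999" would also match the year pattern, which is exactly the kind of case your error analysis should examine.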

Lingo Definition Generator
Like urbandictionary, but automated. Given a new slang phrase, what does it mean? Use a large corpus of text to find occurrences of the slang, and cluster words/phrases in similar contexts to create a definition.

Movie Review Prediction
Use online movie reviews to predict reviews of new movies.

Predictions from Twitter
People have tried to predict situations like flu outbreaks, protests, and elections from Twitter data. Come up with a different type of situation to predict, and have at it! We have one year of Twitter data at the USNA, so the sky's the limit.

Quote Clustering in News
Using news articles that discuss similar quotes, pull out the most talked-about topics and quotes through clustering techniques.

Sentiment Analysis
Choose a domain of interest and apply sentiment analysis to it in a way that builds on what we've done in class, specific to your domain.

Song Generator
Use a corpus of actual song lyrics and automatically generate new songs, perhaps given an initial sentence from the user. Make it rhyme.
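
One simple starting point for generation is a bigram Markov chain: record which words follow which in the lyrics, then sample a new line one word at a time. The Python sketch below assumes the corpus is a list of lyric lines. The function names and the `<s>` and `</s>` boundary markers are illustrative choices, and it makes no attempt at rhyme, which would need pronunciation information on top.

```python
import random
from collections import defaultdict


def build_bigram_model(lines):
    """Map each word to the list of words observed right after it."""
    model = defaultdict(list)
    for line in lines:
        words = ["<s>"] + line.lower().split() + ["</s>"]
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model


def generate_line(model, rng, first_word=None, max_words=12):
    """Sample one line by walking the bigram chain until </s> or max_words."""
    word = first_word.lower() if first_word else "<s>"
    output = [] if word == "<s>" else [word]
    while len(output) < max_words:
        followers = model.get(word)
        if not followers:  # unseen word: nothing to continue with
            break
        # Repeated followers stay in the list, so frequent ones are sampled more often.
        word = rng.choice(followers)
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)


if __name__ == "__main__":
    # Toy lyrics, invented for illustration only.
    lyrics = ["the sun goes down", "the moon comes up"]
    model = build_bigram_model(lyrics)
    print(generate_line(model, random.Random(0)))
    print(generate_line(model, random.Random(0), first_word="moon"))
```

With a corpus this small the model can only reproduce its training lines; variety appears once many lines share words.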

Summarize Restaurant Reviews
Take a list of reviews about a restaurant, and generate a single English summary for that restaurant. Use a restaurant review website for the data.

Ideas from Stanford's 2011 course
Ideas from Stanford's 2010 course
NOTE: the honor concept applies. You can adopt their topics, but you may not take their actual implementation ideas off the shelf and simply reproduce their system. You must make it your own.