NLTK Library and Complicated Lists

The next lab will use a library called NLTK. Today we will discuss in a little more detail what's involved when you use ANY library, as well as the tools NLTK brings us. We'll use it as a running example to go deeper into lists that contain things beyond primitives. You can find several other simple examples on this tutorial webpage.

Today in Class

  1. Importing NLTK and downloading data
  2. English-processing functions like word_tokenize(s) and pos_tag(L)
  3. Review: lists of tuples.
  4. NLTK's stop word list
  5. NLTK's FreqDist
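Item 3 above, lists of tuples, is exactly the shape of what pos_tag returns: each element is a (word, tag) pair. Here is a minimal pure-Python review of looping over such a list with tuple unpacking (this tagged sentence is typed in by hand for illustration, not produced by NLTK):

```python
# A hand-made list of (word, tag) tuples, shaped like pos_tag output.
tagged = [('John', 'NNP'), ('left', 'VBD'), ('the', 'DT'), ('store', 'NN'), ('.', '.')]

# Tuple unpacking: each pair splits into word and tag.
for word, tag in tagged:
    print(word, "is tagged", tag)

# Indexing works too: tagged[0] is a tuple, tagged[0][0] is its first element.
first_word = tagged[0][0]
first_tag = tagged[0][1]
print(first_word, first_tag)
```

The `for word, tag in tagged` pattern is the same one the noun-extraction program below uses.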

Install NLTK

conda install nltk

You should be able to understand this program.

One thing NLTK does for you is split English text into "tokens": words and punctuation. It also has English processors, like a Part of Speech (POS) tagger, that come in handy. Below is an example program that tags all the words with their POS and pulls out just the nouns.

You can run this program directly:

import nltk   # of course
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')   # machine-learned model you download once

def extract_nouns(sentence):
    """ Given a string, return a list of strings: the sentence's nouns """
    # Try printing each of these out to see what they contain.
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)

    # Get the nouns!
    nouns = list()
    for word, tag in tagged:
        if tag.startswith('NN'):   # NN, NNS, NNP, NNPS are all noun tags
            nouns.append(word)

    return nouns

sent = input("Sentence? ")          # John left the store after buying some peaches.
nouns = extract_nouns(sent)
print("Your nouns are:", nouns)