NLTK Library and Complicated Lists

The next lab will use a library called NLTK. Today we will discuss in a little more detail what's involved when you use ANY library, as well as the tools NLTK brings us. We'll use this as a running example to go deeper into Lists that contain things beyond primitives. You can find several other simple examples on this tutorial webpage.

Today in Class

  1. Importing libraries (modules)
  2. Options when importing, renaming functions.
  3. English functions like word_tokenize(s) and pos_tag(s)
  4. How to use a List of Tuples.
  5. NLTK's stop word list. (if time permits)
  6. Lemmatizing words (using NLTK's WordNetLemmatizer, if time)

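Item 2 above, the options you have when importing, can be sketched with Python's built-in math module (so it runs without installing anything; nltk works the same way):

```python
# Four common ways to import. Each one changes what name you use afterward.
import math                    # use as math.sqrt(9)
import math as m               # rename the module: m.sqrt(9)
from math import sqrt          # pull in one function: sqrt(9)
from math import sqrt as root  # rename that function: root(9)

print(math.sqrt(9), m.sqrt(9), sqrt(9), root(9))
```

All four names refer to the same underlying function; renaming is purely for convenience (shorter names, or avoiding a clash with one of your own variables).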
You should be able to understand this program.

One thing NLTK does for you is split English text into "tokens": words and punctuation. It also has English processors, like a Part of Speech (POS) tagger, that come in handy. Below is an example program that tags all the words with their POS and pulls out just the nouns.
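The tagger's output is a List of Tuples: each element pairs a word with its tag. Here is a hand-built example of that shape (the tags shown are typical of what pos_tag assigns, but treat the exact tags as an assumption), so you can see how to loop over it and unpack each tuple:

```python
# A hand-made value with the same shape pos_tag returns:
# a List of (word, tag) Tuples.
tagged = [('John', 'NNP'), ('left', 'VBD'), ('the', 'DT'), ('store', 'NN')]

# Loop over the list; unpack each 2-tuple into two variables at once.
for word, tag in tagged:
    print(word, "is tagged", tag)

# Indexing works too: tagged[0] is a tuple, tagged[0][0] is its word.
print(tagged[0][0])   # John
```

This two-level indexing (list first, then tuple) is exactly what the program below relies on.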

You can run this one from here directly:

import nltk   # of course
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')   # just data you have to download once

def extract_nouns(sentence):
    """ Given a string, return a List of strings: the sentence's nouns """
    # Try printing each of these out to see what they contain.
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)

    # Get the nouns!
    nouns = list()
    for wordtag in tagged:
        word, tag = wordtag
        if tag.startswith('NN'):   # NN, NNS, NNP, NNPS are all noun tags
            nouns.append(word)

    return nouns

sent = input("Sentence? ")          # John left the store after buying some peaches.
nouns = extract_nouns(sent)
print("Your nouns are: ", nouns)