Introduction to NLTK

Text is everywhere: approximately 80% of all data is estimated to be unstructured text/rich data (web pages, social networks, search queries, documents, …), and text data is growing fast, at an estimated 2.5 exabytes every day!

We have seen how to do some basic text processing in Python; now we introduce an open-source framework for natural language processing that can further help us work with human languages: NLTK (the Natural Language Toolkit).

Tokenise a text

Let’s start with a basic NLP task, the usual tokenisation (splitting a text into tokens, or words).

You can follow along with a notebook on GitHub.

The basic atomic parts of each text are the tokens. A token is the NLP name for a sequence of characters that we want to treat as a group. We have seen how we can extract tokens by splitting the text at the blank spaces.
NLTK has a function word_tokenize() for this:

sampleText1 = "The Elephant's 4 legs: THE Pub! You can't believe it or can you, the believer?"

import nltk
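# note: the tokeniser models may need a one-off download, e.g. nltk.download('punkt')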

s1Tokens = nltk.word_tokenize(sampleText1)
len(s1Tokens)
21

21 tokens extracted, which include words and punctuation.
Note that the tokens are different from what a split by blank spaces would give, e.g. “can’t” is considered by NLTK as TWO tokens, “ca” and “n’t” (= “not”), while a tokeniser that splits the text by spaces would consider it a single token, “can’t”.
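
To see the difference concretely, here is a quick sketch (not in the original notebook) that prints both token lists for the sample sentence defined above:

print(sampleText1.split(' '))           # naive split at blank spaces
print(nltk.word_tokenize(sampleText1))  # NLTK tokeniser: punctuation and clitics become separate tokens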

And we can apply it to an entire book, “The Prince” by Machiavelli, which we used last time:

with open('../datasets/ThePrince.txt', 'r') as f:
  bookRaw = f.read()

bookTokens = nltk.word_tokenize(bookRaw)
bookText = nltk.Text(bookTokens) # special format
nBookTokens= len(bookTokens) # or alternatively len(bookText)

print ("*** Analysing book ***")
print ("The book is {} chars long".format (len(bookRaw)))
print ("The book has {} tokens".format (nBookTokens))
Out:
*** Analysing book ***
 The book is 300814 chars long
 The book has 59792 tokens

As mentioned above, the NLTK tokeniser works in a more sophisticated way than just splitting by spaces, therefore this time we get more tokens.
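
As a quick check (a small sketch, not part of the original code), we can compare it with a naive whitespace split of the same raw text:

nSplitTokens = len(bookRaw.split())
print("A plain split() gives {} tokens, versus {} with NLTK".format(nSplitTokens, nBookTokens))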

Sentences

NLTK has a built-in sentence splitter too:

text1 = "This is the first sentence. A liter of milk in the U.S. costs $0.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text1)
len(sentences)
4
In: sentences
Out:
['This is the first sentence.',
 'A liter of milk in the U.S. costs $0.99.',
 'Is this the third sentence?',
 'Yes, it is!']

As you see, it does not split just after each full stop but checks whether the stop is part of an acronym (U.S.) or a number (0.99).
It also correctly splits sentences after question or exclamation marks, but not after commas.

sentences = nltk.sent_tokenize(bookRaw) # extract sentences
nSent = len(sentences)
print ("The book has {} sentences".format (nSent))
print ("and each sentence has in average {} tokens".format (nBookTokens / nSent))
Out:
 The book has 1416 sentences
 and each sentence has on average 42.22598870056497 tokens

Most common tokens

What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?

The NLTK FreqDist class is used to encode “frequency distributions”, which count the number of times that something occurs, for example a token.

Its most_common() method then returns a list of tuples where each tuple is of the form (token, frequency). The list is sorted in descending order of frequency.

def get_top_words(tokens):
  # Calculate frequency distribution
  fdist = nltk.FreqDist(tokens)
  return fdist.most_common()

topBook = get_top_words(bookTokens)
# Output top 20 words
topBook[:20]
Out:
[(',', 4192), ('the', 2954), ('to', 2081), ('and', 1794), ('of', 1772), ('.', 1397), ...

The comma is the most common token: we need to remove the punctuation.

Most common alphabetic tokens

We can use the string method isalpha() to check whether a token is made only of alphabetic characters, i.e. it is a word and not punctuation (or a number).
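
As a quick illustration of what isalpha() does (a small sketch, not part of the original code):

print("legs".isalpha())  # True: only alphabetic characters
print("4".isalpha())     # False: digits are not alphabetic
print("n't".isalpha())   # False: contains an apostrophe
print(",".isalpha())     # False: punctuation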

topWords = [(freq, word) for (word,freq) in topBook if word.isalpha() and freq > 400]
topWords
Out:
[(2954, 'the'), (2081, 'to'), (1794, 'and'), (1772, 'of'),
 (946, 'in'), (844, 'he'), (759, 'a'), ....

We can also lowercase the text (remove any capital letters) before tokenising:

def preprocessText(text, lowercase=True):
  if lowercase:
    tokens = nltk.word_tokenize(text.lower())
  else:
    tokens = nltk.word_tokenize(text)

  return [word for word in tokens if word.isalpha()]

bookWords = preprocessText(bookRaw)

topBook = get_top_words(bookWords)
# Output top 20 words
topBook[:20]
Out:
[('the', 3110), ('to', 2108), ('and', 1938), ('of', 1802),
 ('in', 993), ('he', 921), ('a', 781), ....

Slightly more ‘the’ tokens than before, because now they also include the occurrences that started with a capital letter.

print ("*** Analysing book ***")
print ("The text has now {} words (tokens)".format (len(bookWords)))
Out:
 *** Analysing book ***
 The text has now 52202 words (tokens)

Now we have removed the punctuation and the capital letters, but the most common token is “the”, not a very significant word …
As we have seen last time, these are the so-called stop words, which are very common and are normally stripped from a text when doing this kind of analysis.

Meaningful most common tokens

A simple approach could be to filter the tokens that have a length greater than 5 and a frequency of more than, say, 80.

meaningfulWords = [word for (word,freq) in topBook if len(word) > 5 and freq > 80]
sorted(meaningfulWords)
Out:
['against', 'always', 'because', 'castruccio', 'having',
 'himself', 'others', 'people', 'prince', ...

This would work but would also leave out tokens such as I and you, which are actually significant.
The better approach – which we have seen earlier – is to remove stop words using external lists of stop words.
NLTK has a corpus of stop words in several languages:

from nltk.corpus import stopwords

stopwordsEN = set(stopwords.words('english')) # english language

betterWords = [w for w in bookWords if w not in stopwordsEN]

topBook = get_top_words(betterWords)
  # Output top 20 words
topBook[:20]
Out:
[('one', 302), ('prince', 222), ('would', 165), ('men', 163),
 ('castruccio', 142), ('people', 116), ....

Now we have excluded words such as `the`, but we can further improve the list by looking at related word forms, such as the plural and singular versions of the same word.

In:  'princes' in betterWords
Out: True
In:  betterWords.count("prince") + betterWords.count("princes")
Out: 281

Stemming

Above, in the list of words, we have both prince and princes, which are respectively the singular and plural versions of the same word (the stem). The same happens with verb conjugation (love and loving are counted as different words but are actually inflections of the same verb).
A stemmer is a tool that reduces such inflectional forms to their stem, base or root form, and NLTK has several of them (each with a different heuristic algorithm).

input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
words1
Out: ['list', 'listed', 'lists', 'listing', 'listings']

And now we apply one of the NLTK stemmers, the Porter stemmer.

The Porter stemming method is a rule-based algorithm introduced by Martin Porter in 1980 (paper: “An algorithm for suffix stripping”). The method is not always accurate but it is very fast.

porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']

As you see, all 5 different words have been reduced to the same stem and would now be counted as the same token.

stemmedWords = [porter.stem(w) for w in betterWords]
topBook = get_top_words(stemmedWords)
topBook[:20] # Output top 20 words
Out: 
[('one', 316), ('princ', 281), ('would', 165), ('men', 163),
 ('castruccio', 142), ('state', 137), ....

Now the word princ is counted 281 times, exactly like the sum of prince and princes.

A note here: Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time and often includes the removal of derivational affixes, e.g. Prince and princes become princ.

There are different stemming algorithms but they all follow the same rule-based principles.  The Lancaster stemming algorithm is newer – published in 1990 – but is even more aggressive than the Porter stemming algorithm.
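
As a quick comparison (a sketch, reusing the words1 list from above), NLTK also exposes the Lancaster stemmer with the same stem() interface:

lancaster = nltk.LancasterStemmer()
[lancaster.stem(t) for t in words1]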

A different flavour is lemmatisation, which we will see in a moment, but first a note about stemming in languages other than English.

Stemming in other languages

Snowball is an improvement created by Porter: a small language for writing stemmers, with rules for many more languages than English. For example, Italian:

from nltk.stem.snowball import SnowballStemmer
stemmerIT = SnowballStemmer("italian")

inputIT = "Io ho tre mele gialle, tu hai una mela gialla e due pere verdi"
wordsIT = inputIT.split(' ')

[stemmerIT.stem(w) for w in wordsIT]

['io','ho','tre','mel','gialle,','tu','hai','una',
'mel','giall','e','due','per','verd']
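
Note that here we split the Italian sentence by blank spaces instead of using word_tokenize, which is why the comma stays attached to ‘gialle,’. To see which languages Snowball covers, we can print the list of supported stemmers (a quick sketch using the languages attribute of the class):

print(SnowballStemmer.languages)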

Lemma

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
While a stemmer operates on a single word without knowledge of the context, a lemmatiser can take the context into consideration.

NLTK also has a built-in lemmatiser, so let’s see it in action:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words1
['list', 'listed', 'lists', 'listing', 'listings']

[lemmatizer.lemmatize(w, 'n') for w in words1] # n = nouns
['list', 'listed', 'list', 'listing', 'listing']

We tell the lemmatiser that the words are nouns. In this case it assigns the same lemma to words such as list (singular noun) and lists (plural noun), but leaves the other words as they are.

[lemmatizer.lemmatize(w, 'v') for w in words1] # v = verbs
['list', 'list', 'list', 'list', 'list']

We get a different result if we say that the words are verbs.
They all have the same lemma; in fact they could all be different inflections or conjugations of the same verb.

The word types that can be used are:
‘n’ = noun, ‘v’ = verb, ‘a’ = adjective, ‘r’ = adverb

words2 = ['good', 'better']

[porter.stem(w) for w in words2]
['good', 'better']

[lemmatizer.lemmatize(w, 'a') for w in words2]
['good', 'good']

It works even when the forms of the adjective are completely different: unlike a stemmer, it doesn’t look only at prefixes and suffixes but uses a vocabulary.
You might wonder why stemmers are used at all, instead of always using lemmatisers: stemmers are much simpler, smaller and faster, and for many applications the results are good enough.
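
As a rough illustration of the speed difference (a sketch, not from the original post; the absolute numbers will depend on your machine), we can time both approaches on the book’s words:

import time

start = time.perf_counter()
stemmed = [porter.stem(w) for w in betterWords]
print("Stemming took {:.2f} seconds".format(time.perf_counter() - start))

start = time.perf_counter()
lemmatised = [lemmatizer.lemmatize(w, 'n') for w in betterWords]
print("Lemmatising took {:.2f} seconds".format(time.perf_counter() - start))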

Now we lemmatise the book:

lemmatisedWords = [lemmatizer.lemmatize(w, 'n') for w in betterWords]
topBook = get_top_words(lemmatisedWords)
topBook[:20] # Output top 20 words
Out:
[('one', 316), ('prince', 281), ('would', 165), ('men', 163),
 ('castruccio', 142), ('state', 130), ('time', 129), ('people', 118),...

Yes, the lemma now is prince.
But note that we treated all the words in the book as nouns, while a proper approach would be to apply the correct word type to each single word, as sketched below.
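
One possible way to do that (a sketch, not in the original notebook) is to tag each token with nltk.pos_tag (introduced in the next section) and map the Penn Treebank tags it returns to the word types accepted by the lemmatiser:

def lemmatizeWithPos(tokens):
  # map the first letter of the Penn Treebank tag to a WordNet word type,
  # defaulting to noun when there is no obvious mapping
  tagMap = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
  return [lemmatizer.lemmatize(word, tagMap.get(tag[0], 'n'))
          for (word, tag) in nltk.pos_tag(tokens)]

lemmatizeWithPos(nltk.word_tokenize("The princes were loving their people"))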

Part of speech (PoS)

In traditional grammar, a part of speech (abbreviated form: PoS or POS) is a category of words which have similar grammatical properties.

For example, an adjective (red, big, quiet, …) describes properties, while a verb (throw, walk, have) describes actions or states.

Commonly listed parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection.

Essentially all these word classes exist within all Indo-European languages.

text1 = "Children shouldn't drink a sugary drink before bed."
tokensT1 = nltk.word_tokenize(text1)
nltk.pos_tag(tokensT1)

[('Children', 'NNP'), ('should', 'MD'), ("n't", 'RB'),('drink', 'VB'),
 ('a', 'DT'), ('sugary', 'JJ'), ('drink', 'NN'), ('before', 'IN'),
 ('bed', 'NN'), ('.', '.')]

The NLTK function `pos_tag()` will tag each token with its estimated PoS.
NLTK uses the Penn Treebank tagset; you can check what each tag acronym means using the NLTK help function:

In:  nltk.help.upenn_tagset('RB')
Out: 
 RB: adverb

Which are the most common PoS in The Prince book?

tokensAndPos = nltk.pos_tag(bookTokens)
posList = [thePOS for (word, thePOS) in tokensAndPos]
fdistPos = nltk.FreqDist(posList)
fdistPos.most_common(5)
[('IN', 7218), ('NN', 5992), ('DT', 5374), (',', 4192), ('PRP', 3489)]

nltk.help.upenn_tagset('IN')
IN: preposition or conjunction, subordinating
 astride among whether out inside pro despite on by  ...

So the most frequent PoS is not the noun (NN) but IN, i.e. prepositions and subordinating conjunctions.

Extra note: Parsing the grammar structure

Words can be ambiguous, and sometimes it is not easy to tell which PoS a word belongs to. For example, in the sentence “visiting aunts can be a nuisance”, is visiting a verb or an adjective?
Tagging a PoS depends on the context, which can be ambiguous.
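
We can ask the tagger directly (a quick sketch; the tagger will simply pick one reading based on its statistical model, and we make no claim here about which one it chooses):

nltk.pos_tag(nltk.word_tokenize("Visiting aunts can be a nuisance"))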

Making sense of a sentence is easier if it follows a well-defined grammatical structure, such as: subject + verb + object.
NLTK allows you to define a formal grammar which can then be used to parse a text. The NLTK ChartParser is a procedure for finding one or more trees (sentences have an internal organisation that can be represented using a tree) corresponding to a grammatically well-formed sentence.

# Parsing sentence structure
text2 = nltk.word_tokenize("Alice loves Bob")
grammar = nltk.CFG.fromstring("""
 S -> NP VP
 VP -> V NP
 NP -> 'Alice' | 'Bob'
 V -> 'loves'
 """)

parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text2)
for tree in trees:
  print(tree)
Out:
 (S (NP Alice) (VP (V loves) (NP Bob)))

This is a “toy grammar”, a small grammar that illustrates the key aspects of parsing. But there is an obvious question as to whether the approach can be scaled up to cover large corpora of natural language. How hard would it be to construct such a set of productions by hand? In general, the answer is: very hard.
Nevertheless, there are efforts to develop broad-coverage grammars, such as weighted and probabilistic grammars.
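
NLTK supports such grammars too. Here is a minimal sketch of a probabilistic version of the toy grammar above (the probabilities are made up for illustration), parsed with the ViterbiParser:

pgrammar = nltk.PCFG.fromstring("""
 S -> NP VP [1.0]
 VP -> V NP [1.0]
 NP -> 'Alice' [0.6] | 'Bob' [0.4]
 V -> 'loves' [1.0]
 """)

viterbiParser = nltk.ViterbiParser(pgrammar)
for tree in viterbiParser.parse(text2):
  print(tree)  # the most probable parse tree, annotated with its probability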

The world outside NLTK

As a final note, NLTK was used here for educational purposes, but you should be aware that it has its own limitations.
NLTK is a solid library but it’s old and slow. In particular, NLTK’s lemmatisation functionality is slow enough that it can become the bottleneck in almost any application that uses it.

For industrial NLP applications, a very performance-minded Python library is SpaCy.io instead.
And for robust multi-lingual support there is polyglot, which has much wider language support than all of the above.

Other tools exist in other programming languages, such as Stanford CoreNLP and Apache OpenNLP, both in Java.
