Parts-of-Speech (POS) Tagging

Part-of-speech refers to the category of words (Noun, Verb, Adjective…) in the language.
Part-of-speech (POS) tagging is the process of assigning a part-of-speech tag to each word in an input text. Tagging is difficult because some words can represent more than one part of speech at different times: they are ambiguous. Let’s look at the following example:

  • The whole team played well. [adverb]
  • You are doing well for yourself. [adjective]
  • Well, this assignment took me forever to complete. [interjection]
  • The well is dry. [noun]
  • Tears were beginning to well in her eyes. [verb]

POS tags are useful because you can use them to make assumptions about semantics. This is critically important in search queries: identifying the proper noun, the organization, the stock symbol, or anything similar would greatly improve everything from speech recognition to search. They are also used for identifying named entities.

A ‘tag set’ is the collection of part-of-speech designations used, where each two- or three-letter tag has a specific meaning; we will list the full tag set used here (the Penn Treebank one) further below.

The goal

A simple tagger can be built using the probability that a word belongs to a specific tag.
To find out the probabilities we can use an existing dataset of sentences (articles from the WSJ) whose words have been properly tagged. Unambiguous words will have a probability near 100% of belonging to a tag, while for ambiguous words, it depends on how often they are used according to each tag.
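As a tiny illustration of the idea (with invented counts, not the real WSJ numbers), the tagger simply picks, for each word, the tag it has been seen with most often:

# Toy example with invented counts: pick the tag seen most often with the word
toy_counts = {("NN", "well"): 10, ("RB", "well"): 25, ("VB", "well"): 2}

def most_frequent_tag(word, counts):
    candidates = {tag: c for (tag, w), c in counts.items() if w == word}
    return max(candidates, key=candidates.get) if candidates else None

print(most_frequent_tag("well", toy_counts))  # 'RB' with these invented counts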

Data Sources

This notebook will use two tagged data sets collected from the Wall Street Journal (WSJ):

  • One data set (WSJ_02-21.pos) will be used for training.
  • The other (WSJ_24.pos) will be used for testing.
  • The tagged training data has been preprocessed to form a vocabulary (hmm_vocab.txt).
    The words in the vocabulary are words from the training set that were used two or more times.
    The vocabulary is augmented with a set of ‘unknown word tokens’.
   
# load in the training corpus
with open("../datasets/WSJ_02-21.pos", 'r') as f:
    training_corpus = f.readlines()  # list

print("A few items of the training corpus list: ")
print(training_corpus[0:5])

    A few items of the training corpus list: 
    ['In\tIN\n', 'an\tDT\n', 'Oct.\tNNP\n', '19\tCD\n', 'review\tNN\n']
len(training_corpus)

    989860

As you can see, the training_corpus is a list with all words extracted from English articles, together with their POS tag.
Almost one million of them!

# load in the test corpus
with open("../datasets/WSJ_24.pos", 'r') as f:
    testing_corpus = f.readlines()  # list

print("A sample of the testing corpus")
print(testing_corpus[0:10])

    A sample of the testing corpus
    ['The\tDT\n', 'economy\tNN\n', "'s\tPOS\n", 'temperature\tNN\n', 'will\tMD\n', 'be\tVB\n', 'taken\tVBN\n', 'from\tIN\n', 'several\tJJ\n', 'vantage\tNN\n']

len(testing_corpus)

    34199

The testing corpus has the same format but comes from a different section of the WSJ, so these sentences were not seen during training.
It will be used at the end for calculating the model’s accuracy.

# read the vocabulary data, split by each line of text, and save the list
with open("../datasets/hmm_vocab.txt", 'r') as f:
    voc_l = f.read().split('\n')  # list

print("A few items of the vocabulary list")
print(voc_l[0:25])
print()
print("A few items at the end of the vocabulary list")
print(voc_l[-25:])

    A few items of the vocabulary list
    ['!', '#', '$', '%', '&', "'", "''", "'40s", "'60s", "'70s", "'80s", "'86", "'90s", "'N", "'S", "'d", "'em", "'ll", "'m", "'n'", "'re", "'s", "'til", "'ve", '(']
    
    A few items at the end of the vocabulary list
    ['yields', 'you', 'young', 'younger', 'youngest', 'youngsters', 'your', 'yourself', 'youth', 'youthful', 'yuppie', 'yuppies', 'zero', 'zero-coupon', 'zeroing', 'zeros', 'zinc', 'zip', 'zombie', 'zone', 'zones', 'zoning', '{', '}', '']
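We can also peek at the special ‘unknown word’ tokens that were added to the vocabulary (their exact names depend on how hmm_vocab.txt was generated, so this is just a quick check):

# Inspect the special tokens added to the vocabulary
# (their exact names depend on how hmm_vocab.txt was generated)
print([w for w in voc_l if w.startswith("--")])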

# vocab: dictionary that has the index of the corresponding words
vocabulary = {} 

# Get the index of the corresponding words. 
for i, word in enumerate(sorted(voc_l)): 
    vocabulary[word] = i       
    


len(vocabulary)

    23777

The vocabulary is a dictionary mapping each word to a unique index; there are almost 24K of them.
The unique words were extracted from the training corpus (those used two or more times, plus the special unknown tokens).

The first 20 words:

print("Vocabulary dictionary: key is the word, value is a unique integer")
cnt = 0
for k,v in vocabulary.items():
    print(f"{k}:{v}")
    cnt += 1
    if cnt > 20:
        break

    Vocabulary dictionary: key is the word, value is a unique integer
    :0
    !:1
    #:2
    $:3
    %:4
    &:5
    ':6
    '':7
    '40s:8
    '60s:9
    '70s:10
    '80s:11
    '86:12
    '90s:13
    'N:14
    'S:15
    'd:16
    'em:17
    'll:18
    'm:19
    'n':20

Testing words

The testing set (WSJ_24.pos) has also been preprocessed to remove the tags, forming test.words. This is read in to create y, the list of test words we will preprocess and tag; the true tags from WSJ_24.pos will serve as the reference for the accuracy measurement later on.

with open("../datasets/test.words", 'r') as f:
    y = f.readlines()  # list

len(y)

    34199
print(y[0:20])

    ['The\n', 'economy\n', "'s\n", 'temperature\n', 'will\n', 'be\n', 'taken\n', 'from\n', 'several\n', 'vantage\n', 'points\n', 'this\n', 'week\n', ',\n', 'with\n', 'readings\n', 'on\n', 'trade\n', ',\n', 'output\n']

Parts-of-speech tagging

We will start with the simplest possible parts-of-speech tagger and we will build up to the state of the art.

Training

In this section, we will find the words that are not ambiguous.

  • For example, the word ‘is’ is a verb and it is not ambiguous.
  • In the WSJ corpus, 86% of the tokens are unambiguous (meaning they have only one tag).
  • About 14% are ambiguous (meaning that they have more than one tag).

Before we start predicting the tags of each word, we will need to compute a few dictionaries that will help to generate the tables.

Emission counts

The first dictionary to compute (and the one that will be used initially) is the emissionCounts dictionary. This dictionary will be used to compute:

P(w_i | t_i)

In other words, we will use it to compute the probability of a word given its tag.

In order to compute this probability, we will create an emissionCounts dictionary where

  • The keys are (tag, word)
  • The values are the number of times that pair showed up in the training set.
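A minimal sketch of how this probability will be estimated, assuming the emissionCounts and tagCounts dictionaries computed later in this notebook:

# Sketch: estimate P(word | tag) from the count dictionaries
# (assumes emissionCounts and tagCounts, computed later in this notebook)
def emission_probability(tag, word, emission_counts, tag_counts):
    return emission_counts.get((tag, word), 0) / tag_counts[tag]

# e.g. emission_probability('DT', 'any', emissionCounts, tagCounts)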

Transition counts

The second dictionary is the transitionCounts dictionary, which counts the number of times each tag appears right after another tag.

This dictionary will be used to compute:
P(t_i | t_{i-1})

This is the probability of a tag at position i given the tag at position i-1.

In order to compute it, the transitionCounts dictionary is built so that:

  • the keys are (prev_tag, tag)
  • the values are the number of times those two tags appeared in that order.

To calculate the transition probabilities we only need the parts-of-speech tags from the training corpus. To get the probability of one tag transitioning to another, we first count the occurrences of that tag pair in the corpus; call it N. We then count the number of tag pairs that start with the first tag; call it M. So N out of the M tag sequences starting with the first tag continue with the second, and the ratio N/M is the transition probability of the second tag following the first: P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1}).
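In code, that ratio could be computed like this (a sketch, assuming the transitionCounts dictionary computed later in this notebook):

# Sketch: estimate P(tag | prev_tag) as N / M
# (assumes transitionCounts, computed later in this notebook)
def transition_probability(prev_tag, tag, transition_counts):
    n = transition_counts.get((prev_tag, tag), 0)  # times the pair (prev_tag, tag) occurred
    m = sum(c for (p, _), c in transition_counts.items() if p == prev_tag)  # pairs starting with prev_tag
    return n / m if m > 0 else 0.0

# e.g. transition_probability('DT', 'NNP', transitionCounts)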

Tag counts

The last dictionary we will compute is the tagCounts dictionary:

  • The key is the tag
  • The value is the number of times each tag appeared.

Create the helper dictionaries

We use a helper function that takes in the training_corpus and returns the three dictionaries mentioned above: transitionCounts, emissionCounts, and tagCounts.

  • emissionCounts: maps (tag, word) to the number of times it happened.
  • transitionCounts: maps (prev_tag, tag) to the number of times it has appeared.
  • tagCounts: maps (tag) to the number of times it has occurred.

Implementation note: This routine utilises defaultdict, which is a subclass of dict.

  • A standard Python dictionary throws a KeyError if you try to access an item with a key that is not currently in the dictionary.
  • In contrast, the defaultdict will create an item of the type of the argument, in this case an integer with the default value of 0.
  • See defaultdict.
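A quick demonstration of the difference:

from collections import defaultdict

plain = {}
counts = defaultdict(int)

# plain['missing']        # a standard dict would raise a KeyError here
print(counts['missing'])  # prints 0: the key is created with the default int() value
counts['missing'] += 1    # which makes defaultdict convenient for counting
print(counts['missing'])  # prints 1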

A POS tagger will necessarily encounter words that are not in its datasets.

  • To improve accuracy, these words are further analyzed during preprocessing to extract available hints as to their appropriate tag.
  • For example, the suffix ‘ize’ is a hint that the word is a verb, as in ‘final-ize’ or ‘character-ize’.
  • A set of unknown tokens, such as ‘--unk_verb--’ or ‘--unk_noun--’, will replace the unknown words in both the training and test corpus and will appear in the emission, transition and tag data structures.
import string

# Punctuation characters
punct = set(string.punctuation)

# Morphology rules used to assign unknown word tokens
noun_suffix = ["action", "age", "ance", "cy", "dom", "ee", "ence", "er", "hood", "ion", "ism", "ist", "ity", "ling", "ment", "ness", "or", "ry", "scape", "ship", "ty"]
verb_suffix = ["ate", "ify", "ise", "ize"]
adj_suffix = ["able", "ese", "ful", "i", "ian", "ible", "ic", "ish", "ive", "less", "ly", "ous"]
adv_suffix = ["ward", "wards", "wise"]



def assign_unk(tok):
    """
    Assign unknown word tokens
    """
    # Digits
    if any(char.isdigit() for char in tok):
        return "--unk_digit--"

    # Punctuation
    elif any(char in punct for char in tok):
        return "--unk_punct--"

    # Upper-case
    elif any(char.isupper() for char in tok):
        return "--unk_upper--"

    # Nouns
    elif any(tok.endswith(suffix) for suffix in noun_suffix):
        return "--unk_noun--"

    # Verbs
    elif any(tok.endswith(suffix) for suffix in verb_suffix):
        return "--unk_verb--"

    # Adjectives
    elif any(tok.endswith(suffix) for suffix in adj_suffix):
        return "--unk_adj--"

    # Adverbs
    elif any(tok.endswith(suffix) for suffix in adv_suffix):
        return "--unk_adv--"

    return "--unk--"

# Helper: substitutes words not in the vocabulary with an "unknown" token
def get_word_tag(line, vocab): 
    if not line.split():
        word = "--n--"
        tag = "--s--"
        return word, tag
    else:
        word, tag = line.split()
        if word not in vocab: 
            # Handle unknown words
            word = assign_unk(word)
        return word, tag
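For example (assuming ‘In’ is in the vocabulary and ‘zirconium’, picked here as an arbitrary rare word, is not):

# Example calls on corpus-style lines ('word\tTAG\n')
print(get_word_tag("In\tIN\n", vocabulary))         # ('In', 'IN') if 'In' is in the vocabulary
print(get_word_tag("zirconium\tNN\n", vocabulary))  # ('--unk--', 'NN') if 'zirconium' is not
print(get_word_tag("\n", vocabulary))               # ('--n--', '--s--') for an empty line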

from collections import defaultdict

def create_dictionaries(corpus, vocab):
    """
    Input: 
        corpus: a corpus where each line has a word followed by its tag.
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output: 
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts
        tag_counts: a dictionary where the keys are the tags and the values are the counts
    """
    
    # initialize the dictionaries using defaultdict
    emission_counts = defaultdict(int)
    transition_counts = defaultdict(int)
    tag_counts = defaultdict(int)
    
    # Initialize "prev_tag" (previous tag) with the start state, denoted by '--s--'
    prev_tag = '--s--' 
    
    # use 'i' to track the line number in the corpus
    i = 0 
    
    # Each item in the training corpus contains a word and its POS tag
    # Go through each word and its tag in the training corpus
    for word_tag in corpus:
        
        # Increment the word_tag count
        i += 1
        
        # Every 50,000 words, print the word count, just to see the progress
        if i % 50000 == 0:
            print(f"word count = {i}")
            
        # get the word and tag using the get_word_tag helper function 
        word, tag = get_word_tag(word_tag, vocab) 
        
        # Increment the transition count for the pair (previous tag, current tag)
        transition_counts[(prev_tag, tag)] += 1
        
        # Increment the emission count for the tag and word
        emission_counts[(tag, word)] += 1

        # Increment the tag count
        tag_counts[tag] += 1

        # Set the previous tag to this tag (for the next iteration of the loop)
        prev_tag = tag
        
        
    return emission_counts, transition_counts, tag_counts

emissionCounts, transitionCounts, tagCounts = create_dictionaries(training_corpus, vocabulary)

    word count = 50000
    word count = 100000
    word count = 150000
    word count = 200000
    word count = 250000
    word count = 300000
    word count = 350000
    word count = 400000
    word count = 450000
    word count = 500000
    word count = 550000
    word count = 600000
    word count = 650000
    word count = 700000
    word count = 750000
    word count = 800000
    word count = 850000
    word count = 900000
    word count = 950000

# get all the POS states
tags = sorted(tagCounts.keys())
print(f"Number of POS tags: {len(tags)}")
print("View these POS tags")
print(tags)

    Number of POS tags: 46
    View these POS tags
    ['#', '$', "''", '(', ')', ',', '--s--', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']

We mapped a total of 46 POS tags, which is great: every tag used in the Penn Treebank WSJ data is here (plus the ‘--s--’ start token).

The ‘tags’ are the Parts-of-speech designations found in the training data.

  • ‘NN’ is noun, singular,
  • ‘NNS’ is noun, plural.
  • In addition, there are helpful tags like ‘--s--’ which indicate the start of a sentence.
  • You can get a more complete description at Penn Treebank II tag set.
print("transition examples: ")
for ex in list(transitionCounts.items())[:3]:
    print(ex)
print()

    transition examples: 
    (('--s--', 'IN'), 5050)
    (('IN', 'DT'), 32364)
    (('DT', 'NNP'), 9044)
    

The transition dictionary shows how often we go from one tag to another: for example, from DT (a determiner, an article such as ‘the’ or ‘a’) to NNP (a proper noun) 9,044 times.
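With the tagCounts dictionary we can turn this count into an approximate transition probability (approximate because tagCounts['DT'] counts every occurrence of DT, including a possible corpus-final token):

# Approximate P(NNP | DT): how often a determiner is followed by a proper noun
p_dt_nnp = transitionCounts[('DT', 'NNP')] / tagCounts['DT']
print(f"P(NNP | DT) is roughly {p_dt_nnp:.3f}")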

print("emission examples: ")
for ex in list(emissionCounts.items())[200:203]:
    print (ex)
print()


print("ambiguous word example: ")
for tup,cnt in emissionCounts.items():
    if tup[1] == 'back': print (tup, cnt) 

    emission examples: 
    (('DT', 'any'), 721)
    (('NN', 'decrease'), 7)
    (('NN', 'insider-trading'), 5)
    
    ambiguous word example: 
    ('RB', 'back') 304
    ('VB', 'back') 20
    ('RP', 'back') 84
    ('JJ', 'back') 25
    ('NN', 'back') 29
    ('VBP', 'back') 4


The emissionCounts dictionary is the most interesting one: we can see that the word ‘any’ has been labelled as DT (determiner) 721 times.
We can also look up all the tags used for a specific word (for example ‘back’) and see that most of the time it was tagged as RB (adverb), but not always: it can also be a verb! It’s ambiguous.

Let’s parse some example words and sentences using the new dictionaries.
We create two helper functions first:

# we assume that we have the emission dictionary and the tags available

def getWordPOS(word):
    """
    Input: 
        word: the word to be tagged.
    Output: 
        pos_final: the most probable tag for this word; "- Unknown -" if the word is not in the dictionaries
    """
    count_final = 0
    pos_final = ''

    for pos in tags:

        # define the key as the tuple containing the POS and word
        key = (pos, word)

        # check if the (pos, word) key exists in the emissionCounts dictionary
        if key in emissionCounts:  # emissionCounts is global

            # get the emission count of the (pos, word) tuple
            count = emissionCounts[key]

            # keep track of the POS with the largest count
            if count > count_final:

                # update the final count (largest count)
                count_final = count

                # update the final POS
                pos_final = pos

    if count_final == 0:
        return "- Unknown -"
    else:
        return pos_final

getWordPOS('dog')

    'NN'
getWordPOS('dogs')

    'NNS'

So dog is a singular noun (NN) and dogs is a plural noun (NNS).
So far so good.
Let’s parse an entire sentence; we define another helper function for it:

def getSentencePOS(sentence):
    """
    Input: 
        sentence: the sentence to be tagged. String.
    Output: 
        a list of tuples: each word with its most probable tag
    """
    words = sentence.split(" ")  # split into words (a list)
    sentencePOS = []  # second list for the tags

    for word in words:
        wordPOS = getWordPOS(word)  # find the word's POS
        sentencePOS.append(wordPOS)

    # return a list of tuples: (word, POS)
    return list(zip(words, sentencePOS))  # combine the two lists

print(getSentencePOS("I have a black cat"))

    [('I', 'PRP'), ('have', 'VBP'), ('a', 'DT'), ('black', 'JJ'), ('cat', '- Unknown -')]

What?!
‘Cat’ is an unknown word, but ‘dog’ could be identified?! That hardly seems fair: apparently ‘cat’ does not occur (often enough) in the WSJ training data, so it never received an emission count.

print(getSentencePOS("I have not a big dog"))

    [('I', 'PRP'), ('have', 'VBP'), ('not', 'RB'), ('a', 'DT'), ('big', 'JJ'), ('dog', 'NN')]

OK, back to dogs. Here you can see that ‘I’ is identified as a pronoun (PRP), ‘have’ as a present-tense verb (VBP), and ‘big’ and ‘black’ as adjectives (JJ). Everything is correct.

print(getSentencePOS("A lion is always faster than the turtle"))

    [('A', 'DT'), ('lion', 'NN'), ('is', 'VBZ'), ('always', 'RB'), ('faster', 'RBR'), ('than', 'IN'), ('the', 'DT'), ('turtle', 'NN')]

‘A’ and ‘the’ are articles or determiners (DT), ‘is’ is a verb, 3rd person singular present (VBZ), ‘always’ an adverb (RB) and ‘faster’ a comparative adverb (RBR). Still everything is correct.

print(getSentencePOS("I eat pizza and pasta but I do not eat meat"))

    [('I', 'PRP'), ('eat', 'VB'), ('pizza', 'NN'), ('and', 'CC'), ('pasta', 'NN'), ('but', 'CC'), ('I', 'PRP'), ('do', 'VBP'), ('not', 'RB'), ('eat', 'VB'), ('meat', 'NN')]

Here ‘eat’ is a verb (VB) while ‘and’ and ‘but’ are conjunctions (CC). Good.

print(getSentencePOS("I work in Shanghai"))

    [('I', 'PRP'), ('work', 'NN'), ('in', 'IN'), ('Shanghai', 'NNP')]

And here you can see it fail: ‘work’ is not only a noun; in this sentence it is a verb.
This is a case where the word is ambiguous and the tagger is not sophisticated enough, as it just looks at the absolute counts.
We will build a better one in the next notebook using the transition dictionary, which gives an idea about the context of the word.
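As a small preview of why the transition counts will help, we can compare how often a verb versus a noun follows a personal pronoun (PRP) such as ‘I’ in the training data (a verb should be far more frequent):

# Preview of the context idea: after a personal pronoun (PRP),
# which tag is more frequent in the training data, a present-tense verb (VBP) or a noun (NN)?
print("PRP -> VBP:", transitionCounts[('PRP', 'VBP')])
print("PRP -> NN: ", transitionCounts[('PRP', 'NN')])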
Stay tuned, but for now let’s see how well this simple tagger performs on the testing dataset.

Testing

Now we will test the accuracy of the parts-of-speech tagger using the emissionCounts dictionary.

The testing set (WSJ_24.pos):

  • Contains both the test text and the true tag.
  • The testing set has also been preprocessed to remove the tags, forming test.words.
  • It will now be further processed to identify the ends of sentences and handle words not in the vocabulary.
  • This forms the list y_prepr, the preprocessed text used to test our POS taggers.

Now:

  • Given the preprocessed test corpus y_prepr, we will assign a parts-of-speech tag to every word in that corpus.
  • Using the original tagged testing corpus, we will then compute what percentage of the tags we got correct.
orig = []
y_prepr = []

# we already read the words from the testing dataset into 'y'
for cnt, word in enumerate(y):

    # End of sentence
    if not word.split():
        orig.append(word.strip())
        word = "--n--"
        y_prepr.append(word)
        continue

    # Handle unknown words
    elif word.strip() not in vocabulary:
        orig.append(word.strip())
        word = assign_unk(word)
        y_prepr.append(word)
        continue

    else:
        orig.append(word.strip())
        y_prepr.append(word.strip())

assert(len(orig) == len(y)) # just to be sure
assert(len(y_prepr) == len(y))

#corpus without tags, preprocessed

print('The length of the preprocessed testing corpus: ', len(y_prepr))
print('This is a sample of the testing corpus: ')
print(y_prepr[0:20])

    The length of the preprocessed testing corpus:  34199
    This is a sample of the testing corpus: 
    ['The', 'economy', "'s", 'temperature', 'will', 'be', 'taken', 'from', 'several', '--unk--', 'points', 'this', 'week', ',', 'with', 'readings', 'on', 'trade', ',', 'output']

Prediction

Let’s compute the accuracy of the model:

  • assign a part of speech to a word (the most frequent POS for that word in the training set).
  • then evaluate how well this approach works. Each time we predict based on the most frequent POS for the given word, we check whether the actual POS of that word is the same. If so, the prediction was correct!
  • calculate the accuracy as the number of correct predictions divided by the total number of words for which we predicted the POS tag.
# Initialize the number of correct predictions to zero
num_correct = 0

# Get the (tag, word) tuples, stored as a set
all_words = set(emissionCounts.keys())

# Get the number of (word, POS) tuples in the corpus 'y'
total = len(testing_corpus)

for word, y_tup in zip(y_prepr, testing_corpus):
    # Split the (word, POS) string into a list of two items
    y_tup_l = y_tup.split()
    
    # Verify that y_tup contain both word and POS
    if len(y_tup_l) == 2:
        
        # Set the true POS label for this word
        true_label = y_tup_l[1]
    else:
        # If the y_tup didn't contain word and POS, go to next word
        continue
        
    count_final = 0
    pos_final = ''
    
    if word in vocabulary:
        for pos in tags:
            
            # define the key as the tuple containing the POS and word
            key = (pos,word)
            
            # check if the (pos, word) key exists in the emissionCounts dictionary
            if key in emissionCounts: 
                # get the emission count of the (pos,word) tuple
                count = emissionCounts[key]
                
                # keep track of the POS with the largest count
                if count > count_final:
                    # update the final count (largest count)
                    count_final = count
                    
                    # update the final POS
                    pos_final = pos
                    
        # If the final POS (with the largest count) matches the true POS:
        if pos_final == true_label:
            # Update the number of correct predictions
            num_correct += 1
            
accuracy = num_correct / total

print(f"Accuracy of prediction using predict_pos is {accuracy:.4f}")


    Accuracy of prediction using predict_pos is 0.8889

88.9% is really good for this warm-up exercise. With hidden Markov models, we should be able to get to 95% accuracy.
