The Hello World of text processing

You can do many interesting things with text! There is an entire area of Machine Learning called Natural Language Processing (NLP), which covers any kind of machine manipulation of natural human language.

I want to show here a couple of pre-processing steps that normally form the basis for the more sophisticated models and algorithms used in NLP.

A sort of “Hello World” program: the simplest working program, but an illustrative one.
When you have a collection of texts, it is really useful to start by building a vocabulary with word counts.
Lots of useful things can be derived from it, from simple metrics like term frequency to sophisticated classifiers.

This example is also available as a Jupyter notebook on GitHub.


The basic atomic parts of a text are its tokens. A token is the NLP name for a sequence of characters that we want to treat as a group.

For example, we can consider groups of characters separated by blank spaces, therefore forming words.

Tokens can be extracted by splitting the text with the Python string method split().
Here is a simple example:

In [1]: sampleText = " The Elephant's 4 legs: this is THE Pub. \
  Hi, you, my super_friend! You can't believe what happened to \
  our * common friend *, the butcher. How are you?"
In [2]: textTokens = sampleText.split() # by default split by spaces
In [3]: textTokens
Out[3]: ['The', "Elephant's", '4', 'legs:', 'this', ...]
In [4]: print ("The sample text has {} tokens".format (len(textTokens)))
The sample text has 28 tokens

As you can see, the tokens are words but also symbols like *. This is because we have split the string simply on blank spaces. But you can pass other separators to split(), such as a comma.
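For example, here is a tiny sketch (with a made-up string) of splitting on commas instead of whitespace:

```python
csvLine = "alpha,beta,gamma"  # a made-up comma-separated string
parts = csvLine.split(',')    # split on commas instead of whitespace
print(parts)                  # ['alpha', 'beta', 'gamma']
```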

Token frequency

I have the number of tokens for this sample text; now I want to count the frequency of each token.

This can be done quickly by using the Python Counter from the module collections.
Counter is a subclass of the dictionary data structure, specialised for counting objects.

In [5]: from collections import Counter
In [6]: totalWords = Counter(textTokens); 
In [7]: totalWords
Out[7]: Counter({'*': 1,'*,': 1, '4': 1, "Elephant's": 1,
                 'Pub.': 1,'THE': 1, 'The': 1, 'the': 1,
                 'You': 1, 'you,': 1, 'you?': 1, ... })

Counter produces a dictionary with each token as a key and its count as the value.
We can see a number of problems:

  • some words (like The/THE or You/you) appear as two different tokens because of capital vs. small letters
  • some tokens contain punctuation marks, so the same word gets counted twice (like you/you?)
  • some tokens consist only of symbols (like *)

Luckily we can quickly fix all of them.

Remove capital letters

This is very easy to do in Python using the string method lower():

In [8]: loweredText = sampleText.lower()
In [9]: loweredText
Out[9]: " the elephant's 4 legs: this is the pub. hi, you, \
my super_friend! you can't believe what happened to our *  \
common friend *, the butcher. how are you?"
In [10]: textTokens = loweredText.split()
In [11]: totalWords = Counter(textTokens);  totalWords
Out[11]: Counter({'*': 1, '*,': 1, '4': 1, "elephant's": 1, 
                  'the': 3, 'you': 1, 'you,': 1,'you?': 1, ... })

Now the token “the” is correctly counted 3 times!

But other words like “you” are still counted wrongly, because of punctuation such as the comma or the question mark.

Remove punctuation and trailing spaces

Removing the extra spaces is very easy using the string method strip():

In [12]: strippedText = loweredText.strip(); strippedText
Out[12]: "the elephant's 4 legs: this is the pub. hi, you, \
my super_friend! you can't believe what happened to our *  \
common friend *, the butcher. how are you?"

Note that the initial blank space at the beginning is now gone.

To remove punctuation we can use regular expressions.
Regular expressions (regex for short) are a very powerful tool for matching patterns in strings: you apply a sequence of meta-characters, which have special meanings defined by the regex language, to the string you want to transform.

The way in regex to match specific characters is to list them inside square brackets. For example, [abc] will match only a single a, b or c letter and nothing else.
A shorthand for matching sequential characters is the dash: for example, [a-z] will match any single lowercase letter and [0-9] any single digit character.

To exclude characters from the match we use the ^ (caret) symbol: for example, [^abc] will match any single character except the letters a, b or c.

Finally, there is also a special symbol for whitespace, since it is so ubiquitous in text. Whitespace characters are blank spaces, tabs, newlines and carriage returns; the symbol \s will match any of them.
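As a quick sketch (with a made-up string), the function re.findall() from the re module returns every match of a pattern, which is handy for seeing what these character classes actually select:

```python
import re

sample = "Room 101, floor 2!"
print(re.findall(r'[0-9]', sample))        # every digit: ['1', '0', '1', '2']
print(re.findall(r'[a-z]', sample))        # every lowercase letter
print(re.findall(r'[^a-z0-9\s]', sample))  # everything else: ['R', ',', '!']
```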

The Python function re.sub() from the re module takes as input a pattern to match, a replacement string and a starting string, and returns the starting string with every match of the pattern replaced by the replacement.
For example, re.sub(r'[s]', 'z', "whatsapp") will return "whatzapp".

So one way to remove the punctuation is to replace any character that is NOT a letter, NOT a number and NOT a whitespace with an empty string:

In [13]: import re
In [14]: processedText = re.sub(r'[^a-z0-9\s]', '', strippedText) 
In [15]: processedText
Out[15]: 'the elephants 4 legs this is the pub hi you my superfriend you cant \
believe what happened to our common friend the butcher how are you'

Another useful symbol is \w, which matches ANY alphanumeric character (including the underscore), while \W matches any NON-alphanumeric character.
So an alternative way could be:

In [14]: processedText = re.sub(r'[^\s\w]', '', strippedText)
In [15]: processedText
Out[15]: 'the elephants 4 legs this is the pub hi you my super_friend you cant \
believe what happened to our common friend the butcher how are you'

Note that this time super_friend keeps its underscore, since \w matches it.

Regexes are really powerful and are used extensively in text processing. You can find many tutorials on the web.

In [16]: textTokens = processedText.split()
In [17]: totalWords = Counter(textTokens); totalWords
Out[17]: Counter({'4': 1,'the': 3, 'you': 3, ... })

Finally, we now have the correct counts for the words ‘the’ and ‘you’!

Sorting the tokens

This is not really necessary, but a collection can be sorted easily:

In [18]: print (totalWords.items())    # you can access each item
Out[18]: dict_items([('4', 1), ('you', 3), ('the', 3), ... ])
In [19]: sorted(totalWords.items(), key=lambda x:x[1], reverse=True)
Out[19]: [('the', 3), ('you', 3), ('cant', 1), ... ('4', 1)]

The Python function sorted() takes care of the sorting; it takes as input the data structure to be sorted (totalWords.items() in this case, which is a list of tuples), a sorting key (i.e. what to sort by) and a flag saying whether the order is ascending or descending.
In this case the sorting key is an anonymous function (a sort of “one-shot” function without a name), called a lambda in Python, which takes each item as input (the “x”) and returns its second element (the token count).
For example, the second totalWords item is x=('you', 3), so x[0] is the first tuple element, 'you', and x[1] is the second tuple element, 3, which is the desired sorting key.
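Incidentally, Counter already provides a method most_common() that performs exactly this sort by descending count, so we do not have to write the lambda ourselves. A small sketch with a toy token list:

```python
from collections import Counter

counts = Counter(['the', 'you', 'the', 'you', 'the', 'cant'])
print(counts.most_common())   # [('the', 3), ('you', 2), ('cant', 1)]
print(counts.most_common(2))  # only the top 2: [('the', 3), ('you', 2)]
```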

Lambda functions are used extensively in Python and we will talk more about them in the future.

Let’s put these steps into functions we can re-use:

def tokenise(text):
    return text.split() # split by space; return a list of tokens

def removePunctuation(text):
    return re.sub(r'([^\s\w_]|_)+', '', text.strip())

def preprocessText(text, lowercase=True):
    if lowercase:
        processedText = removePunctuation(text.lower())
    else:
        processedText = removePunctuation(text)
    return tokenise(processedText)

def getMostCommonWords(tokens, n=10):
    wordsCount = Counter(tokens) # count the occurrences
    return wordsCount.most_common()[:n]
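As a quick sanity check (re-running the same pipeline inline, equivalent to preprocessText() followed by getMostCommonWords()), the sample text from before should again give ‘the’ and ‘you’ three occurrences each:

```python
import re
from collections import Counter

sampleText = " The Elephant's 4 legs: this is THE Pub. \
Hi, you, my super_friend! You can't believe what happened to \
our * common friend *, the butcher. How are you?"

# same steps as preprocessText: lowercase, strip, remove punctuation, tokenise
cleaned = re.sub(r'([^\s\w_]|_)+', '', sampleText.lower().strip())
tokens = cleaned.split()

# same as getMostCommonWords, keeping only the top 2
print(Counter(tokens).most_common()[:2])  # [('the', 3), ('you', 3)]
```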

Process a text file (tokenisation)

Python makes working with files pretty simple.
We can use one of the books from Project Gutenberg, a library of digital books freely available for download.
In this case I downloaded “The Prince”, a book by Machiavelli.

fileName = "theprince.txt"

First of all, we need to open the file using the Python function open(), which returns a file object.
We can then read the file's content using the file method read():

f = open(fileName)
theText = f.read()  # this is a giant string
f.close()           # we should always close the file once finished

print ("*** Analysing text: {}".format(fileName)) 
print ("The text is {} chars long".format (len(theText)))

tokens = preprocessText(theText)

print ("The text has {} tokens".format (len(tokens)))

 *** Analysing text: theprince.txt
 The text is 300814 chars long
 The text has 52536 tokens

This is an easy way to read an entire text, but:

  1. it is quite memory-inefficient
  2. it is slower than processing the data while it is read, because all processing is deferred until the whole file has been read into memory

A better way when handling large or complex files is to read them line by line.

We can do this with a simple loop that goes through the file line by line.
Note that the with statement will automatically close the file at the end of the block.

textTokens = [] # tokens will be added here
lines = 0       # just a counter

with open(fileName) as f:
    for line in f:
        # I know that the very first line is the book title
        if lines == 0:
            print ("*** Title: {}".format(line))
        # every line gets processed
        lineTokens = preprocessText(line)
        # append the tokens to my list
        textTokens.extend(lineTokens)
        lines += 1 # finally increment the counter

print ("The text has {} lines".format (lines)) 
print ("The text has {} tokens".format (len(textTokens)))

*** Title: The Project Gutenberg EBook of The Prince, by Nicolo Machiavelli
The text has 5064 lines
The text has 52536 tokens

In [20]: textTokens[:10]  # display the first 10 tokens
Out[20]: ['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'prince',
          'by', 'nicolo', 'machiavelli']

Unique tokens

Another useful data structure in Python is the set, an unordered collection of distinct objects.
Arranging the tokens in a set means each token is stored only once, resulting in a more compact collection, useful for seeing how many distinct tokens a text contains or whether a specific token appears in it.

In [21]: uniqueTokens = set(textTokens)

In [22]: print ("The text has {} unique tokens".format (len(uniqueTokens)))
In [23]: print ("Lexical diversity: each token is repeated on average \
                {:.2f} times".format(len(textTokens) / len(uniqueTokens)))

The text has 5630 unique tokens
Lexical diversity: each token is repeated on average 9.33 times
In [24]: sorted(uniqueTokens)[200:205]  # can be sorted alphabetically
Out[24]: ['accumulate', 'accuse', 'accused', 'accustomed', 'accustoms']

In [25]: 'accuse' in uniqueTokens
Out[25]: True

In [26]: 'phone' in uniqueTokens
Out[26]: False

Remove “stop words”

In [27]: getMostCommonWords(textTokens, 5)
Out[27]: [('the', 3109), ('to', 2107), ('and', 1935), ('of', 1802), ('in', 993)]

As you can see, the most common words in this text are not really meaningful, but we can remove all such generic words, like articles and conjunctions, in advance to get a better idea of the text's tone.

Stop words are common words that do not contribute much to the content or meaning of a document (e.g., for the English language these could be “the”, “a”, “is”, “to”, etc.).
Several lists of stop words are available for many languages; for English you can find a number of such files on the web.

The way to filter these words is to read the stop words from the file and put them into a Python set; then you can check, for each of your text tokens, whether it is also in that set.

f = open("stopwords.txt")
stopWordsText = f.read().splitlines() # splitlines removes the newlines
f.close()
stopWords = set(stopWordsText)   # create a set
# keep only tokens which are NOT in the stopwords set
betterTokens = [token for token in textTokens if token not in stopWords]
In [28]: betterTokens[:10]  # display the first 10 tokens now
Out[28]: ['project', 'gutenberg', 'ebook', 'prince', 'nicolo',
          'machiavelli', 'ebook', 'use', 'anyone', 'anywhere']

As you can see, the tokens have now been stripped of stop words like “the”.
Let’s see which are now the most common tokens in “The Prince”:

In [29]: getMostCommonWords(betterTokens)
Out[29]: [('prince', 222), ('men', 161), ('castruccio', 136),
       ('people', 115), ('many', 101), ('others', 96), ('time', 93),
       ('great', 89), ('duke', 88), ('project', 87)]

Generate a words cloud from a text

Here is a small, fun example: a word cloud is an image composed of the words used in a particular text, in which the size of each word indicates its frequency.

There are many online services where you can upload a list of words and get back a word cloud. I used one that is a JavaScript web service and accepts a text file with one token and its occurrence per line.
So I just need to tokenise a novel and count its tokens, as we have seen above.
To make it more interesting, we take as text a novel accessed directly from the web (the Project Gutenberg site).
The module urllib.request provides functions to open a file object given a URL.

from urllib.request import urlopen

def getWordCloud(filename):
    textTokens = [] # tokens will be added here
    lines = 0       # line counter
    path = ""      # web page URL
    url = path + filename + "/" + filename + ".txt"  # novel URL
    f = urlopen(url)
    for line in f:
        # every line gets processed;
        # texts are encoded in Unicode (UTF-8)
        lineTokens = preprocessText(line.decode('utf-8'))
        # append the tokens to my list
        textTokens.extend(lineTokens)
        lines += 1
    # now remove the stop words
    fs = open("stopwords.txt")
    stopWords = set(fs.read().splitlines())
    fs.close()
    betterTokens = [token for token in textTokens if token not in stopWords]
    wordsCount = Counter(betterTokens) # count the occurrences
    # put each token and its occurrence in a file;
    # each line is: "occurrence" "token"
    with open("wordcloud_"+filename+".txt", 'a') as fw:
        for line in wordsCount.most_common():
            fw.write(str(line[1]) + ' ' + line[0] + '\n')

Now we generate the two clouds.
I use two totally different novels from the same period: Pride and Prejudice by Jane Austen (filename 1342 in Gutenberg) and A Sentimental Journey Through France and Italy by Laurence Sterne (filename 804 in Gutenberg).

In [30]: getWordCloud("1342") # Jane Austen


Unsurprisingly, the most common words are Mr, Elizabeth and Darcy.

In [31]: getWordCloud("804") # Laurence Sterne

Here “said” is the most common word.

That’s all.
This example is also available as a Jupyter notebook on GitHub.