NLP3o: A web app for simple text analysis

We have seen earlier an introduction to a few NLP (Natural Language Processing) topics, specifically how to tokenise a text, how to remove the punctuation and the stop words (the most common words, such as conjunctions), and how to find the most used words.

Now we can see – as an example – how to put all of this inside a simple web app.

The entire source code is available on GitHub, so you can better follow along; and you can see it in action in a Heroku dyno.

Continue reading “NLP3o: A web app for simple text analysis”

[Link] Algorithms literature

From the Social Media Collective, part of the Microsoft Research labs, an interesting and comprehensive list of studies about algorithms as a social concern.

Our interest in assembling this list was to catalog the emergence of “algorithms” as objects of interest for disciplines beyond mathematics, computer science, and software engineering.

They also try to categorise the studies and add an intriguing timeline visualisation, which shows how much interest algorithms are sparking at the moment.


Chatbots

A brief history of chatbots

A chatbot is a computer program which conducts a conversation via auditory or textual methods.

The term “ChatterBot” was originally coined by Michael Mauldin in 1994 to describe these conversational programs but they are much older, the first one being ELIZA by Joseph Weizenbaum of MIT in 1966.

Leaving the academic world, conversational agents have typically been used in dialog systems such as customer service or information acquisition.
Many large companies started to use automated online assistants instead of call centres with humans, to provide a first point of contact.
Most of these systems ask you to push a digit corresponding to what you want, or to say what you’re calling about; they scan for keywords within the vocal input and then pull the best-matching answer from a database.
These systems are based on simple logic trees (SLT).

An SLT agent relies therefore on a fixed decision tree to gather information and redirect the user.
For example, an insurance bot may ask several questions to determine which policy is ideal for you. Or an airline bot could ask you the departure city, the destination and a time. Or a device diagnostics bot could guide you through the hardware components and tests to find out the issue.
It’s a finite-state dialog: the system completely controls the conversation.
If your input matches what the bot has anticipated, the experience will be seamless. However, if it strays from the answers programmed and stored in the bot’s database, you might hit a dead end. Back to a human to complete the interaction…
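
To make the idea concrete, here is a minimal sketch in Python of such a finite-state dialog. The insurance-style questions and the tree itself are invented for illustration: every node asks a question, each recognised answer moves to the next node, and anything unexpected falls back to a human.

TREE = {
    "start":       {"question": "Do you need a car or a home policy?",
                    "answers": {"car": "car_age", "home": "home_size"}},
    "car_age":     {"question": "Is your car newer than 5 years? (yes/no)",
                    "answers": {"yes": "offer_full", "no": "offer_basic"}},
    "home_size":   {"question": "Is your home larger than 100 sqm? (yes/no)",
                    "answers": {"yes": "offer_full", "no": "offer_basic"}},
    "offer_full":  {"question": "We suggest the full coverage plan. Goodbye!", "answers": {}},
    "offer_basic": {"question": "We suggest the basic plan. Goodbye!", "answers": {}},
}

def runBot():
    node = "start"
    while TREE[node]["answers"]:                  # leaf nodes end the conversation
        reply = input(TREE[node]["question"] + " ").strip().lower()
        if reply in TREE[node]["answers"]:
            node = TREE[node]["answers"][reply]   # follow the tree to the next node
        else:                                     # unexpected input: dead end
            print("Sorry, I did not understand. Transferring you to an operator.")
            return
    print(TREE[node]["question"])

Any input outside the anticipated answers immediately ends the automated flow, which is exactly the dead end described above.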

These were efficient and simple systems but not really effective.
In normal human-to-human dialogue the initiative shifts back and forth between the participants; it is not driven by the system alone.

A very recent trend is to use natural language processing (NLP) and Machine Learning (ML) algorithms, such as those you see in smartphone-based personal assistants (Apple Siri, Microsoft Cortana or Google Now), when talking to your car or your home automation system (Amazon Alexa), or in some messaging platforms.

Continue reading “Chatbots”

[Link] The five keys to a successful Google team

An interesting article from the NY Times about a 2012 Google initiative — code-named Project Aristotle — to study hundreds of Google’s teams and figure out why some stumbled while others soared.

The article itself is a longer and more narrative account of what was posted earlier by one of the lead researchers, Rozovsky. What follows is a summary with highlights; see the article for the entire text.

After months of arranging and looking at the data, Rozovsky and her colleagues were not able to find any patterns, or any evidence that the composition of a team made a difference.

We were dead wrong. Who is on a team matters less than how the team members interact, structure their work, and view their contributions.

As they struggled to figure out what made a team successful,  they looked at what are known as ‘‘group norms’’: the traditions, behavioural standards and unwritten rules that govern how we function when we gather.

Team members may behave in certain ways as individuals but when they gather, the group’s norms typically override individual proclivities and encourage deference to the team.

Project Aristotle’s researchers began searching for instances when team members described a particular behaviour as an ‘‘unwritten rule’’ or explained certain things as part of the ‘‘team’s culture’’, trying to figure out which norms mattered most.

There were other behaviors that seemed important as well — like making sure teams had clear goals and creating a culture of dependability. But Google’s data indicated that psychological safety, more than anything else, was critical to making a team work.

Psychological safety is ‘‘a sense of confidence that the team will not embarrass, reject or punish someone for speaking up,’’ Edmondson wrote in a study published in 1999. ‘‘It describes a team climate characterised by interpersonal trust and mutual respect in which people are comfortable being themselves.’’

However, establishing psychological safety is, by its very nature, somewhat messy and difficult to implement.

What Project Aristotle has taught people within Google is that no one wants to put on a ‘‘work face’’ when they get to the office. No one wants to leave part of their personality and inner life at home. We can’t be focused just on efficiency. Rather, when we start the morning by collaborating with a team of engineers and then send emails to our marketing colleagues and then jump on a conference call, we want to know that those people really hear us. We want to know that work is more than just labor.

Finally, Google created a 10-minute exercise that summarises how the team is doing on five key dynamics:

  1. Psychological safety: Can we take risks on this team without feeling insecure or embarrassed?
  2. Dependability: Can we count on each other to do high quality work on time?
  3. Structure & clarity: Are goals, roles, and execution plans on our team clear?
  4. Meaning of work: Are we working on something that is personally important for each of us?
  5. Impact of work: Do we fundamentally believe that the work we’re doing matters?


The Hello World of text processing

You can do many interesting things with text! There is an entire area of Machine Learning called Natural Language Processing (NLP) which covers any kind of machine manipulation of natural human languages.

I want to show here a couple of pre-processing examples that are normally the basis for more sophisticated models and algorithms applied in NLP.

A sort of “Hello world” program: the simplest working program, but an illustrative one.
When you have a collection of texts, it is really useful to first build a vocabulary with word counts.
Lots of useful things can come from it, from the simplest metrics like term frequency to sophisticated classifiers.

This example is also available as Jupyter notebook on GitHub.

Tokens

The basic atomic parts of each text are the tokens. A token is the NLP name for a sequence of characters that we want to treat as a group.

For example, we can consider groups of characters separated by blank spaces, therefore forming words.

Tokens can be extracted by splitting the text using the Python function split().
Here is a simple example:

In [1]: sampleText = " The Elephant's 4 legs: this is THE Pub. \
  Hi, you, my super_friend! You can't believe what happened to \
  our * common friend *, the butcher. How are you?"
In [2]: textTokens = sampleText.split() # by default split by spaces
In [3]: textTokens
Out[3]: ['The', "Elephant's", '4', 'legs:', 'this', ...]
In [4]: print ("The sample text has {} tokens".format (len(textTokens)))
Out[4]: The sample text has 28 tokens

As you can see, tokens are words but also symbols like *. This is because we have split the string simply by using blank spaces. But you can pass other separators to the function split() such as commas.
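
For instance, with a comma as separator (a tiny made-up string, not the sample text above):

"one,two,three".split(',')   # ['one', 'two', 'three']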

Token frequency

I have the number of tokens for this sample text. Now I want to count the frequency for each token.

This can be done quickly by using the Python Counter from the module collections.
Counter is a sub-class of the dictionary data structure specialised for counting objects.

In [5]: from collections import Counter
In [6]: totalWords = Counter(textTokens); 
In [7]: totalWords
Out[7]: Counter({'*': 1,'*,': 1, '4': 1, "Elephant's": 1,
                 'Pub.': 1,'THE': 1, 'The': 1, 'the': 1,
                 'You': 1, 'you,': 1, 'you?': 1, ... })

Counter produces a dictionary with each token as a key and its count as the value.
We see that there are a number of problems:

  • some words (like The/THE or You/you) end up in different tokens because of capital vs. small letters
  • some tokens contain punctuation marks, which makes the same word counted twice (like you/you?)
  • some tokens consist only of symbols (like *)

Luckily we can quickly fix them.

Remove capital letters

This is very easy to do in Python using the lower() function:

In [8]: loweredText = sampleText.lower()
In [9]: loweredText
Out[9]: " the elephant's 4 legs: this is the pub. hi, you, \
my super_friend! you can't believe what happened to our *  \
common friend *, the butcher. how are you?"
In [10]: textTokens = loweredText.split()
In [11]: totalWords = Counter(textTokens);  totalWords
Out[11]: Counter({'*': 1, '*,': 1, '4': 1, "elephant's": 1, 
                  'the': 3, 'you': 1, 'you,': 1,'you?': 1, ... })

Now the token “the” is correctly counted 3 times!

But other words like “you” are still wrongly counted because of the punctuation such as comma or question mark.

Remove punctuation and trailing spaces

Removing the extra spaces is very easy, by using the string function strip():

In [12]: strippedText = loweredText.strip(); strippedText
Out[12]: "the elephant's 4 legs: this is the pub. hi, you, \
my super_friend! you can't believe what happened to our *  \
common friend *, the butcher. how are you?"

Note that the blank space at the beginning is now gone.

To remove punctuation we can use regular expressions.
Regular expressions (regex for short) are a very powerful tool to match patterns in strings: basically, you describe a pattern with a sequence of meta-characters that have special meanings in the regex language, and apply it to the string you want to transform.

The way to match specific characters in a regex is to list them inside square brackets. For example, [abc] will match only a single a, b or c letter and nothing else.
A shorthand for matching a range of characters is the dash: for example, [a-z] will match any single lowercase letter and [0-9] any single digit character.

To exclude characters from the matching we use the ^ (hat) symbol, for example [^abc] will match any single character except the letters a, b or c.

Finally, there is also a special symbol for whitespace characters, as they are so ubiquitous in text. Whitespace characters are blank spaces, tabs, newlines and carriage returns. The symbol \s will match any of them.
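
As a quick illustration of these character classes (toy strings made up for the purpose), the function re.findall() returns all the matches of a pattern in a string:

import re

re.findall(r'[abc]', "a1 B2 c3")   # ['a', 'c']        only the listed letters
re.findall(r'[a-z]', "a1 B2 c3")   # ['a', 'c']        any single lowercase letter
re.findall(r'[0-9]', "a1 B2 c3")   # ['1', '2', '3']   any single digit
re.findall(r'[^abc]', "abcd")      # ['d']             anything except a, b or c
re.findall(r'\s', "a b\tc")        # [' ', '\t']       whitespaces: a space and a tab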

The Python function re.sub() from the re module takes as input a pattern to match, a replacement string and a starting string, and returns the starting string with every match of the pattern replaced by the replacement.
For example, re.sub(r'[s]', 'z', "whatsapp") will return “whatzapp”.

So one way to remove the punctuation is to replace any character that is neither a letter, nor a number, nor a whitespace with an empty substring:

In [13]: import re
In [14]: processedText = re.sub(r'[^a-z0-9\s]', '', strippedText) 
In [15]: processedText
Out[15]: 'the elephants 4 legs this is the pub hi you my superfriend you cant \
believe what happened to our common friend the butcher how are you'

Another useful symbol is \w, which matches ANY alphanumeric character (note that it also matches the underscore), while \W matches any NON-alphanumeric character.
So, an alternative way, explicitly removing the underscore as well, could be:

In [14]: processedText = re.sub(r'([^\s\w]|_)+', '', strippedText) 
In [15]: processedText
Out[15]: 'the elephants 4 legs this is the pub hi you my superfriend you cant \
believe what happened to our common friend the butcher how are you'

Regular expressions are really powerful and used extensively in text processing. You can find many tutorials on the web.

In [16]: textTokens = processedText.split()
In [17]: totalWords = Counter(textTokens); totalWords
Out[17]: Counter({'4': 1,'the': 3, 'you': 3, ... })

Finally, we now have the correct counts for the words ‘the’ and ‘you’!

Sorting the tokens

This is not really necessary, but a collection can be sorted easily:

In [18]: print (totalWords.items())    # you can access each item
Out[18]: dict_items([('4', 1), ('you', 3), ('the', 3), ... ])
In [19]: sorted(totalWords.items(), key=lambda x:x[1], reverse=True)
Out[19]: [('the', 3), ('you', 3), ('cant', 1), ... ('4', 1)]

The Python function sorted() takes care of the sorting; it takes as input the data structure to be sorted (totalWords.items() in this case, which is a list of tuples), a sorting key (i.e. what should be used to sort) and a flag telling whether the order is ascending or descending.
In this case the sorting key is an anonymous function (a sort of “one-shot” function without a name), called a lambda in Python, which takes each item as input (the “x”) and returns its second element (which is the token count).
For example, for the second totalWords item, x = ('you', 3): x[0] is the first tuple element, 'you', and x[1] is the second tuple element, 3, which is the desired sorting key.

Lambda functions are used extensively in Python and we will talk more about them in the future.
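
Just to show what the lambda above replaces, here is the equivalent version with a small named function (the name secondElement is arbitrary):

def secondElement(pair):
    return pair[1]            # same job as lambda x: x[1]

sorted(totalWords.items(), key=secondElement, reverse=True)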

Let’s put these results into functions we can re-use

def tokenise(text):
    return text.split() # split by space; return a list


def removePunctuation(text):
    return re.sub(r'([^\s\w_]|_)+', '', text.strip())


def preprocessText(text, lowercase=True):
    if lowercase:
        processedText = removePunctuation(text.lower())
    else:
        processedText = removePunctuation(text)
    return tokenise(processedText)

def getMostCommonWords(tokens, n=10):
    wordsCount = Counter(tokens) # count the occurrences
    return wordsCount.most_common()[:n]

Process a text file (tokenisation)

Python makes working with files pretty simple.
We can use one of the books from Project Gutenberg, a library of digital books freely available for download.
In this case I downloaded “The Prince”, a book by Machiavelli.

fileName = "theprince.txt"

First of all, we need to open the file using the Python function open() that returns a file object.
We can then read the file content using the file method read():

f = open(fileName)
try:
    theText = f.read()  # this is a giant String
finally:
    f.close()   # we should always close the file once finished

print ("*** Analysing text: {}".format(fileName)) 
print ("The text is {} chars long".format (len(theText)))

tokens = preprocessText(theText)

print ("The text has {} tokens".format (len(tokens)))


 *** Analysing text: theprince.txt
 The text is 300814 chars long
 The text has 52536 tokens

This is an easy way to read the entire text but:

  1. It’s quite memory-inefficient
  2. It’s slower, because all processing is deferred until the whole file has been read into memory, rather than happening as the data is read.

A better way when handling large or complex files is to read them line by line.

We can do this with a simple loop which will go through each line.
Note that the with statement will automatically close the file at the end of the block.

textTokens = [] # tokens will be added here
lines = 0       # just a counter

with open(fileName) as f:
    for line in f:
        # I know that the very first line is the book title
        if lines == 0:
            print ("*** Title: {}".format(line))

        # every line gets processed
        lineTokens = preprocessText(line)
        # append the tokens to my list
        textTokens.extend(lineTokens)

        lines += 1 # finally increment the counter

print ("The text has {} lines".format (lines)) 
print ("The text has {} tokens".format (len(textTokens)))


*** Title: The Project Gutenberg EBook of The Prince, by Nicolo Machiavelli
The text has 5064 lines
The text has 52536 tokens


In [20]: textTokens[:10]  # display the first 10 tokens
Out[20]: ['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'prince',
          'by', 'nicolo', 'machiavelli']

Unique tokens

Another useful data structure in Python is the set, which is an unordered collection of distinct objects.
Putting the tokens into a set means that each distinct token is stored only once, resulting in a more compact collection, useful to see how many distinct tokens are in a text or to check whether a specific token is in the text or not.

In [21]: uniqueTokens = set(textTokens)

In [22]: print ("The text has {} unique tokens".format (len(uniqueTokens)))
In [23]: print ("Lexical diversity: each token in average is repeated 
                {} times".format(len(textTokens) / len(uniqueTokens)))

Out[22]: The text has 5630 unique tokens
Out[23]: Lexical diversity: each token in average is repeated 9.33 times
In [24]: sorted(uniqueTokens)[200:205]  # can be sorted alphabetically
Out[24]: ['accumulate', 'accuse', 'accused', 'accustomed', 'accustoms']

In [25]: 'accuse' in uniqueTokens
Out[25]: True

In [26]: 'phone' in uniqueTokens
Out[26]: False

Remove “stop words”

In [27]: getMostCommonWords(textTokens, 5)
Out[27]: 
[('the', 3109), ('to', 2107), ('and', 1935), ('of', 1802), ('in', 993)]

As you can see, the most common words in this text are not really meaningful, but we can remove in advance all the generic words, like articles and conjunctions, to get a better idea of what the text is about.

Stop words are those common words that do not contribute much to the content or meaning of a document (for the English language these could be “the”, “a”, “is”, “to”, etc.).
There are several lists of stop words available for many languages; for English you can find several files linked from the Wikipedia article on stop words.

The way to filter these words is to read the stop words from a file and put them into a Python set; then you can check, for each of your text tokens, whether it is in that set.

f = open("stopwords.txt")
stopWordsText = f.read().splitlines() # splitlines is used to remove newlines
f.close()
stopWords = set(stopWordsText)   # create a set
# keep only tokens which are NOT in the stopwords set
betterTokens = [token for token in textTokens if token not in stopWords]
  
In [28]: betterTokens[:10]  # display the first 10 tokens now
Out[28]: ['project', 'gutenberg', 'ebook', 'prince', 'nicolo',
          'machiavelli', 'ebook', 'use', 'anyone', 'anywhere']

As you can see, the tokens have now been stripped of stop words like “the”.
Let’s see which are now the most common tokens in “The Prince”:

In [29]: getMostCommonWords(betterTokens)
Out[29]: [('prince', 222), ('men', 161), ('castruccio', 136),
       ('people', 115), ('many', 101), ('others', 96), ('time', 93),
       ('great', 89), ('duke', 88), ('project', 87)]

Generate a word cloud from a text

Here is a small example: a word cloud is an image composed of the words used in a particular text, in which the size of each word indicates its frequency.

There are many online services where you can upload a list of words and get back a word cloud. I used this one: it is a JavaScript web service which accepts a text file with a token and its occurrence count on each line.
So, I just need to tokenise a novel and count its tokens as we have seen above.
To make it more interesting we take as text a novel accessed directly from the web (the Project Gutenberg site).
The module urllib.request provides functions to open a file object given a URL.

from urllib.request import urlopen

def getWordCloud(filename):
    textTokens = [] # tokens will be added here
    lines = 0       # line counter
    path = "http://www.gutenberg.org/files/"      # web page URL
    url = path + filename + "/" + filename + ".txt"  # novel URL
    f = urlopen(url)
 
    for line in f:
        # every line gets processed
        # texts are encoded in Unicode (UTF-8)
        lineTokens = preprocessText(line.decode('utf-8'))
        # append the tokens to my list
        textTokens.extend(lineTokens)

        lines += 1

    # now remove the stop words
    fs = open("stopwords.txt")
    stopWords = set(fs.read().splitlines())
    fs.close()
    betterTokens = [token for token in textTokens if token not in stopWords]

    wordsCount = Counter(betterTokens) # count the occurrences

    # put each token and its occurrence in a file
    # each line is: "occurrence" "token"
    with open("wordcloud_" + filename + ".txt", 'a') as fw:
        for line in wordsCount.most_common():
            fw.write(str(line[1]) + ' ' + line[0] + '\n')

Now we generate the two clouds.
I use two totally different novels from roughly the same period: Pride and Prejudice by Jane Austen (filename 1342 in Gutenberg) and A Sentimental Journey Through France and Italy by Laurence Sterne (filename 804 in Gutenberg).

In [30]: getWordCloud("1342") # Jane Austen


Unsurprisingly, the most common words are Mr, Elizabeth and Darcy.

In [31]: getWordCloud("804") # Laurence Sterne

Here “said” is the most common word.

That’s all.
This example is also available as Jupyter notebook on GitHub.

Covariance and correlation

We have seen how to show the relation between two or more variables visually, using the scatter plot.

Let’s see now how to measure the relation between two variables in a mathematical way.

Covariance

Covariance is a value used in statistics to describe the linear relationship between two variables. It describes both how the variables vary together (a measure of how much one variable goes up when the other goes up) and the direction of their relationship: a positive covariance indicates that when one variable increases, the other also increases, and when one decreases the other decreases; if instead the covariance is negative, an increase in one goes together with a decrease in the other.
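
Here is a minimal numeric sketch of the idea, using NumPy and two small made-up samples: the covariance is the average product of the deviations from the means, and np.cov() computes the same value.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])     # y grows with x, so we expect a positive covariance

# sample covariance: average product of the deviations from the means
covXY = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
print(covXY)                       # 5.0

print(np.cov(x, y)[0, 1])          # 5.0, the same value taken from the covariance matrix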

This answer on CrossValidated shows a nice visual explanation of covariance based on scatter plots:

Red “rectangles” between pairs are positive relations, blue are negative.

Continue reading “Covariance and correlation”