NLP3o: A web app for simple text analysis

In earlier posts we introduced a few NLP (Natural Language Processing) topics: how to tokenise a text, how to remove punctuation and stop words (the most common words, such as conjunctions), and how to find the most used words.

Now we can see, as an example, how to put all of this into a simple web app.

The entire source code is available on GitHub, so you can follow along more easily, and you can see it in action on a Heroku dyno.

Flask

I will use Flask, a Python micro-framework for creating interactive web applications, which provides support for web templates and routing.
You can find many tutorials on the web about how to install it and get started (it’s very simple).

The app is organised in these folders:

  • flask/     # contains the framework itself
  • app/      # the Python main files
  • app/templates    # the HTML files
  • app/static    # images, text files, anything static

The Web Form

Let’s start by creating a simple HTML page with a text form (for the user to enter a text) and one submit button. I have actually put in a list of several buttons for different functions, but for the moment all the others are disabled and appear greyed out.

To handle our web forms we are going to use the Flask-WTF extension, which in turn wraps the WTForms project in a way that integrates nicely with Flask apps.

This is the app/forms.py file:

from flask_wtf import FlaskForm
from wtforms import TextAreaField
from wtforms.validators import DataRequired

class InputTextForm(FlaskForm):
  inputText = TextAreaField(validators=[DataRequired()])

Web forms are represented in Flask-WTF as classes. A form subclass simply defines the fields of the form as class variables. In our case we are going to add just one field for the text area.

I believe the class is pretty much self-explanatory. We imported the FlaskForm class and the form field class that we need, TextAreaField.

The DataRequired import is a validator, a callable that can be attached to a field to perform validation on the data submitted by the user. The DataRequired validator simply checks that the field is not submitted empty. There are many more validators included with WTForms for different tasks.

Templates

Templates are a way to separate the logic of an application from the layout or presentation of the web page.

With Flask, we can add placeholders for dynamic content to standard HTML pages, enclosed in double curly brackets {{ ... }}. These are filled in dynamically according to the parameters set by the main Python app.

The dynamic content can also include directives, which are enclosed in {% ... %} instead.
For example, this is the app/templates/base.html file, the common part that will be included in every other HTML page:

{% block head %}
   {% if title %}{{ title }}
   {% else %}Welcome to this site
   {% endif %}
{% endblock %}

... some HTML code ...
{% block content %} {% endblock %}

It receives an optional parameter, the title. The directive simply says to use title if it is passed as an argument, and a standard string otherwise.
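To try the conditional on its own, here is a minimal sketch that feeds the same directive to Jinja2 (the template engine that Flask uses) without any web app around it. The inline template string and the variable names are assumptions for the demo; in the app the text lives in base.html.

```python
from jinja2 import Template

# Same if/else directive as in base.html, written on one line.
head = Template(
    "{% if title %}{{ title }}{% else %}Welcome to this site{% endif %}"
)

with_title = head.render(title="NLP3o - Your input")
without_title = head.render()  # no title passed: the else branch is used

print(with_title)
print(without_title)
```

An undefined variable counts as false inside {% if %}, which is why the second call falls back to the standard string.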

Another useful directive is {% extends %}, which lets one template inherit from another.

Template inheritance allows you to build a base “skeleton” template that contains all the common elements of your site and defines blocks that child templates can override.

All the block tag does is tell the template engine that a child template may override those portions of the template.

For example, this is the app/templates/index.html, i.e. the main page:

{% extends "base.html" %}

{% block content %}
... some new overriding HTML code ...
{% endblock %}

This will inherit the head (title, CSS, …) from base.html and also the body (navigation part) but will add specific content to the body.
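Template inheritance can also be tried in isolation. This is only a sketch: the two templates are reduced to one line each and are kept in an in-memory dictionary through Jinja2's DictLoader, whereas the app reads the real files from app/templates. The HTML fragments are made up for the demo.

```python
from jinja2 import DictLoader, Environment

# Hypothetical, heavily shortened versions of base.html and index.html.
templates = {
    "base.html": "<nav>menu</nav>{% block content %}{% endblock %}",
    "index.html": (
        '{% extends "base.html" %}'
        "{% block content %}<p>the form goes here</p>{% endblock %}"
    ),
}

env = Environment(loader=DictLoader(templates))
page = env.get_template("index.html").render()

print(page)
```

The child template only supplies the content block; everything outside the block comes from the base template.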

Now add the web form to the template

The actual fields of our form are rendered by the field objects; we just need to refer to a {{ form.field_name }} template argument in the place where each field should be inserted.
Some fields can take arguments. In our case, we are using a placeholder:

<form method="post" name="inputText">
  You can enter your text here:
  {{ form.inputText(size=400, placeholder="Write something ...") }}
  <button type="submit" value="TA">Basic Analysis</button>
</form>

Note that the name of the text area inside the form (inputText) is the same as the field defined above in the class InputTextForm.

The submit field does not carry any data, so it doesn’t need to be defined in the form class; it will be a regular HTML form field.
It will be managed inside the Python app thanks to the routing mechanism, another important feature available in Flask.

Routing

Routing refers to determining how your application responds to a client request. In a web application, routing is the process of using URLs to drive the user interface (UI).
We want to control, via a Python function, how to respond when the user types the URL …/index (and all other pages) in the browser. For this we can use the Flask decorator @app.route:

from flask import Flask
app = Flask(__name__)

@app.route("/")
@app.route('/index')
def hello():
  return "Hello World!"

In this simple example, the decorators specify that the function hello() is called when the URL is either the root page or /index.
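One way to check the routes without starting a server is Flask's built-in test client. The sketch below repeats the small app and requests both URLs, plus one that has no route; the URL /nowhere is just a made-up example.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
@app.route("/index")
def hello():
    return "Hello World!"


client = app.test_client()
root_body = client.get("/").get_data(as_text=True)
index_body = client.get("/index").get_data(as_text=True)
missing_status = client.get("/nowhere").status_code  # no route defined

print(root_body, index_body, missing_status)
```

Both URLs reach the same function, while an address without a matching route is answered with a 404 status.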
Now we apply the routing mechanism to our main web page.

Render a template

To render the template we have to import the render_template function from the Flask framework.
This function takes a template filename and a variable list of template arguments and returns the rendered template, with all the arguments replaced.

Under the covers, the render_template function substitutes {{...}} blocks with the corresponding values provided as template arguments.

Here is the code to render our main page:

from flask import render_template
from .forms import InputTextForm # the form class defined above

@app.route('/', methods=['GET'])
@app.route('/index', methods=['GET'])
def initial():
  # render the initial main page with a new, empty form
  return render_template('index.html',
                         title = 'NLP3o - Your input',
                         form = InputTextForm())

As you can see, we pass three parameters to the render_template() function: the filename, plus the title to be displayed on the web page and the form object, both of which are used in our template.

It is the app.route decorator that decides which function to call; the function name can be anything.

Remember that in the file forms.py we defined a class called InputTextForm, which has a field called inputText; now we can use it to retrieve the user’s input.

Note that the function initial() is called when the URL is the main page and the HTTP request method is GET. This corresponds to rendering the page and the web form in the browser.
We will invoke a different function for the HTTP request method POST (the web form submit button has been pressed):

@app.route('/', methods=['POST'])
@app.route('/index', methods=['POST'])
def manageRequest():
  # create the form inside the view, so that it is bound to the submitted data
  theInputForm = InputTextForm()
  if theInputForm.validate_on_submit():
    userText = theInputForm.inputText.data
    # do something with the text

Now we have the text entered by the user inside a variable, userText.
We can analyse it and render the results in a new page.
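The split between GET and POST can be tried with a stripped-down sketch that leaves out Flask-WTF and the templates. The returned strings and the function name manage_request are placeholders for this demo; the real app renders HTML pages instead.

```python
from flask import Flask, request

app = Flask(__name__)


@app.route("/index", methods=["GET"])
def initial():
    return "show the empty form"


@app.route("/index", methods=["POST"])
def manage_request():  # hypothetical, reduced version of manageRequest()
    user_text = request.form.get("inputText", "")
    return f"received {len(user_text)} characters"


client = app.test_client()
get_body = client.get("/index").get_data(as_text=True)
post_body = client.post(
    "/index", data={"inputText": "Call me Ishmael."}
).get_data(as_text=True)
put_status = client.put("/index").status_code  # method not handled

print(get_body, post_body, put_status)
```

The same URL is served by two different functions depending on the request method, and a method that neither of them declares is refused with a 405 status.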

Text Analysis

I define in the file nlp3o.py a class that models a text, its attributes and the methods that can be applied on it:

class TextAnalyser:
  'Text object, with analysis methods'
  def __init__(self, inputText, language = "EN"):
    self.text = inputText
    self.tokens = []
    self.language = language

The code above is the initialiser, which runs when an object of class TextAnalyser is instantiated. It requires the text itself as input and, optionally, its language.

Here are some of the methods defined, which come directly from the very first example described in a previous post.

def length(self):
  return len(self.text)

def tokenise(self):
  self.tokens = self.text.split() # split by space; return a list
  return self.tokens

def getTokens(self):
  return len(self.tokens)

def removePunctuation(self):
  import re
  self.text = re.sub(r'([^\s\w_]|_)+', '', self.text.strip()) # remove punctuation

def preprocessText(self, lowercase=True):
  if lowercase:
    self.text = self.text.lower()
  self.removePunctuation()
  self.tokenise() 

def uniqueTokens(self):
  return (len(set(self.tokens)))

def getMostCommonWords(self, n=10):
  from collections import Counter
  wordsCount = Counter(self.tokens) # count the occurrences
  return wordsCount.most_common(n) # the n most frequent (word, count) pairs
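To get a feeling for what these methods return, here is a condensed, stand-alone copy of the class run on a made-up sentence. The sample text is an assumption for illustration; the logic is the same as above, only merged into fewer methods to keep the sketch short.

```python
import re
from collections import Counter


class TextAnalyser:
    """Condensed copy of the class above, for a quick experiment."""

    def __init__(self, inputText, language="EN"):
        self.text = inputText
        self.tokens = []
        self.language = language

    def preprocessText(self, lowercase=True):
        if lowercase:
            self.text = self.text.lower()
        # delete punctuation and underscores, keep letters, digits and spaces
        self.text = re.sub(r'([^\s\w_]|_)+', '', self.text.strip())
        self.tokens = self.text.split()

    def getMostCommonWords(self, n=10):
        return Counter(self.tokens).most_common(n)


myText = TextAnalyser("The cat sat on the mat. The mat was flat!")
myText.preprocessText()

numTokens = len(myText.tokens)
uniqueTokens = len(set(myText.tokens))
topWords = myText.getMostCommonWords(2)

print(numTokens, uniqueTokens, topWords)  # 10 7 [('the', 3), ('mat', 2)]
```

Note that without lowercasing, "The" and "the" would be counted as two different tokens.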

Here we can see how it is instantiated in the views.py file, inside the function manageRequest(), based on the input text that we retrieved from the form:

from .nlp3o import TextAnalyser # the text class

def manageRequest():
  ...
  # start analysing the text
  myText = TextAnalyser(userText) # new object

  # remove punctuation and get tokens
  myText.preprocessText()

  # render the html page
  return render_template('results.html',
                         title='Text Analysis',
                         numChars = myText.length(),
                         numTokens = myText.getTokens(),
                         uniqueTokens = myText.uniqueTokens(),
                         commonWords = myText.getMostCommonWords(10))

As you can see, it is quite simple.

First we create a variable myText of type TextAnalyser: this is an object built from userText, on which we can call its methods.
The first method invoked is preprocessText(), which removes the punctuation and splits the text into tokens. Then, in the call to render_template, the other methods are used to get the text length, the number of tokens, the most common words, and so on.

These results can be used to render the results page, results.html:

<h1>NLP3o - Basic Text Analyser</h1>
This tool analyses your plain text and
tells you the most common words.
<h2>Main statistics:</h2>
Total characters: {{ numChars }} (include spaces)
  Total tokens: {{ numTokens }}
  Unique tokens: {{ uniqueTokens }}
  Lexical diversity: {{ numTokens / uniqueTokens }}
  Top 10 words: {{ commonWords }}

Note the placeholders in curly brackets that will be substituted by the parameters of the render_template function above. Very simple.
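One detail worth keeping in mind: the division for the lexical diversity happens inside the template, so a text made only of punctuation (which leaves no tokens) would divide by zero. A possible alternative, sketched here with a hypothetical helper that is not part of the original code, is to compute the ratio in Python and pass the result to the template.

```python
def lexicalDiversity(numTokens, uniqueTokens):
    """Ratio of total to unique tokens; 0.0 when there are no tokens."""
    if uniqueTokens == 0:
        return 0.0
    return numTokens / uniqueTokens


diversity = round(lexicalDiversity(10, 7), 2)  # 10 tokens, 7 of them unique
emptyCase = lexicalDiversity(0, 0)             # nothing left after cleaning

print(diversity, emptyCase)  # 1.43 0.0
```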

Remove stop words

If you remember from our previous example, the most used words are heavily polluted by common words such as “and”, “with” and “you”, called stop words.

Let’s add to our app the option to strip them from the text.

First we add two additional fields for the checkboxes to the web form class in the file forms.py:

from flask_wtf import FlaskForm
from wtforms import TextAreaField, BooleanField
from wtforms.validators import DataRequired

class InputTextForm(FlaskForm):
    inputText = TextAreaField(validators=[DataRequired()])
    ignoreCase = BooleanField('ignore case', default=True)
    ignoreStopWords = BooleanField('ignore stopwords', default=True)

Then we add two placeholders for them to the web form, to give the user the option to remove the stop words and to lowercase the text.
Changes to the index.html file:

<form>
 ...
<legend>Settings</legend>

  {{ form.ignoreCase }} ignore case
  {{ form.ignoreStopWords }} ignore stopwords
    Language:
    <input type="radio" name="lang" value="EN">English
    <input type="radio" name="lang" value="DE">German
    <input type="radio" name="lang" value="IT">Italian

Note that we added two checkboxes: one for ignoring case (which makes all the text lowercase) and one for removing the stop words. The latter comes with additional radio buttons, because you need to specify the language of the stop words: if I have a German text, I will want to remove the German common words, not the English ones.

We also need to update the function manageRequest() inside the views.py file to consider these two new input fields:

def manageRequest():
  ...

  if theInputForm.validate_on_submit():
    userText = theInputForm.inputText.data
    language = request.form['lang'] # which language? (request comes from flask)

    # start analysing the text
    myText = TextAnalyser(userText, language) # new object, now with language

    myText.preprocessText(lowercase = theInputForm.ignoreCase.data,
                          removeStopWords = theInputForm.ignoreStopWords.data)

As you can see, you access the radio buttons using request.form['lang'] (lang is the name of the buttons specified in the web form) and you access the checkboxes directly from the form class object (using the names defined in the class InputTextForm).
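If the user submits the form without selecting any language, the key lang is simply missing from the request, and indexing request.form with square brackets makes Flask answer with a 400 error. The sketch below shows one possible way around it, using .get() with a default; the fallback to "EN" and the reduced function are assumptions for the demo, not what the original app does.

```python
from flask import Flask, request

app = Flask(__name__)


@app.route("/index", methods=["POST"])
def manage_request():  # hypothetical, reduced version of manageRequest()
    # .get() returns the default when no radio button was selected
    language = request.form.get("lang", "EN")
    return language


client = app.test_client()
chosen = client.post("/index", data={"lang": "DE"}).get_data(as_text=True)
fallback = client.post("/index", data={}).get_data(as_text=True)

print(chosen, fallback)  # DE EN
```

Another option would be to mark one of the radio buttons as checked in the HTML, so that a value is always sent.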

Finally, remember to pass these values to the TextAnalyser methods in the nlp3o.py file:

class TextAnalyser:
  'Text object, with analysis methods'
  def __init__(self, inputText, language = "EN"):
    ...
    self.language = language  # set the language
    self.stopWords = set(readStopwords(language)) # set the stop words
    ...

  def removeStopWords(self):
    self.tokens = [token for token in self.tokens if token not in self.stopWords]

  def preprocessText(self, lowercase=True, removeStopWords=False):
    ...
    if removeStopWords:
      self.removeStopWords()

The readStopwords() function simply reads them from a static file.
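The function itself is not shown in this post, so the following is only a guess at how it could look. It assumes one text file per language with one stop word per line; the file name pattern stopwords_XX.txt and the folder parameter are made up for the sketch, and the demo writes a tiny temporary file instead of using the real one.

```python
import os
import tempfile


def readStopwords(language="EN", folder="app/static"):
    # Assumed layout: <folder>/stopwords_<language>.txt, one word per line.
    path = os.path.join(folder, f"stopwords_{language}.txt")
    with open(path, encoding="utf-8") as stopFile:
        return [line.strip().lower() for line in stopFile if line.strip()]


# Demo with a throwaway file holding three English stop words.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "stopwords_EN.txt"), "w",
              encoding="utf-8") as demoFile:
        demoFile.write("the\nand\non\n")
    stopWords = set(readStopwords("EN", folder))

tokens = ["the", "cat", "sat", "on", "the", "mat"]
filtered = [token for token in tokens if token not in stopWords]

print(filtered)  # ['cat', 'sat', 'mat']
```

Storing the stop words in a set, as the class does, makes each membership test fast even for long texts.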

You can see the complete code in the repository on GitHub.

The code contains more functionality, such as the option to read an entire book (from a predefined static file) and a d3 chart of the top words, very similar to the one we have already seen.

Here are a couple of screenshots of the app:

[Screenshot: Basic Analysis of Melville’s Moby Dick]
[Screenshot: Basic Analysis of Kafka’s Urteil]

Note: this post is part of a series about Machine Learning with Python.
