Markov-Shannon, a simple text generator

On 2 May 2024 By mashimoIn data science, machine learningLeave a comment

Nowadays generating texts using Transformers neural network architectures reached high level of sophistication. But the idea to generate text from existing examples is very old.
For example, a Markov test generator (the idea for this is apparently due to Claude Shannon who described it in his Communication Theory paper) is named after Andrey Markov (1856-1922), who probably never saw one (at the word level, at least). A Markov chain is essentially a finite-state automaton where all the transitions are assigned probabilities (I described them previously in the POS tagging series part).

It works very simply: given a corpus of text, each word in the corpus becomes a state and each two-word sequence is a transition to another state. It’s assigned a probability determined by how many times it appears.
In his paper A Mathematical Theory of Communication, Shannon calls completely random series zero-ordered approximations. Putting randomly selected words after each other yields totally unintelligible lines. As one selects words according to their frequency in a huge corpus, the resulting text gets more natural. If we go further, and we take two-word or three-word or n-word sequences, we get better and better results. One can see these n-word sequences (or n-grams) as transitions from one word to the other.

This is an example of a text generator using this simple idea.

Continue reading →

Shannon’s entropy

On 26 April 2024 By mashimoIn machine learning, SoftwareLeave a comment

In information theory, the entropy of a random variable is the average level of “information” inherent to the variable’s possible outcomes.

The American mathematician Claude Elwood Shannon was interested in how to mathematically model the process of information transmission.

One of the main problems that Shannon encountered was how to measure the information content of a message. He published his findings in a 1948 article, “A Mathematical Theory of Communication” which has become one of the most cited scientific works in history and founded the field of information theory.

Shannon realized that the information content of a message or of any random variable is best thought of as how surprising its content is, how much uncertainty it resolves. As the basic measure of information content, he introduced the unit bit.

If you want to capture the information content of a random variable with N different outcomes that are all equally likely, how many bits does it take? Shannon showed that the answer is given by the
logarithm of N with base 2. At the suggestion of John von Neumann, he called his measure entropy following the notion of entropy that Boltzmann had introduced in thermodynamics and physics. Entropy is denoted with the uppercase Greek letter Eta, which is written like the English letter H.

H = log2(N)

Let’s see some example.

from math import log

log(1,2) #log base 2 of 1

0.0

If the outcome of a random variable can only take on one possible value with probability one, then you will not learn anything new. You already know the outcome it was going to take. So the information content of that random variable is zero.

Continue reading →

A simple generator of names

On 31 March 202426 April 2024 By mashimoIn data science, machine learning, TutorialLeave a comment

This is a character-level predictor: it predicts the next character in a sequence given an input sequence.
To predict the next character the baseline is using a simple lookup table based on the character frequency while a neural network (specifically a Convolutional Neural Network, CNN) allows to get better results.
It’s a classic NLP example, this one is based on a lecture from Karpathy.

You can follow this same notebook on my GitHub repository.

Read the dataset (names)

The input dataset is just a text file of names. Could be anything, this one is a list of given names.

words = open('../datasets/names.txt', 'r').read().splitlines()
len(words)

32033

The dataset contains more than 32k english given names, for example the first 8 names are:

words[:8]

[’emma’, ‘olivia’, ‘ava’, ‘isabella’, ‘sophia’, ‘charlotte’, ‘mia’, ‘amelia’]

A bigram is a two-character slice of a word.
Let’s see the first 3 names’ bigrams, row-by-row.

It’s more clear if we add tags to the start and to the end of each name:

for word in words[:3]:
    chs = ['<S>'] + list(word) + ['<E>']
    for char1,char2 in zip(chs,chs[1:]):
        print (char1, char2)

    <S> e
    e m
    m m
    m a
    a <E>
    <S> o
    o l
    l i
    i v
    v i
    i a
    a <E>
    <S> a
    a v
    v a
    a <E>

Build the bigram frequency dictionary

The idea is that we can count the frequency of each character in the bigram, including when it’s at the start or the end of a name. And we use this frequency as the probability each bigram can happen in the dataset. From there we can predict new names based on the relative frequency.
So first we build a dictionary with each bigram and how many times it appears.

Continue reading “A simple generator of names” →

The arrival of AI

On 28 September 20234 April 2024 By mashimoIn machine learningLeave a comment

AI is here. Or so seems. For sure the feeling is that it’s going through a sort of breathtaking acceleration. AI is intelligent. There’s real learning happening due to the scale of the compute and data.

Will we experience a paperclip apocalypse? Don’t panic, there is no existential threat to humanity. But, as Sun Tzu wrote in Art of War: “If you know the enemy and you know yourself, you need not fear”.

Leon Bottou and Bernhard Schölkopf in “Borges and AI” advocates that the Large Language Models’ capabilities inspire contradictory emotions (fear and greed) and this confusion stems from our ignorance of the nature of these systems and their implications for humankind:

We, humans, need a mental imagery that explains these complex processes through analogy […]. The imagery of science fiction tends to dominate debates of AI. We believe, however, that Jorge Luis Borges’ fiction provides a more compelling imagery and illuminates the relation between language models and AI.
arXiv:2310.01425 [cs.CL]

What’s intelligence?

First of all, one point to be made clear: intelligence is an ability that has existed on this planet since well before human beings so it is not just a human ability.
Before land animals existed – when on Earth there were only primordial oceans – prey and predators challenged each other using intelligence. For example, intelligence is used by a cat to cross the road without being run over.
Therefore let’s not expect intelligence to depend on language, consciousness, culture.
It has nothing to do with these things.

Let’s consider this small story I heard from Neil deGrasse Tyson: a company invited two candidates for an interview. The question for both is, “How tall is this building’s spire?”

Continue reading →

How to transcribe audio using a Large Language Model

On 31 August 202326 April 2024 By mashimoIn machine learning, TutorialLeave a comment

Whisper is an open-source speech recognition model developed by OpenAI

Here we can see how to use it convert an audio into text, a good example of the potentiality of the Large Language Models (LLM), neural networks trained on huge amount of text and audio.

The model can be installed directly from GitHub:

!pip install git+https://github.com/openai/whisper.git -q

Continue reading →

Remote Agile teams

On 18 July 20224 April 2024 By mashimoIn Project ManagementLeave a comment

The Manifesto for Agile Software Development states explicitly that “The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.”

As we have seen, the idea is that face-to-face communication is the preferred way because it eliminates misunderstandigs, confusion and overhead.
But now remote or hybrid working is becoming the norm and is for sure here to stay.

Is the agile process compatible with people working remotely or distributed teams?

Working and collaborating remotely is coming with specific challenges but is not incompatible with Agile. The agile process can also thrive in a remote format, provided that the challenges are taken care of.
To perform best under these circumstances, it is very important to acknowledge these challenges.
We need to accept and embrace them and find strategies to reduce their undesired effects or we need to find ways to live with them.

Overall, being agile has never been more important than now: we need to continually learn, inspect and adapt to the situation that emerges.

Here are some best practices for smart working.

Continue reading →