Text generation with Transformer neural network architectures has nowadays reached a remarkable level of sophistication. But the idea of generating text from existing examples is much older.
For example, the Markov text generator (an idea apparently due to Claude Shannon, who described it in his paper on communication theory) is named after Andrey Markov (1856-1922), who probably never saw one, at least not at the word level. A Markov chain is essentially a finite-state automaton in which every transition is assigned a probability (I described them previously in the POS tagging part of this series).
It works very simply: given a corpus of text, each word in the corpus becomes a state, and each two-word sequence observed in the corpus becomes a transition between states. Each transition is assigned a probability determined by how often that sequence appears.
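The mechanism described above can be sketched in a few lines of Python. This is only a minimal illustration of the idea (all function names and the toy corpus are mine): listing each observed successor once per occurrence means that sampling uniformly from the list reproduces the transition probabilities.

```python
import random
from collections import defaultdict

def build_chain(corpus):
    """Map each word to the list of words observed after it.
    Repeats in the list encode the transition frequencies, so a
    uniform random.choice over it samples with corpus probability."""
    words = corpus.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=10):
    """Walk the chain from a start word, sampling one transition per step."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: the word never occurs mid-corpus
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

corpus = "the cat sat on the mat and the cat ran"
chain = build_chain(corpus)
print(generate(chain, "the"))
```

In this toy corpus, "the" is followed twice by "cat" and once by "mat", so the generator picks "cat" with probability 2/3.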
In his paper A Mathematical Theory of Communication, Shannon calls a completely random series of words a zero-order approximation: putting randomly selected words after each other yields totally unintelligible lines. If one instead selects words according to their frequency in a large corpus, the resulting text becomes more natural. Going further and conditioning on two-word, three-word, or generally n-word sequences gives better and better results. One can see these n-word sequences (or n-grams) as transitions from one word to the next.
Here is an example of a text generator built on this simple idea.
Continue reading “Markov-Shannon, a simple text generator”