Developing an AI system with Markov chains
Sure-Fire Success
Markov chains model systems that jump from state to state with predetermined probabilities, but can they help write new columns like this one after learning from previously written articles?
Hard to believe, but true: This edition of my programming column marks its 20-year anniversary. Lately, I've been diving into artificial intelligence topics, a field that is increasingly replacing traditional jobs, but I wonder if a machine could possibly one day replace authors like me? An AI system would certainly have a number of advantages over a human writer, because a robot would no doubt deliver the manuscript on time every time, to the amazement of my editors, who are not fond of my bad habit of watching deadlines swoosh by. Also, I could afford to work less hard and have more time to kick back, relax, and spend my days pursuing my surfing hobby at various Pacific beaches (Figure 1).
Moody like the Weather
An algorithm that writes columns automatically could be implemented as a so-called Markov chain, named after the Russian mathematician Andrei Markov. Markov chains are stochastic processes with different states that change according to predetermined probabilities.
The weather forecasting model shown in Figure 2, for example, says the chance of a sunny day following a "Sun" state is 65 percent. At 30 percent, the weather will change to the "Clouds" state, and at 5 percent, it goes directly to the "Rain" state. The probabilities apply only to the current state; a change to the next state does not require knowledge of any event in the past, so the probability of a transition is very easy to calculate.
In a transition matrix such as that shown in Figure 2, the computer only needs to select the row with the current state, such as the third row for the Clouds state (C), to find out that there is a 25 percent chance of the weather being sunny, a 25 percent chance of rain, and a 50 percent chance of skies remaining cloudy. Equipped with corresponding random generators, the model jumps back and forth between states, and a later analysis of where this Monte Carlo method leads eventually allows conclusions to be drawn about the most likely weather in the future.
Sentences from the Machine
Such "random walks" can predict not only the weather, they can generate sequences of words that sound like they were written by a human author. For this purpose, a program first analyzes the word sequences used in an extensive body of text. It notes which words follow one another in a sentence, and pieces together a new text from learned phrases. For example, it might realize that, in corpus, three given words are sometimes followed by word A and sometimes by word B. When generating new prose, the algorithm will roll an imaginary die each time to determine whether the three initial words are followed by A or B, according to the probabilities in which they occur in the original text.
To start off, the algorithm separates the body of text into words (tokens) and then analyzes it step by step with a rolling window of words. The first (n – 1) words of the window determine the state of the machine at the current time, and the last word in the window is the value the machine assumes during the transition to the next state.
Rolling Window
For example, for a window size of n=3, from the text
Humpty Dumpty sat on a wall,
Humpty Dumpty had a big fall.
it extracts the sequences "Humpty Dumpty sat," "Dumpty sat on," "sat on a," and so on. The first two words (Humpty Dumpty) respectively determine the current state of the machine; the following word ("sat") is the next state value. In the second part of the nursery rhyme, the algorithm stumbles again on the "Humpty Dumpty" state. However, it sees not the word "sat," but rather the word "had" following the current state. When it later creates random text, the algorithm now has two options: It can either follow up with "sat," or it can choose "had," and it will do so according to calculated random values. The result is text that shows slight human qualities because it copies word sequences, but occasionally performs crazy jumps between contexts, which can sound either confusing or funny.
The oldest Programming Snapshot in Linux Pro Magazine dates back to 2003 (although the German edition started in 1997), and Listing 1 [1] reads all raw manuscripts cobbled together of every single issue since then in the file all.txt
. The file name is first passed to the generate_from()
function in line 68. The statemap_gen
function starting in line 37 then splits the file into tokens with token_feed()
and passes on the resulting list of the window()
function from line 26, which accepts n=4 as the window width. As explained previously, the status of the Markov chain in this case is made up of the three last words; the fourth is part of the result state.
Listing 1
randomtext.py
To prepare the tokenizer of the nltk
package, it needs to be installed with pip3 install nltk
. Moreover, the tokenizer dataset punkt still needs to be downloaded separately in a Python shell (Figure 3). The tokenizer interprets punctuation marks, such as commas or colons (the regular expression in line 7 defines them all), as individual tokens. This approach later increases the readability and entertainment value of the random text.
Figure 4 shows how the algorithm generates a state file of the Markov chain from the nursery rhyme and the window length n=3. Note how the statemap generated offers both "sat" and "had" as a follow-up state for "Humpty Dumpty":
('Humpty', 'Dumpty'): ['sat', 'had']
The Python code shows some of the spice of the language, such as the yield()
operator in the token_feed()
function in line 16: So that Python can iterate over the components of an object using a for
loop, it must be of type Iterable
. These objects do not spill out their elements in one go, but they pass them through one piece at a time with each call to yield()
.
The advantage of this process is that the script does not need to store all of the data in a construct in memory; rather, it can get away with only calculating the data when needed. Therefore, an object can contain an infinite number of elements; as long as they are never needed all at once, the illusion stays alive despite limited memory.
Python also provides a variety of data structures. Double-ended queues, called deque
, as used in the window()
function in lines 26-35, can be modified really fast if the code limits itself to adding or removing elements to or from the head or tail of the queue. The only flaw is that the caller of the window()
function cannot manipulate the deque
object returned, or the algorithm gets confused.
For this reason, line 41 uses key=list(state)
to copy the elements of the deque
object into a Python list, which the statemap_gen()
function can modify to its heart's content. It interprets the first (n – 1) elements as a Markov state, and the last item as a possible next state.
In line 43, key
is a Python tuple that can no longer be modified. The dictionary statemap
maps such tuples to one or more follow-up state values. In this way, the algorithm can later look up possible follow-up state values quickly and select one randomly with random.choice()
in line 60, leading the text on unpredictable paths and imbuing the algorithm with human-like traits.
The result can be seen in Figure 5. However, it demonstrates that the applied Markov chain procedure does not automatically provide exciting new magazine articles. The grammar is still lacking enormously, and the content doesn't really make any sense. This column will probably have to be created manually for the foreseeable future.
Infos
- Listing for this article: ftp://ftp.linux-magazine.com/pub/listings/linux-magazine.com/205/
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
New Linux Kernel Patch Allows Forcing a CPU Mitigation
Even when CPU mitigations can consume precious CPU cycles, it might not be a bad idea to allow users to enable them, even if your machine isn't vulnerable.
-
Red Hat Enterprise Linux 9.5 Released
Notify your friends, loved ones, and colleagues that the latest version of RHEL is available with plenty of enhancements.
-
Linux Sees Massive Performance Increase from a Single Line of Code
With one line of code, Intel was able to increase the performance of the Linux kernel by 4,000 percent.
-
Fedora KDE Approved as an Official Spin
If you prefer the Plasma desktop environment and the Fedora distribution, you're in luck because there's now an official spin that is listed on the same level as the Fedora Workstation edition.
-
New Steam Client Ups the Ante for Linux
The latest release from Steam has some pretty cool tricks up its sleeve.
-
Gnome OS Transitioning Toward a General-Purpose Distro
If you're looking for the perfectly vanilla take on the Gnome desktop, Gnome OS might be for you.
-
Fedora 41 Released with New Features
If you're a Fedora fan or just looking for a Linux distribution to help you migrate from Windows, Fedora 41 might be just the ticket.
-
AlmaLinux OS Kitten 10 Gives Power Users a Sneak Preview
If you're looking to kick the tires of AlmaLinux's upstream version, the developers have a purrfect solution.
-
Gnome 47.1 Released with a Few Fixes
The latest release of the Gnome desktop is all about fixing a few nagging issues and not about bringing new features into the mix.
-
System76 Unveils an Ampere-Powered Thelio Desktop
If you're looking for a new desktop system for developing autonomous driving and software-defined vehicle solutions. System76 has you covered.