How to make up words

The letters E, T, A, O, I, N and S (ETAOINS) are the seven most commonly used letters in the English language. See the earlier post: Counting letters. Perhaps we could communicate with just those seven letters.

https://unscramblex.com is a website that provides all anagrams of character strings up to 15 letters long. There are 178 anagrams of ETAOINS. If you want to include words in which a letter is doubled (EE, TT, etc.) or otherwise appears more than once, you need to repeat letters in the initial character string. EETTAAOOIINNSS gives 690 words, ranging from on to assentation. That’s as good a corpus as any from which to derive a table of probabilities suitable for analysis (or at least illustration) of a Markov chain.

The letter E followed by another E occurs in ten words in the corpus; E followed by T occurs 55 times; E at the end of a word occurs 110 times. I calculated these numbers by feeding the corpus of 690 words into a spreadsheet, using a blank space to signal the start or the end of a word.

From that data, the spreadsheet can calculate transition probabilities. The probability that an E is followed by another E is 0.02, that it is followed by a T is 0.11, and that it comes at the end of a word is 0.23.
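
Here is a minimal sketch of the same calculation in Python rather than a spreadsheet. It assumes the 690-word corpus has been saved to a file called words.txt, one word per line; the filename is my own placeholder.

    from collections import Counter, defaultdict

    # Load the corpus: one word per line, uppercased.
    words = [w.strip().upper() for w in open("words.txt") if w.strip()]

    # Count letter pairs, padding each word with a blank space to
    # mark the start and the end of the word.
    pair_counts = Counter()
    for word in words:
        padded = " " + word + " "
        pair_counts.update(zip(padded, padded[1:]))

    # Turn counts into probabilities: divide each count by the total
    # number of pairs that start with the same letter.
    totals = defaultdict(int)
    for (a, _b), n in pair_counts.items():
        totals[a] += n
    probs = {(a, b): n / totals[a] for (a, b), n in pair_counts.items()}

    print(round(probs[("E", "E")], 2))   # about 0.02
    print(round(probs[("E", "T")], 2))   # about 0.11
    print(round(probs[("E", " ")], 2))   # about 0.23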

A Markov network graph of the tabulated data would have 8 nodes (the seven letters plus the blank space) connected by 64 directed arrows, which is too cumbersome to draw. Here’s the graph for just four nodes: E, T, A and the blank space. The probabilities on the lines exiting a node have to add up to 1.0.
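
To make that constraint concrete, here is the four-node sub-graph written as a transition matrix in code. Only the three E-row values computed above are real; the remaining numbers are invented purely so that each row sums to one.

    # Four states: E, T, A and the word-boundary space. Apart from
    # the E row's 0.02, 0.11 and 0.23, the values are illustrative.
    matrix = {
        "E": {"E": 0.02, "T": 0.11, "A": 0.64, " ": 0.23},
        "T": {"E": 0.30, "T": 0.05, "A": 0.35, " ": 0.30},
        "A": {"E": 0.10, "T": 0.40, "A": 0.05, " ": 0.45},
        " ": {"E": 0.20, "T": 0.30, "A": 0.50, " ": 0.00},
    }

    # The probabilities leaving each node must add up to 1.0.
    for state, row in matrix.items():
        assert abs(sum(row.values()) - 1.0) < 1e-9, state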

The matrix and network would be much larger for the full alphabet. Unscramblex.com is useful for generating words in Scrabble and crossword puzzles, and the corpus of words I used is pretty high-brow; a corpus of words used in everyday speech and writing would yield a different matrix. With further refinement, the matrix provides a signature of someone’s writing style, and can be used to identify style and authorship in the manner of psychometric profiling, e.g. the Cambridge University psychometric profiling tool.

The method also provides a step on the way to decrypting a message encrypted with a simple substitution cipher: knowing which letters are likely to follow one another reduces the search space for the original plaintext message.
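
As a sketch of how that works in practice, a candidate decryption can be scored by the log-probability of its letter pairs under the transition table, and candidates full of rare pairs discarded early. This reuses the probs dictionary built earlier.

    import math

    def score(text, probs, floor=1e-6):
        """Log-probability of a text's letter pairs under the chain."""
        padded = " " + text.upper() + " "
        # Unseen pairs get a small floor probability instead of zero.
        return sum(math.log(probs.get((a, b), floor))
                   for a, b in zip(padded, padded[1:]))

    # A candidate plaintext made of plausible letter sequences should
    # outscore one made of rare pairs:
    # score("SEASON", probs) > score("AOAOAO", probs)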

With greater sophistication, the Markov model can also generate words and sentences. My crude implementation here generates existing and new words such as TE, NESTIN, AS, TESATESA, NOO and SESESES, but also AAAAAAAAA!
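
A minimal version of such a generator, reusing the probs dictionary built earlier, might look like this: start at the word boundary and walk the chain, picking each next letter at random with the transition probabilities as weights, until the walk returns to the boundary.

    import random

    def make_word(probs, letters="ETAOINS "):
        state, out = " ", []
        for _ in range(20):  # cap the length so the walk always stops
            weights = [probs.get((state, c), 0.0) for c in letters]
            state = random.choices(letters, weights=weights)[0]
            if state == " ":
                break
            out.append(state)
        return "".join(out)

    # for _ in range(5):
    #     print(make_word(probs))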

Bibliography

  • Hayes, Brian. 2013. First Links in the Markov Chain. American Scientist, (101) 2. Available online: https://www.americanscientist.org/article/first-links-in-the-markov-chain (accessed 8 December 2021).
  • Lee, Dar-Shyang. 2002. Substitution Deciphering Based on HMMs with Applications to Compressed Document Processing. IEEE Transactions on Pattern Analysis and Machine Intelligence, (24) 12, 1661-1666.
  • Vobbilisetty, Rohit, Fabio Di Troia, Richard M. Low, Corrado Aaron Visaggio, and Mark Stamp. 2017. Classic Cryptanalysis Using Hidden Markov Models. Cryptologia, (41) 1, 1-28.
