In spite of its esoteric mathematical intricacies, automated natural language processing (NLP), as in conversational AI, draws on at least one primal everyday phenomenon. I’m referring to the concept of periodicity, i.e., cycles, periods, rhythms, repetitions, etc. Periodicity is a major principle through which we understand time, temporality, ordering, and sequencing, and it permeates so much of human experience, not least the life and labour of urban living. In his book Rhythmanalysis, the philosopher Henri Lefebvre observed:
Great cyclical rhythms last for a period and restart: dawn, always new, often superb, inaugurates the return of the everyday.
In my book The Tuning of Place, I argued for an appreciation of the complex interactions between periodicities, from variations in astronomical cycles (sun and moon transits) to machine components that are out of sync and set up secondary vibrations. See post Time and tide wait for no one.
Sequencing in NLP
It’s time to say something about the means by which natural language processing (NLP) following the so-called Transformer model (as in ChatGPT) captures information about sequences of tokens (words). It does so by exploiting properties of disparate periodic cycles, i.e. sine and cosine curves of different periodicities. The method also applies to training a neural network to detect patterns in sequences of steps in a planning task, as explained in my previous posts. See Words in order, AI learns ABC, and Robot probes city grid.
The method takes a concept that we normally think of in linear terms, a sequence of words stretching ad infinitum, and periodises it, i.e., turns it into cycles that repeat.
During training, NLP neural networks receive sequences of tokens as inputs. The network does not receive these tokens as strings of characters, but as vectors, i.e. lists of floating point numbers that capture the relationships between any token and all the other tokens in the training corpus. The vector consists of the coordinates of the token in N-dimensional semantic feature space, itself derived through pre-processing in a neural network. See post Architecture in multi-dimensional feature space.
The NLP program effectively stores these semantic encoding (often called “embedding”) vectors in a lookup table, and converts human-readable token sequences to vector encodings and vice versa.
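The lookup table can be sketched in a few lines of Python. This is a toy illustration only: the three tokens, their 3-dimensional vectors, and the function names `encode` and `decode` are all made up for the example, and decoding by nearest Euclidean neighbour is one plausible choice, not necessarily what any particular NLP system does.

```python
import math

# Toy semantic encodings: invented 3-dimensional vectors, for illustration only.
EMBEDDINGS = {
    "street": [0.9, 0.1, 0.0],
    "road":   [0.8, 0.2, 0.1],
    "tide":   [0.0, 0.9, 0.4],
}

def encode(tokens):
    """Convert human-readable tokens to their stored semantic vectors."""
    return [EMBEDDINGS[t] for t in tokens]

def decode(vector):
    """Return the token whose stored vector is nearest (Euclidean) to `vector`."""
    return min(EMBEDDINGS, key=lambda t: math.dist(EMBEDDINGS[t], vector))

print(encode(["street", "tide"]))
print(decode([0.88, 0.12, 0.02]))  # nearest stored vector is that of "street"
```

Note that "street" and "road" sit close together in this invented space, which mimics the way tokens with similar contexts end up with similar vectors.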
In earlier blog posts I explored 2-dimensional semantic encodings, as well as encodings in 4 dimensions and 5 dimensions, and the illustration that follows will use 10-dimensional encodings. But serious NLP systems typically deploy many more dimensions. In fact, there are repositories of encoding vectors available online. I’ve recently experimented with the semantic encodings from Facebook, which offers encodings of 300 dimensions. These are trained on text sources such as Wikipedia and a repository of web data at commoncrawl.org and are available for over 150 languages.
In the post AI learns ABC, I demonstrated how a training program calculates a list of positional encoding vectors across a context window of, say, 6 tokens in sequence and adds each of these vectors to the semantic encoding vector of the corresponding token in the context window. The vectors in the context window become the inputs to the neural network, with the token that follows as the output. The neural network is thereby trained to predict the next sensible word that might reasonably follow from the token vectors in the context window.
When that input-output coupling is embedded into the behaviour of the network, the program shifts the context by one token to establish a new input-output pairing. Whatever the dimensions of the semantic encoding deployed in the NLP training model, the positional encoding vector needs to have the same dimensions. The two sets of vectors are added arithmetically.
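The shifting context window and the arithmetic addition of the two vectors might look something like the following. It is a minimal sketch: the function names `training_pairs` and `add_vectors`, the sample sentence, and the 4-dimensional numbers are assumptions chosen for illustration, not taken from the programs in the earlier posts.

```python
def training_pairs(tokens, window=6):
    """Slide a context window along `tokens`, one token at a time.

    Each pair is (context tokens, the token that follows them).
    """
    return [
        (tokens[i:i + window], tokens[i + window])
        for i in range(len(tokens) - window)
    ]

def add_vectors(semantic, positional):
    """Element-wise sum; both vectors must have the same number of dimensions."""
    if len(semantic) != len(positional):
        raise ValueError("semantic and positional vectors differ in length")
    return [s + p for s, p in zip(semantic, positional)]

sentence = "time and tide wait for no one at all".split()
for context, target in training_pairs(sentence, window=6):
    print(context, "->", target)

# An invented 4-dimensional semantic vector plus the positional vector for position 0.
print(add_vectors([0.5, -0.25, 0.125, 0.25], [0.0, 1.0, 0.0, 1.0]))
```

With nine tokens and a window of six, the loop prints three input-output pairings, each shifted one token along from the last.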
Positional encoding and cycles
It’s simple enough to explain the derivation of the positional encoding vectors. The mysterious element is an account of why they should work to encode information about token positions sufficient to exert an influence on training a neural network to reproduce and predict token sequences, i.e., produce fully-formed sentences as output from NLP systems, or other sequences such as in route planning.
In the graph below I show a series of sine and cosine curves plotted across a range of 0-9 grid units along a horizontal axis. The vertical axis shows the values along the curves ranging from -1 to +1. The curves that originate from +1 are all cosine curves. The curves starting at zero are sine curves.
The horizontal units 0-9 represent positions across a context window of 10 tokens. There are 10 curves. Each curve is attributed to a different dimension in a 10-dimensional positional encoding vector. The values in that vector alternate between the sine and cosine curve values at their sequence position in the context window.
So at position 0, the first position in the context window, the positional vector is <0, 1, 0, 1, 0, 1, 0, 1, 0, 1>. The values for the other positions can be read off the graph in the same way, e.g. at position 1, alternating between the sine and cosine values and starting with the curves with the shortest period (all the curves share the same amplitude). The values are of course calculated from the formulas in the program. I show this process in more detail in the post Words in order.
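For readers who want to reproduce the numbers, here is a minimal sketch of the calculation. It assumes the sinusoidal scheme from the “Attention Is All You Need” paper, with the smaller base of 10 that I used for the plots. The function name `positional_encoding` and its parameters are illustrative, and the sketch assumes an even number of dimensions.

```python
import math

def positional_encoding(window, d, base=10000.0):
    """Return `window` vectors, each holding `d` sinusoidal values.

    Even coordinates 2i hold sin(pos / base**(2i/d)); odd coordinates
    2i+1 hold the matching cosine. Assumes d is even.
    """
    table = []
    for pos in range(window):
        vector = []
        for i in range(d // 2):
            angle = pos / base ** (2 * i / d)
            vector += [math.sin(angle), math.cos(angle)]
        table.append(vector)
    return table

pe = positional_encoding(window=10, d=10, base=10.0)
print(pe[0])                          # position 0: every sine is 0, every cosine is 1
print([round(v, 3) for v in pe[1]])   # position 1, rounded for readability
```

The first pair of coordinates follows the fastest curves and later pairs follow progressively slower ones, which matches the order in which the values are read off the graph.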
Another way to visualise the distribution of these values is via a heat map. In the following chart, the 10 positions in the sequence are along the horizontal axis, and the rows are the positions in the positional encoding (embedding) vectors. The legend on the right shows the colours corresponding to each value in the positional vector.
The pixel values on this graph are the actual positional encoding values (-1 to +1) at any position along the context window, and any coordinate in the encoding vector at that position. This graph shows a 10-dimensional positional encoding vector.
Differences across the context window
The heat map indicates some interesting properties of the distribution of values. The right side of the graph shows the most variation amongst the coordinates of the encoding vector. Furthermore, the variation is most pronounced amongst the low coordinates of the vector (towards the top of the plot). Privileging the values towards the top right corner of the data space makes sense: it accords greater importance to variation amongst the tokens closest to the most recent token in the prediction sequence, and to the values at the lower coordinates in the positional vector.
The patterns become more interesting with larger vectors and longer context windows. Here are the curves for positional vectors of 100 dimensions and a context window of 100 tokens.
Here is the corresponding heat map.
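The formulas for the calculation of the sine and cosine values are

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

I’ve been following the positional encoding method used in the seminal “Attention Is All You Need” paper that introduced the Transformer model of NLP, and this is the formula the authors used.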
The variable pos is the position of a token along the context window (0-9 in my example). The variable PE is the positional encoding value. The variable d is the depth (or dimensionality) of the encodings, which is essentially the number of dimensions in the output vectors from the model. I’ll paste the rest of the variable explanations from ChatGPT, as I can’t easily display formulas in WordPress.
ChatGPT helps explain that the purpose of the term $10000^{2i/d}$ is to create a distinct encoding for each position in the context window. As i increases (i.e., for higher dimensions in the encoding), the period of the sine and cosine functions lengthens, so the curves oscillate more slowly. This results in a unique combination of sine and cosine values for each position across the encoding dimensions, allowing the model to differentiate between positions in a sequence. The use of a large base like 10,000 ensures that even for long sequences, the positional encodings remain distinct. I used a smaller base term of 10 in my plots as it makes the plots clearer.
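The effect of the base can be checked by computing how many token positions each curve takes to complete one cycle. The sketch below assumes the standard sinusoidal scheme, in which the curve pair numbered i has the angle pos / base^(2i/d) and therefore a period of 2π · base^(2i/d) positions. The function name `wavelengths` is my own label for the illustration.

```python
import math

def wavelengths(d, base):
    """Period, in token positions, of each sine/cosine pair i = 0 .. d/2 - 1."""
    return [2 * math.pi * base ** (2 * i / d) for i in range(d // 2)]

# Compare the small base used for the plots with the base from the paper.
for base in (10.0, 10000.0):
    print(base, [round(w, 1) for w in wavelengths(10, base)])
```

In both cases the fastest curve repeats roughly every 6.3 positions. The larger base stretches the slowest curves over far more positions, so their values do not come back around within a long sequence.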
I plotted these curves and heat maps via programming in Python, with help from ChatGPT. As someone interested in form and space I find the visualisation of the resulting interference patterns intriguing. But how do the formulae work to instil sequential information into NLP training data, sufficient for a neural network to capture and reproduce sequential patterns?
Why use curves?
The simple answer is that the method seems to work empirically, after much testing by NLP experts. My own modest foray into positional encoding, as illustrated in the previous posts, confirms this practical point. If I take out the positional encoding from the calculations then the resultant sequential predictions seem random.
According to NLP researchers (e.g. in the seminal “Attention Is All You Need” paper) the logic of the procedure lies in the fact that the sinusoidal function’s periodic nature provides information for the training model to identify the relative position of elements, even for positions not encountered during training. The model can interpolate or extrapolate positional information.
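The point about relative position can be made concrete with a standard trigonometric identity. Writing $\omega_i = 1/10000^{2i/d}$ for the frequency of one sine/cosine pair (the symbol $\omega_i$ is my shorthand, not notation from the paper), the angle-sum formulas give, for any fixed offset k:

$$\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}$$

The matrix depends only on the offset k, not on pos. So the encoding of a position k steps along is a fixed linear transformation (a rotation) of the encoding at the current position, which is the property the authors hypothesised would make it easy for a model to learn to attend by relative position.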
These positional encodings enable the training model to latch on to patterns in the relative positions of tokens. By having more pronounced variations in the encodings for recent tokens and the lower dimensions of the encodings, the model can prioritize or give more attention to these recent tokens. That seems to be crucial for sequential tasks.
It helps sequential training that the positional vectors are unique for each position in a context window, with vector values in the range -1 to +1.
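Both properties are easy to confirm numerically for the 10-position, 10-dimensional example. The sketch below again assumes the sinusoidal scheme with my plotting base of 10. The helper `positional_vector` is a hypothetical name, and rounding to six decimal places before comparing is an arbitrary choice to avoid floating point noise.

```python
import math

def positional_vector(pos, d, base=10.0):
    """Sinusoidal encoding for one position: alternating sine and cosine values."""
    vector = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        vector += [math.sin(angle), math.cos(angle)]
    return vector

vectors = [tuple(round(v, 6) for v in positional_vector(pos, 10)) for pos in range(10)]

all_unique = len(set(vectors)) == len(vectors)
in_range = all(-1.0 <= v <= 1.0 for vec in vectors for v in vec)
print(all_unique, in_range)
```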
It’s worth noting that positional encodings on their own are insufficient. Semantic encodings are needed as an efficient and “dense” means of mapping tokens to neural network inputs. Tokens with similar contexts end up having similar vector representations. The vector inputs already provide important information about word proximities in the overall corpus, and this information is enhanced by the arithmetical addition of semantic and positional vector encodings.
Having now attempted to explain positional encoding in terms of periodicity, we can dispose of any idea that to train a neural network on sequencing we simply provide an index number for each position in a context window and somehow add that in to the semantic encodings. Sequences, language, and human affairs are best grasped by appropriating their cyclical nature. Lefebvre provides an account of the antagonism between the linear and the cyclical in an extension of the quote I gave above.
Great cyclical rhythms last for a period and restart: dawn, always new, often superb, inaugurates the return of the everyday. The antagonistic unity of relations between the cyclical and the linear sometimes gives rise to compromises, sometimes to disturbances. The circular course of the hands on (traditional) clock-faces and watches is accompanied by a linear tick-tock. And it is their relation that enables or rather constitutes the measure of time (which is to say, of rhythms).
- Coyne, R. (2010). The Tuning of Place: Sociable Spaces and Pervasive Digital Media. Cambridge, MA: MIT Press.
- Lefebvre, H. (2004). Rhythmanalysis: Space, Time and Everyday Life. London: Continuum.
- Saeed, M. (2023). “A Gentle Introduction to Positional Encoding in Transformer Models, Part 1.” Machine Learning Mastery 6 January. Retrieved 30 October 2023, from https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/.
- Vaswani, A., N. Shazeer, et al. (2017). Attention Is All You Need. 31st Conference on Neural Information Processing Systems. Long Beach, CA, USA: pp. 1-15.