Transformer language models: These models use a type of neural network architecture called a transformer to make predictions. They are particularly effective at handling long sequences of words and are often used in natural language processing tasks such as machine translation and text summarization.

LSTMs and GRUs have a much smaller window of past context to draw on when learning than transformers, which can attend to every position in the input at once. This is why the paper that introduced transformers, “Attention Is All You Need”, is titled the way it is.


Input Embeddings

This is the same as any other word embedding: it simply looks up the learned representation of each word in relation to the other words in the vocabulary.
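
As a rough sketch (the vocabulary size, token ids, and sentence below are made-up values; 512 is the embedding size used in the original paper), the input embedding is just a learned lookup table from token ids to dense vectors:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a 10,000-word vocabulary and 512-dimensional embeddings
# (512 is the d_model used in "Attention Is All You Need").
vocab_size, d_model = 10_000, 512
embedding = nn.Embedding(vocab_size, d_model)

# A made-up batch of token ids for a 3-word sentence.
token_ids = torch.tensor([[5, 213, 987]])

word_embeddings = embedding(token_ids)
print(word_embeddings.shape)  # torch.Size([1, 3, 512])
```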

Positional Encoding

However, no positional information about where the words occur is encoded yet. In an RNN this comes for free through recurrence; in a transformer we have to add positional information explicitly, since the architecture is not recurrent.

This is done by creating a set of positional encodings, vectors of the same dimension as the embeddings, which are added to the embedding of each word in the input. The positional encodings are computed with sine and cosine functions whose input is the position of the word in the input sequence.
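
Concretely, for position pos and embedding dimension index i (with d_model the embedding size), the formulation from “Attention Is All You Need” is:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$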


Sine and cosine are chosen because they are periodic functions with a useful linear property: for any fixed offset k, the encoding of position pos + k can be expressed as a linear function of the encoding of position pos, which makes it easy for the model to attend to relative positions. Both functions have a period of 2π, so they encode the position of a word in a cyclical manner, which is a suitable way to represent position in the input. Essentially, the model learns to recognize the unique patterns of the sine and cosine values and uses that information to understand the position of each token in the sequence.

Basically, you add a vector to the embedding of each word, representing its position in the sentence. To create these positional encodings, every even index of the encoding vector is filled using the sine function and every odd index using the cosine function, and the resulting vectors are added to the corresponding input embeddings.
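
A minimal NumPy sketch of this scheme (the sequence length, embedding size, and random embeddings below are made-up stand-ins for real model values):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sine on even indices, cosine on odd ones."""
    positions = np.arange(seq_len)[:, np.newaxis]              # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                   # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even indices -> sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd indices  -> cosine
    return pe

# Toy example: 3 tokens with 8-dimensional embeddings.
embeddings = np.random.randn(3, 8)
encoded = embeddings + positional_encoding(3, 8)   # element-wise addition
```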

This allows the model to combine information about the word itself with information about its position in the input sequence. This is useful because the meaning of a word can depend on its position in the sentence. The addition operation allows the model to learn this relationship between position and meaning in a simple and interpretable way.

Additionally, addition is commutative, meaning the order of the summands does not matter: the embedding and the positional encoding can be combined in either order, which keeps the operation simple and less prone to errors.

Other mathematical operations, such as subtraction or multiplication, could also be used to combine the embeddings, but they would not be as interpretable in this context and might not be as effective at learning the relationship between a word and its position in the sentence.

Example of positional encoding

The scheme for positional encoding has a number of advantages.

  1. The sine and cosine functions have values in [-1, 1], which keeps the values of the positional encoding matrix in a normalized range.
  2. As the sinusoid for each position is different, you have a unique way of encoding each position.
  3. You have a way of measuring or quantifying the similarity between different positions, hence enabling you to encode the relative positions of words (see the sketch after this list).
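
For the third point, a small sketch (sizes here are arbitrary) that reuses the encoding function from above: the dot product between positional encodings tends to be larger for nearby positions than for distant ones, which gives the model a notion of relative distance.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Same sinusoidal scheme as in the earlier sketch.
    positions = np.arange(seq_len)[:, np.newaxis]
    dims = np.arange(d_model)[np.newaxis, :]
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

pe = positional_encoding(50, 64)   # 50 positions, 64 dimensions (made-up sizes)

# Dot-product similarity between position 10 and a few other positions:
# it tends to shrink as the distance grows.
for other in (11, 15, 40):
    print(other, np.dot(pe[10], pe[other]))
```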

Encoder Layer

Once we've added positional encoding to our input embeddings, the next step is to pass them through the encoder side of the transformer. The encoder is responsible for "encoding" the input by processing it through a stack of layers, each combining self-attention with a feed-forward neural network. The self-attention mechanism allows the model to weigh certain parts of the input more heavily than others, based on how relevant they are to the token currently being encoded. This is what allows the transformer to effectively handle input of varying lengths and understand the relationships between different parts of the input. By the time the input has passed through the encoder, it has been transformed into a dense, high-dimensional representation that the decoder can use to generate output.
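
A hedged sketch of this stage using PyTorch's built-in encoder modules (the layer sizes below follow the base configuration of the original paper; the input tensor is a random stand-in for embedded, position-encoded tokens):

```python
import torch
import torch.nn as nn

# Base configuration from "Attention Is All You Need".
d_model, nhead, dim_feedforward, num_layers = 512, 8, 2048, 6

# One encoder layer = multi-head self-attention + a feed-forward network,
# each wrapped with residual connections and layer normalization.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=nhead,
    dim_feedforward=dim_feedforward,
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# A made-up batch: one sentence of 3 tokens, already embedded and position-encoded.
x = torch.randn(1, 3, d_model)
memory = encoder(x)            # same shape (1, 3, 512), now contextualized
print(memory.shape)
```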