Transformers from Scratch!

by Peter Bloem - 18 Aug 2019


Transformers are a very exciting family of machine learning architectures. Many good tutorials exist (e.g. [1, 2]) but in the last few years, transformers have mostly become simpler, so that it is now much more straightforward to explain how modern architectures work. This post is an attempt to explain directly how modern transformers work, and why, without some of the historical baggage.

I will assume a basic understanding of neural networks and backpropagation. If you’d like to brush up, this lecture will give you the basics of neural networks and this one will explain how these principles are applied in modern deep learning systems.

A working knowledge of Pytorch is required to understand the programming examples, but these can also be safely skipped.


Self-attention


The fundamental operation of any transformer architecture is the self-attention operation. We'll explain where the name "self-attention" comes from later. For now, don't read too much into it.

Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. Let’s call the input vectors \(\x_1, \x_2, \ldots, \x_t\) and the corresponding output vectors \(\y_1, \y_2, \ldots, \y_t\). The vectors all have dimension \(k\). To produce output vector \(\y_\rc{i}\), the self attention operation simply takes a weighted average over all the input vectors

$$ \y_\rc{i} = \sum_{\gc{j}} w_{\rc{i}\gc{j}} \x_\gc{j} \p $$

Where \(\gc{j}\) indexes over the whole sequence and the weights sum to one over all \(\gc{j}\). The weight \(w_{\rc{i}\gc{j}}\) is not a parameter, as in a normal neural net, but it is derived from a function over \(\x_\rc{i}\) and \(\x_\gc{j}\). The simplest option for this function is the dot product:

$$ w'_{\rc{i}\gc{j}} = {\x_\rc{i}}^T\x_\gc{j} \p $$

Note that \(\x_\rc{i}\) is the input vector at the same position as the current output vector \(\y_\rc{i}\). For the next output vector, we get an entirely new series of dot products, and a different weighted sum. The dot product gives us a value anywhere between negative and positive infinity, so we apply a softmax to map the values to \([0, 1]\) and to ensure that they sum to 1 over the whole sequence:

$$ w_{\rc{i}\gc{j}} = \frac{\text{exp } w'_{\rc{i}\gc{j}}}{\sum_\gc{j} \text{exp }w'_{\rc{i}\gc{j}}} \p $$

And that’s the basic operation of self attention.

A visual illustration of basic self-attention. Note that the softmax operation over the weights is not illustrated.

A few other ingredients are needed for a complete transformer, which we’ll discuss later, but this is the fundamental operation. More importantly, this is the only operation in the whole architecture that propagates information between vectors. Every other operation in the transformer is applied to each vector in the input sequence without interactions between vectors.


Understanding why self-attention works


Despite its simplicity, it’s not immediately obvious why self-attention should work so well. To build up some intuition, let’s look first at the standard approach to movie recommendation.

Let’s say you run a movie rental business and you have some movies, and some users, and you would like to recommend movies to your users that they are likely to enjoy.

One way to go about this is to create manual features for your movies, such as how much romance there is in the movie, and how much action, and then to design corresponding features for your users: how much they enjoy romantic movies and how much they enjoy action-based movies. If you did this, the dot product between the two feature vectors would give you a score for how well the attributes of the movie match what the user enjoys.

If the signs of a feature match for the user and the movie—the movie is romantic and the user loves romance or the movie is unromantic and the user hates romance—then the resulting dot product gets a positive term for that feature. If the signs don’t match—the movie is romantic and the user hates romance or vice versa—the corresponding term is negative.

Furthermore, the magnitudes of the features indicate how much the feature should contribute to the total score: a movie may be a little romantic, but not in a noticeable way, or a user may simply prefer no romance, but be largely ambivalent.

Of course, gathering such features is not practical. Annotating a database of millions of movies is very costly, and annotating users with their likes and dislikes is pretty much impossible.

What happens instead is that we make the movie features and user features parameters of the model. We then ask users for a small number of movies that they like and we optimize the user features and movie features so that their dot product matches the known likes.

Even though we don’t tell the model what any of the features should mean, in practice, it turns out that after training the features do actually reflect meaningful semantics about the movie content.

The first two learned features from a basic matrix factorization model. The model had no access to any information about the content of the movies, only which users liked them. Note that movies are arranged from low-brow to high-brow horizontally, and from mainstream to quirky vertically. From [4].
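
To make this concrete, here is a minimal sketch of such a model in Pytorch. The sizes and the random "likes" below are stand-ins chosen purely for illustration, not real data:

import torch

n_users, n_movies, n_features = 100, 500, 2           # toy sizes
user_feats  = torch.randn(n_users,  n_features, requires_grad=True)
movie_feats = torch.randn(n_movies, n_features, requires_grad=True)

# known likes: (user, movie) pairs with a score we want the dot product to match
users  = torch.randint(n_users,  (1000,))
movies = torch.randint(n_movies, (1000,))
scores = torch.randn(1000)                             # stand-in for real ratings

opt = torch.optim.Adam([user_feats, movie_feats], lr=0.01)
for _ in range(100):
    opt.zero_grad()
    pred = (user_feats[users] * movie_feats[movies]).sum(dim=1)  # dot products
    loss = ((pred - scores) ** 2).mean()
    loss.backward()
    opt.step()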

See this lecture for more details on recommender systems. For now, this suffices as an explanation of how the dot product helps us to represent objects and their relations.

This is the basic principle at work in the self-attention. Let’s say we are faced with a sequence of words. To apply self-attention, we simply assign each word \(\bc{t}\) in our vocabulary an embedding vector \(\v_\bc{t}\) (the values of which we’ll learn). This is what’s known as an embedding layer in sequence modeling. It turns the word sequence \(\bc{\text{the}}, \bc{\text{cat}}, \bc{\text{walks}}, \bc{\text{on}}, \bc{\text{the}}, \bc{\text{street}}\) into the vector sequence

$$\v_\bc{\text{the}}, \v_\bc{\text{cat}}, \v_\bc{\text{walks}}, \v_\bc{\text{on}}, \v_\bc{\text{the}}, \v_\bc{\text{street}} \p $$

If we feed this sequence into a self-attention layer, the output is another sequence of vectors

$$\y_\bc{\text{the}}, \y_\bc{\text{cat}}, \y_\bc{\text{walks}}, \y_\bc{\text{on}}, \y_\bc{\text{the}}, \y_\bc{\text{street}} $$

where \(\y_\bc{\text{cat}}\) is a weighted sum over all the embedding vectors in the first sequence, weighted by their (normalized) dot-product with \(\v_\bc{\text{cat}}\).

Since we are learning what the values in \(\v_\bc{t}\) should be, how "related" two words are is entirely determined by the task. In most cases, the definite article the is not very relevant to the interpretation of the other words in the sentence; therefore, we will likely end up with an embedding \(\v_\bc{\text{the}}\) that has a low or negative dot product with all other words. On the other hand, to interpret what walks means in this sentence, it's very helpful to work out who is doing the walking. This is likely expressed by a noun, so for nouns like cat and verbs like walks, we will likely learn embeddings \(\v_\bc{\text{cat}}\) and \(\v_\bc{\text{walks}}\) that have a high, positive dot product together.

This is the basic intuition behind self-attention. The dot product expresses how related two vectors in the input sequence are, with “related” defined by the learning task, and the output vectors are weighted sums over the whole input sequence, with the weights determined by these dot products.

Before we move on, it’s worthwhile to note the following properties, which are unusual for a sequence-to-sequence operation:

  • There are no parameters (yet). What the basic self-attention actually does is entirely determined by whatever mechanism creates the input sequence. Upstream mechanisms, like an embedding layer, drive the self-attention by learning representations with particular dot products (although we’ll add a few parameters later).
  • Self attention sees its input as a set, not a sequence. If we permute the input sequence, the output sequence will be exactly the same, except permuted also (i.e. self-attention is permutation equivariant). We will mitigate this somewhat when we build the full transformer, but the self-attention by itself actually ignores the sequential nature of the input.

In Pytorch: basic self-attention


What I cannot create, I do not understand, as Feynman said. So we’ll build a simple transformer as we go along. We’ll start by implementing this basic self-attention operation in Pytorch.

The first thing we should do is work out how to express the self attention in matrix multiplications. A naive implementation that loops over all vectors to compute the weights and outputs would be much too slow.

We represent the input, a sequence of \(t\) vectors of dimension \(k\), as a \(t\) by \(k\) matrix \(\X\). Including a minibatch dimension \(b\) gives us an input tensor of size \((b, t, k)\). The set of all raw dot products \(w'_{\rc{i}\gc{j}}\) forms a matrix, which we can compute simply by multiplying \(\X\) by its transpose:

import torch
import torch.nn.functional as F

# assume we have some tensor x with size (b, t, k)
x = ...
raw_weights = torch.bmm(x, x.transpose(1, 2))
# - torch.bmm is a batched matrix multiplication. It
#   applies matrix multiplication over batches of
#   matrices.

Then, to turn the raw weights \(w'_{\rc{i}\gc{j}}\) into positive values that sum to one, we apply a row-wise softmax:

weights = F.softmax(raw_weights, dim=2)

Finally, to compute the output sequence, we just multiply the weight matrix by \(\X\). This results in a batch of output matrices \(\Y\) of size (b, t, k) whose rows are weighted sums over the rows of \(\X\).

y = torch.bmm(weights, x)

That’s all. Two matrix multiplications and one softmax give us a basic self-attention.
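
As a quick sanity check, and to illustrate the permutation equivariance noted earlier, here is a minimal self-contained sketch (sizes chosen arbitrarily) that runs these three lines on a random input and verifies that permuting the input simply permutes the output:

import torch
import torch.nn.functional as F

def basic_self_attention(x):
    raw_weights = torch.bmm(x, x.transpose(1, 2))
    weights = F.softmax(raw_weights, dim=2)
    return torch.bmm(weights, x)

b, t, k = 2, 5, 8                      # arbitrary batch size, sequence length, dimension
x = torch.randn(b, t, k)
y = basic_self_attention(x)

perm = torch.randperm(t)               # permute the input sequence
y_perm = basic_self_attention(x[:, perm, :])
print(torch.allclose(y[:, perm, :], y_perm, atol=1e-6))  # True: the output is permuted in the same way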


Additional tricks


The actual self-attention used in modern transformers relies on three additional tricks.

1) Queries, keys and values

Every input vector \(\x_\rc{i}\) is used in three different ways in the self attention operation:

  • It is compared to every other vector to establish the weights for its own output \(\y_\rc{i}\)
  • It is compared to every other vector to establish the weights for the output of the \(\gc{j}\)-th vector \(\y_\gc{j}\)
  • It is used as part of the weighted sum to compute each output vector once the weights have been established.

These roles are often called the query, the key and the value (we'll explain where these names come from later). In the basic self-attention we've seen so far, each input vector must play all three roles. We make its life a little easier by deriving new vectors for each role, by applying a linear transformation to the original input vector. In other words, we add three \(k \times k\) weight matrices \(\W_q\), \(\W_k\), \(\W_v\) and compute three linear transformations of each \(\x_\rc{i}\), for the three different parts of the self attention:

$$ \begin{align*} \q_\rc{i} &= \W_q\x_\rc{i} & \k_\rc{i} &= \W_k\x_\rc{i} & \v_\rc{i} &= \W_v\x_\rc{i} \end{align*} $$

$$ \begin{align*} w'_{\rc{i}\gc{j}} &= {\q_\rc{i}}^T\k_\gc{j} \\ w_{\rc{i}\gc{j}} &= \text{softmax}(w'_{\rc{i}\gc{j}})\\ \y_\rc{i} &= \sum_\gc{j} w_{\rc{i}\gc{j}} \v_\gc{j}\p\\ \end{align*} $$

This gives the self-attention layer some controllable parameters, and allows it to modify the incoming vectors to suit the three roles they must play.

Illustration of the self-attention with key, query and value transformations.
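
For concreteness, here is a minimal single-head sketch of these three equations. The sizes and layer names are illustrative only; the full multi-head module comes later:

import torch
from torch import nn
import torch.nn.functional as F

k = 64                                   # embedding dimension (arbitrary)
to_queries = nn.Linear(k, k, bias=False) # W_q
to_keys    = nn.Linear(k, k, bias=False) # W_k
to_values  = nn.Linear(k, k, bias=False) # W_v

x = torch.randn(1, 10, k)                # one sequence of 10 vectors

queries, keys, values = to_queries(x), to_keys(x), to_values(x)
w_prime = torch.bmm(queries, keys.transpose(1, 2))   # w'_ij = q_i^T k_j
w = F.softmax(w_prime, dim=2)                        # normalize over j
y = torch.bmm(w, values)                             # y_i = sum_j w_ij v_j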

2) Scaling the dot product

The softmax function can be sensitive to very large input values. These kill the gradient, and slow down learning, or cause it to stop altogether. Since the average value of the dot product grows with the embedding dimension \(k\), it helps to scale the dot product back a little to stop the inputs to the softmax function from growing too large:

$$ w'_{\rc{i}\gc{j}} = \frac{{\q_\rc{i}}^T\k_\gc{j}}{\sqrt{k}} $$

Why \(\sqrt{k}\)? Imagine a vector in \({\mathbb R^k}\) with values all \(c\). Its Euclidean length is \(\sqrt{k}c\). Therefore, we are dividing out the amount by which the increase in dimension increases the length of the average vectors.
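
A quick numerical check of this effect (sample sizes chosen arbitrarily): the spread of dot products between random vectors grows with \(\sqrt{k}\), and dividing by \(\sqrt{k}\) keeps the inputs to the softmax at a roughly constant scale.

import torch

for k in (16, 256, 4096):
    a, b = torch.randn(10000, k), torch.randn(10000, k)
    dots = (a * b).sum(dim=1)
    print(k, dots.std().item(), (dots / k ** 0.5).std().item())
# the unscaled standard deviation grows roughly as sqrt(k); the scaled one stays near 1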

3) Multi-head attention

Finally, we must account for the fact that a word can mean different things to different neighbours. Consider the following example.

\[\bc{\text{mary}}, \bc{\text{gave}}, \bc{\text{roses}}, \bc{\text{to}}, \bc{\text{susan}}\]

We see that the word gave has different relations to different parts of the sentence. mary expresses who’s doing the giving, roses expresses what’s being given, and susan expresses who the recipient is.

In a single self-attention operation, all this information just gets summed together. The inputs $\x_\bc{\text{mary}}$ and $\x_\bc{\text{susan}}$ can influence the output $\y_\bc{\text{gave}}$ by different amounts, depending on their dot-product with $\x_\bc{\text{gave}}$, but they can’t influence it in different ways. If, for instance, we want the information about who gave the roses and who received them to end up in different parts of $\y_\bc{\text{gave}}$, we need a little more flexibility.

This leaves aside how we figure out who gave the roses. We can do that based on prior knowledge about Mary and Susan, encoded in the embeddings. We can also look at the order of the words, but we'll look at how to achieve that later.

We can give the self attention greater power of discrimination by combining several self-attention mechanisms (which we'll index with \(\bc{r}\)), each with different matrices \(\W_q^\bc{r}\), \(\W_k^\bc{r}\), \(\W_v^\bc{r}\). These are called attention heads.

For input \(\x_\rc{i}\) each attention head produces a different output vector \(\y_\rc{i}^\bc{r}\). We concatenate these, and pass them through a linear transformation to reduce the dimension back to \(k\).

Efficient multi-head self-attention.

The simplest way to understand multi-head self-attention is to see it as a small number of copies of the self-attention mechanism applied in parallel, each with their own key, value and query transformation. This works well, but for \(h\) heads, the self-attention operation is \(h\) times as slow.

It turns out we can have our cake and eat it too: there is a way to implement multi-head self-attention so that it is roughly as fast as the single-head version, but we still get the benefit of having different self-attention operations in parallel. To accomplish this, each head receives low-dimensional keys, queries and values. If the input vector has $k=256$ dimensions, and we have $h=4$ attention heads, we multiply the input vectors by a $256 \times 64$ matrix to project them down to a sequence of 64-dimensional vectors. For every head, we do this 3 times: for the keys, the queries and the values.

Here is the whole process illustrated in one image.

The basic idea of multi-head self-attention with 4 heads. To get our keys, queries and values, we project the input down to vector sequences of smaller dimension.

This requires $3h$ matrices of size $k$ by $k/h$. In total, this gives us $3hk\frac{k}{h} = 3k^2$ parameters to compute the inputs to the multi-head self-attention: the same as we had for the single-head self-attention.

The only difference is the matrix $W_o$, used at the end of the multi-head self attention. This adds $k^2$ parameters compared to the single-head version. In most transformers, the first thing that happens after each self attention is a feed-forward layer, so this may not be strictly necessary. I've never seen a proper ablation to test whether $W_o$ can be removed.

We can even implement this with just three $k \times k$ matrix multiplications as in the single-head self-attention. The only extra operation we need is to slice the resulting sequence of vectors into chunks.

To compute multi-head attention efficiently, we combine the computation of the projections down to a lower dimensional representation and the computations of the keys, queries and values into three $k \times k$ matrices.


In Pytorch: complete self-attention


Let’s now implement a self-attention module with all the bells and whistles. We’ll package it into a Pytorch module, so we can reuse it later.

import torch
from torch import nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, k, heads=4, mask=False):
        super().__init__()
        assert k % heads == 0
        self.k, self.heads = k, heads

        # These compute the queries, keys and values for all heads
        self.tokeys    = nn.Linear(k, k, bias=False)
        self.toqueries = nn.Linear(k, k, bias=False)
        self.tovalues  = nn.Linear(k, k, bias=False)

        # This will be applied after the multi-head self-attention operation.
        self.unifyheads = nn.Linear(k, k)

We can now implement the computation of the self-attention (the module’s forward function). First, we compute the queries, keys and values for all heads:

    def forward(self, x):
        b, t, k = x.size()
        h = self.heads
        queries = self.toqueries(x)
        keys    = self.tokeys(x)
        values  = self.tovalues(x)

This gives us three vector sequences of the full embedding dimension k. As we saw above, we can now cut these into h chunks. We can do this with a simple view operation:

        s = k // h
        keys    = keys.view(b, t, h, s)
        queries = queries.view(b, t, h, s)
        values  = values.view(b, t, h, s)

This simply reshapes the tensors to add a dimension that iterates over the heads. For a single vector in our sequence, you can think of it as reshaping a vector of dimension k into a matrix of h by k//h.

Next, we need to compute the dot products. This is the same operation for every head, so we fold the heads into the batch dimension. This ensures that we can use torch.bmm() as before, and the whole collection of keys, queries and values will just be seen as a slightly larger batch. Since the head and batch dimension are not next to each other, we need to transpose before we reshape. (This is costly, but it seems to be unavoidable.)

        # - fold heads into the batch dimension
        keys = keys.transpose(1, 2).contiguous().view(b * h, t, s)
        queries = queries.transpose(1, 2).contiguous().view(b * h, t, s)
        values = values.transpose(1, 2).contiguous().view(b * h, t, s)

You can avoid these calls to contiguous() by using reshape() instead of view() but I prefer to make it explicit when we are copying a tensor, and when we are just viewing it. See this notebook for an explanation of the difference.
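
A tiny illustration of the difference (shapes chosen arbitrarily):

a = torch.randn(4, 6, 8)
b = a.transpose(1, 2)            # same data, different strides: b is not contiguous
# b.view(4, 48)                  # this would raise a RuntimeError
c = b.contiguous().view(4, 48)   # explicit copy, then a cheap view
d = b.reshape(4, 48)             # copies behind the scenes only when it has to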

As before, the dot products can be computed in a single matrix multiplication, but now between the queries and the keys.

        # Get dot product of queries and keys, and scale
        dot = torch.bmm(queries, keys.transpose(1, 2))
        # -- dot has size (b*h, t, t) containing raw weights
        # scale the dot product
        dot = dot / (s ** (1/2))
        # normalize
        dot = F.softmax(dot, dim=2)
        # - dot now contains row-wise normalized weights

We apply the self-attention weights dot to the values, giving us the output for each attention head:

        # apply the self attention to the values
        out = torch.bmm(dot, values).view(b, h, t, s)

To unify the attention heads, we transpose again, so that the head dimension and the embedding dimension are next to each other, and reshape to get concatenated vectors of dimension s * h = k. We then pass these through the unifyheads layer for a final projection.

        # swap h, t back, unify heads
        out = out.transpose(1, 2).contiguous().view(b, t, s * h)
        return self.unifyheads(out)

And there you have it: multi-head, scaled dot-product self attention. You can see the complete implementation here.

The implementation can be made more concise using einsum notation (see an example here).
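
For instance, the transpose-and-fold steps above could be replaced by a sketch like the following. This is one possible einsum formulation, not necessarily the one in the linked example:

# queries, keys and values have size (b, t, h, s), as after the .view() calls above
dot = torch.einsum('bqhd,bkhd->bhqk', queries, keys) / (s ** 0.5)
dot = F.softmax(dot, dim=-1)                                      # normalize over the key positions
out = torch.einsum('bhqk,bkhd->bqhd', dot, values).reshape(b, t, h * s)
# followed by the unifyheads projection as before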


Building transformers


A transformer is not just a self-attention layer, it is an architecture. It’s not quite clear what does and doesn’t qualify as a transformer, but here we’ll use the following definition:

Any architecture designed to process a connected set of units—such as the tokens in a sequence or the pixels in an image—where the only interaction between units is through self-attention.

As with other mechanisms, like convolutions, a more or less standard approach has emerged for how to build self-attention layers up into a larger network. The first step is to wrap the self-attention into a block that we can repeat.


The transformer block


There are some variations on how to build a basic transformer block, but most of them are structured roughly like this:

Illustration of the basic transformer block.

That is, the block applies, in sequence:

  • a self attention layer,
  • layer normalization,
  • a feed forward layer (a single MLP applied independently to each vector), and
  • another layer normalization.

Residual connections are added around both, before the normalization. The order of the various components is not set in stone; the important thing is to combine self-attention with a local feedforward, and to add normalization and residual connections.

Normalization and residual connections are standard tricks used to help deep neural networks train faster and more accurately. The layer normalization is applied over the embedding dimension only.

Here’s what the transformer block looks like in pytorch.

class TransformerBlock(nn.Module):
    def __init__(self, k, heads):
        super().__init__()

        self.attention = SelfAttention(k, heads=heads)
        self.norm1 = nn.LayerNorm(k)
        self.norm2 = nn.LayerNorm(k)
        self.ff = nn.Sequential(
            nn.Linear(k, 4 * k),
            nn.ReLU(),
            nn.Linear(4 * k, k)
        )

    def forward(self, x):
        attended = self.attention(x)
        x = self.norm1(attended + x)
        fedforward = self.ff(x)
        return self.norm2(fedforward + x)

We’ve made the relatively arbitrary choice of making the hidden layer of the feedforward 4 times as big as the input and output. Smaller values may work as well, and save memory, but it should be bigger than the input/output layers.
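
As a quick usage sketch (the sizes here are arbitrary):

block = TransformerBlock(k=256, heads=4)
x = torch.randn(8, 20, 256)     # a batch of 8 sequences of 20 vectors of dimension 256
y = block(x)
print(y.size())                 # torch.Size([8, 20, 256]): same shape as the input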


Classification transformer


The simplest transformer we can build is a sequence classifier. We’ll use the IMDB sentiment classification dataset: the instances are movie reviews, tokenized into sequences of words, and the classification labels are positive and negative (indicating whether the review was positive or negative about the movie).

The heart of the architecture will simply be a large chain of transformer blocks. All we need to do is work out how to feed it the input sequences, and how to transform the final output sequence into a single classification.

The whole experiment can be found here. We won’t deal with the data wrangling in this blog post. Follow the links in the code to see how the data is loaded and prepared.


Output: producing a classification


The most common way to build a sequence classifier out of sequence-to-sequence layers, is to apply global average pooling to the final output sequence, and to map the result to a softmaxed class vector.
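
A minimal sketch of such an output head. The toprobs name and the use of a log-softmax are assumptions for illustration, not necessarily what the linked experiment does:

toprobs = nn.Linear(k, num_classes)        # hypothetical projection to class scores

# x has size (b, t, k): the output sequence of the final transformer block
x = x.mean(dim=1)                          # global average pooling over the time dimension
probs = F.log_softmax(toprobs(x), dim=1)   # (log-)softmaxed class vector, size (b, num_classes)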
