The Transformer Architecture
You have a sequence of words and you want to produce another sequence of words. Translation, summarization, question answering. The input and output are both variable-length sequences of tokens. A token is the atomic unit the model reads. It might be a whole word, a subword fragment, or a punctuation mark. A tokenizer might split the word "unfortunately" into three pieces: "un", "fortunate", "ly". A fixed tokenizer splits all text into these pieces before the model sees it.
Before 2017, the dominant approach to sequence-to-sequence modeling was recurrent neural networks. An RNN processes tokens one at a time, left to right, updating a hidden state at each step. The hidden state is a single vector that must encode everything the network has seen so far. For short sequences, this works. For long sequences, it fails. By the time the network reaches the end of a paragraph, information about the beginning has been compressed through dozens of sequential nonlinear transformations. Gradients vanish. Context decays. And because each step depends on the previous one, the computation cannot be parallelized.
Bahdanau et al. (2014)[0] introduced attention as an addition to the recurrent architecture. Instead of forcing the entire input through a single bottleneck vector, attention allows the decoder to look directly at every encoder state when producing each output token. The connection between any two positions is direct, regardless of their distance in the sequence.
Vaswani et al. (2017)[1] asked a natural question: if attention is doing the important work, why keep the recurrence? They removed it entirely and built a model from attention, feed-forward networks, and residual connections alone. No recurrence. No convolution. They called it the transformer.
The model is built from five components: embeddings, attention, positional encoding, feed-forward networks, and residual connections. We will go through each one, then assemble them into the full architecture.
Representing Words as Vectors
A neural network can add, multiply, and compare vectors. It cannot do any of these things with raw words. The word "cat" is a symbol. You cannot multiply it by a matrix or compute its distance from "dog." To use a neural network for language, every token must first be converted into a vector of numbers.
Start with a fixed vocabulary. In practice, this might contain 32,000 subword tokens. Each token is assigned an integer index. An embedding matrix \(E \in \mathbb{R}^{|V| \times d}\) stores one \(d\)-dimensional vector per token. Converting a token to its vector is a lookup: the token's index selects a row of \(E\). The matrix is learned during training. The network discovers, through gradient descent, which vector representations are useful for the task.
A sequence of \(n\) tokens becomes a matrix \(X \in \mathbb{R}^{n \times d}\). Row \(i\) of \(X\) is the embedding vector of the \(i\)-th token. The rest of the architecture operates on this matrix.
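As a concrete sketch (toy sizes, with random weights standing in for learned ones), the lookup is a single indexing operation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 100, 8                 # toy values; real models use |V| ~ 32,000, d = 512
E = rng.normal(size=(vocab_size, d))   # embedding matrix, random here in place of learned

token_ids = np.array([5, 17, 5, 42])   # a sequence of n = 4 token indices
X = E[token_ids]                       # row i of X is the embedding of token i
```

The same token id always selects the same row, so repeated tokens share an embedding until attention mixes in context.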
Two numbers define the geometry: \(n\), the sequence length, and \(d\), the model dimension. In the original transformer, \(d = 512\). These symbols persist for the rest of this post.
The fixed vocabulary has a consequence that extends beyond the input. Because the set of possible tokens is finite and known, the model can assign a probability to every one of them. At the output end of the transformer, a linear layer maps each decoder representation to a vector with one entry per token in the vocabulary. Softmax normalizes this vector into a probability distribution. The model produces a number between 0 and 1 for every token, and these numbers sum to 1. A finite vocabulary is what makes this possible.
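The output end can be sketched the same way; the sizes below are toy values, and a random matrix stands in for the learned projection:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, vocab_size = 4, 8, 100            # toy sizes
H = rng.normal(size=(n, d))             # decoder output, one row per position
W_out = rng.normal(size=(d, vocab_size))

logits = H @ W_out                            # one score per vocabulary token
logits -= logits.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax
```

Every row of `probs` is a full distribution over the vocabulary: all entries are between 0 and 1 and each row sums to 1.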
Attention as a Soft Lookup
Each token now has a vector, but these vectors were computed in isolation. The tokens know nothing about each other. Attention is the mechanism that lets them communicate.
Consider a dictionary. You provide a query. The dictionary checks it against stored keys. When a key matches, the dictionary returns the associated value. This is a "hard" lookup: one key matches, one value is returned.
Attention is a soft version of the same operation. The query is compared against every key simultaneously. Instead of a single match, every key contributes to the output, weighted by how closely it matches the query. The result is a weighted sum of all values.
Each token plays all three roles. It produces a query (what information am I looking for?), a key (what information do I contain?), and a value (what do I return when selected?). These are computed by multiplying the token's embedding by three learned weight matrices.
The weights are not designed by hand. They are learned through gradient descent. The network discovers which queries, keys, and values are useful for the task at hand.
Scaled Dot-Product Attention
The previous section described what attention computes. This section describes how. We need to answer three questions: 1) how to measure similarity between a query and a key, 2) how to turn those similarity scores into weights that sum to 1, and 3) how to use those weights to combine values. Each answer is a single operation in linear algebra.
Each token's embedding vector \(\mathbf{x}_i\) contains all the information about that token, but it plays three different roles in attention: as a query (when this token is looking for information), as a key (when other tokens are checking whether this token is relevant), and as a value (when this token's information is being collected). A single vector cannot serve all three purposes well, so we project it into three separate spaces. Given the input matrix \(X \in \mathbb{R}^{n \times d}\), compute queries, keys, and values by multiplying \(X\) with three learned weight matrices:
\[Q = XW^Q, \quad K = XW^K, \quad V = XW^V\]
where \(W^Q, W^K \in \mathbb{R}^{d \times d_k}\) and \(W^V \in \mathbb{R}^{d \times d_v}\). This produces three matrices: \(Q, K \in \mathbb{R}^{n \times d_k}\) and \(V \in \mathbb{R}^{n \times d_v}\). Each has \(n\) rows, one per token. Row \(i\) of \(Q\), written \(\mathbf{q}_i\), is the query vector for token \(i\). Row \(j\) of \(K\), written \(\mathbf{k}_j\), is the key vector for token \(j\). Row \(j\) of \(V\), written \(\mathbf{v}_j\), is the value vector for token \(j\). These vectors are not fixed. They are different for every input sequence because they depend on \(X\), and they change during training because the weight matrices \(W^Q, W^K, W^V\) are updated by gradient descent.
Row \(i\) of \(Q\) is the query vector \(\mathbf{q}_i\). Column \(j\) of \(K^\top\) is the key vector \(\mathbf{k}_j\). Their dot product \(\mathbf{q}_i \cdot \mathbf{k}_j\) fills entry \(S_{ij}\): how much token \(i\) should attend to token \(j\). A large value means the query and key point in similar directions. A small or negative value means they do not. Computing all \(n \times n\) scores at once is a single matrix multiplication:
\[S = QK^\top\]
\(S\) is an \(n \times n\) matrix, and this is the computational bottleneck of the transformer. Every token computes a score against every other token, so both compute and memory scale quadratically with sequence length. A sequence of 1,000 tokens produces 1,000,000 scores. A sequence of 10,000 produces 100,000,000, and so on. This quadratic cost is why long sequences are expensive and why much subsequent research has focused on making attention more efficient.
These scores are then scaled by \(1/\sqrt{d_k}\):
\[S' = \frac{S}{\sqrt{d_k}}\]
Why scale? The dot product of two random vectors with \(d_k\) independent components, each with zero mean and unit variance, has expected value zero and variance \(d_k\). As \(d_k\) grows, the dot products grow in magnitude. Large inputs to softmax produce outputs that are nearly one-hot: almost all the probability mass concentrates on a single element. The gradients of softmax in this saturated regime are extremely small. Training stalls. Dividing by \(\sqrt{d_k}\) normalizes the variance back to 1, keeping the softmax in a regime where gradients flow and multiple tokens can contribute meaningfully to the output.
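The effect is easy to check numerically. Below, dot products of random unit-variance vectors have sample variance close to \(d_k\), and close to 1 after scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 64, 100_000

q = rng.normal(size=(trials, d_k))   # components with zero mean, unit variance
k = rng.normal(size=(trials, d_k))
dots = (q * k).sum(axis=1)           # one dot product per trial

var_raw = dots.var()                      # close to d_k = 64
var_scaled = (dots / np.sqrt(d_k)).var()  # close to 1 after dividing by sqrt(d_k)
```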
Apply softmax row-wise to convert scores into a probability distribution over keys:
\[A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)\]
Each row of \(A \in \mathbb{R}^{n \times n}\) sums to 1. Entry \(A_{ij}\) is the weight that token \(i\) places on token \(j\). The output is the weighted sum of value vectors:
\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]
This is the complete attention function. Three matrix multiplications, a scaling operation, and a softmax. The output \(Z = AV \in \mathbb{R}^{n \times d_v}\) is a new representation of each token that incorporates information from the entire sequence, weighted by learned relevance.
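The whole function fits in a few lines of NumPy. This is a single-sequence sketch with toy dimensions, not a batched implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                     # (n, n) scaled scores
    A = np.exp(S - S.max(axis=-1, keepdims=True))  # row-wise softmax,
    A /= A.sum(axis=-1, keepdims=True)             # stabilized by max-subtraction
    return A @ V, A                                # output and attention weights

rng = np.random.default_rng(0)
n, d, d_k, d_v = 5, 16, 8, 8                       # toy sizes
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, dim)) for dim in (d_k, d_k, d_v))
Z, A = attention(X @ W_Q, X @ W_K, X @ W_V)        # Z is (n, d_v), A is (n, n)
```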
Multi-Head Attention
A single attention head computes one set of weights. It learns one notion of relevance. Perhaps syntactic proximity. Perhaps semantic similarity. Perhaps coreference. But language requires multiple types of relationships simultaneously. A word can be syntactically close to one token, semantically related to another, and coreferent with a third.
Multi-head attention runs \(h\) attention heads in parallel. Each head \(i\) has its own learned projection matrices \(W_i^Q, W_i^K, W_i^V\) and operates on a reduced dimension \(d_k = d / h\). The outputs of all heads are concatenated and projected through a matrix \(W^O \in \mathbb{R}^{hd_v \times d}\):
\[\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O\]
where
\[\text{head}_i = \text{Attention}(XW_i^Q,\; XW_i^K,\; XW_i^V)\]
Each head produces one output vector per token, of length \(d_v = d/h\). Concat places the \(h\) output vectors for each token side by side, forming a single vector of length \(h \cdot d_v = d\). The matrix \(W^O\) then mixes information across heads, projecting this concatenated vector back to dimension \(d\).
With \(d = 512\) and \(h = 8\), each head produces an \(n \times 64\) matrix. Concatenation places all eight side by side into an \(n \times 512\) matrix, and \(W^O \in \mathbb{R}^{512 \times 512}\) projects it back to \(n \times 512\).
The total number of parameters is identical to a single attention head operating on the full dimension \(d\). Each head uses matrices of size \(d \times d_k\) instead of \(d \times d\), and there are \(h\) of them: \(h \cdot d \cdot d_k = h \cdot d \cdot (d/h) = d^2\). Splitting into \(h\) heads costs nothing in parameter count but gains diversity. Different heads learn to attend to different aspects of the input.
In the original transformer, \(d = 512\) and \(h = 8\), giving \(d_k = 64\) per head.
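A sketch of the split-compute-concatenate pattern. Rather than storing \(h\) separate matrices, it uses one full projection per role and reshapes, which is equivalent because the columns of each full matrix are simply partitioned among heads:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d = X.shape
    d_k = d // h
    # project, then split the last dimension into h heads: (h, n, d_k)
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n): one score matrix per head
    Z = softmax(S) @ V                           # (h, n, d_k): one output per head
    # concatenate heads back into (n, d), then mix across heads with W_O
    return Z.transpose(1, 0, 2).reshape(n, d) @ W_O

rng = np.random.default_rng(0)
n, d, h = 4, 32, 8                               # toy sizes (d_k = 4 per head)
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
out = multi_head_attention(rng.normal(size=(n, d)), W_Q, W_K, W_V, W_O, h)
```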
Positional Encoding
Attention has a property that no amount of stacking cures: it is permutation-equivariant. If you shuffle the tokens in the input sequence, the attention output is the same set of vectors, shuffled the same way. The function itself has no notion of which token is first and which is last. The sentences "the cat sat on the mat" and "mat the on sat cat the" produce the same set of attention outputs, rearranged.
Word order matters. Position must be injected explicitly.
Vaswani et al.[1] add a positional encoding \(PE \in \mathbb{R}^{n \times d}\) directly to the input embeddings:
\[X' = X + PE\]
The encoding uses sine and cosine functions at geometrically increasing wavelengths:
\[PE(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right), \quad PE(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)\]
Each pair of dimensions \((2i, 2i+1)\) corresponds to a sinusoid at a different frequency. The wavelengths form a geometric progression from \(2\pi\) to \(10000 \cdot 2\pi\). Low-frequency dimensions encode coarse position. High-frequency dimensions encode fine-grained position.
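A direct translation of the formula (assuming an even model dimension \(d\)):

```python
import numpy as np

def positional_encoding(n, d):
    pos = np.arange(n)[:, None]      # positions 0 .. n-1, as a column
    i = np.arange(0, d, 2)[None, :]  # even dimension indices 0, 2, ..., d-2
    angles = pos / 10000 ** (i / d)  # (n, d/2) table of pos / 10000^(2i/d)
    PE = np.empty((n, d))
    PE[:, 0::2] = np.sin(angles)     # even dimensions get the sine
    PE[:, 1::2] = np.cos(angles)     # odd dimensions get the cosine
    return PE

PE = positional_encoding(50, 512)
```

At position 0 all sines are 0 and all cosines are 1, and every entry stays in \([-1, 1]\), so the encoding never swamps the embeddings it is added to.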
This choice has a useful property. For any fixed offset \(k\), \(PE(\text{pos} + k)\) can be written as a linear transformation of \(PE(\text{pos})\): within each frequency pair, the map is a 2×2 rotation whose angle depends only on \(k\), not on \(\text{pos}\). This means the model can learn to attend to relative positions using a fixed, deterministic encoding. No additional parameters need to be trained.
Residual Connections and Layer Normalization
Two supporting mechanisms remain before we can assemble the full model.
The transformer stacks many identical blocks. Deep networks suffer from a well-known problem: adding more layers can make performance worse, not better. Gradients vanish or explode through long chains of nonlinear transformations. The signal degrades.
Residual connections[2] solve this. Instead of computing \(\text{output} = f(x)\), the network computes:
\[\text{output} = x + f(x)\]
The identity path \(x\) carries gradients directly from the output back to the input, regardless of depth. The sublayer \(f\) only needs to learn the residual: how much to change the representation, not the representation itself. This makes optimization dramatically easier.
Layer normalization[3] stabilizes training. For each token independently, it normalizes the activations across the feature dimension to have zero mean and unit variance, then applies learned scale and shift parameters \(\gamma\) and \(\beta\):
\[\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sigma} + \beta\]
where \(\mu\) and \(\sigma\) are the mean and standard deviation computed over the \(d\) dimensions of a single token's representation.
In the transformer, every sublayer (whether multi-head attention or feed-forward network) is wrapped with both mechanisms. Writing \(\text{Sublayer}(x)\) for whichever operation is being applied:
\[\text{LayerNorm}(x + \text{Sublayer}(x))\]
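A sketch of this wrapper, with a random linear map standing in for the sublayer:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each token (row) across its d features
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
n, d = 4, 16
gamma, beta = np.ones(d), np.zeros(d)            # learned in practice; identity here
W = rng.normal(size=(d, d)) / np.sqrt(d)
sublayer = lambda x: x @ W                       # stand-in for attention or the FFN

x = rng.normal(loc=3.0, scale=2.0, size=(n, d))  # deliberately off-center input
y = layer_norm(x + sublayer(x), gamma, beta)     # LayerNorm(x + Sublayer(x))
```

After the wrapper, each row of `y` has mean 0 and standard deviation 1 regardless of how off-center the input was.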
The combination is simple but essential. Without residual connections, gradients in a 6-layer transformer would need to pass through 12 nonlinear sublayers. With them, the shortest gradient path is a direct identity connection from output to input.
The Feed-Forward Network
At this point we have a mechanism that lets tokens talk to each other (attention) and a way to keep the network stable as it gets deeper (residual connections and layer normalization). What we do not yet have is a way to transform each token's representation individually.
Consider what attention actually computes. Each output token is a weighted sum of value vectors: \(\sum_j A_{ij} \mathbf{v}_j\). Given fixed weights, this is linear in the values. Softmax introduces nonlinearity in computing the weights, but attention acts primarily as a routing mechanism: it decides how much information to gather from each position. What it does not provide is a nonlinear transformation of the gathered information at each position. Routing can blend, average, and redistribute, but it cannot implement the conditional, context-dependent computation that language requires. A translation model, for instance, needs to look at the context gathered by attention and then decide which word to produce. That decision is inherently nonlinear: small changes in context can change the output completely.
The feed-forward network provides this nonlinearity. It is applied independently and identically to each token. No information flows between tokens in this step. Attention gathers; the feed-forward network processes.
The network is a two-layer MLP with a ReLU activation:
\[\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)\,W_2 + b_2\]
where \(W_1 \in \mathbb{R}^{d \times d_{ff}}\) and \(W_2 \in \mathbb{R}^{d_{ff} \times d}\). The inner dimension \(d_{ff}\) is typically much larger than \(d\). In the original transformer, \(d = 512\) and \(d_{ff} = 2048\), a 4x expansion. The first layer projects each token into a higher-dimensional space where the ReLU activation zeroes out roughly half the dimensions. This selective activation is the nonlinearity: different tokens activate different neurons. The second layer projects back to dimension \(d\).
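The two layers in NumPy, with toy sizes. Because the same weights are applied to every row, processing a token alone or alongside its neighbors gives the same result:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)  # expand to d_ff; ReLU zeroes negatives
    return hidden @ W2 + b2                # project back down to d

rng = np.random.default_rng(0)
n, d, d_ff = 4, 16, 64                     # toy sizes; the paper uses 512 and 2048
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

x = rng.normal(size=(n, d))
out = ffn(x, W1, b1, W2, b2)               # same shape in and out: (n, d)
```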
The Encoder
The encoder's job is to read the entire input and build a representation where each token knows about every other token. It does this through repeated refinement. Each layer lets every token attend to the full sequence, then individually processes what it gathered. Stack enough layers, and the representation converges: each token's vector reflects the entire input, weighted by learned relevance.
The encoder is a stack of \(N\) identical blocks (\(N = 6\) in the original transformer). Each block contains two sublayers:
1. Multi-head self-attention
2. Position-wise feed-forward network
Each sublayer is wrapped with a residual connection and layer normalization.
"Self-attention" means the queries, keys, and values all come from the same sequence. Every token attends to every other token in the input, including itself. There is no restriction on which positions can communicate.
The input enters as \(X' = X + PE\) (embeddings plus positional encoding). After \(N\) blocks of self-attention and feed-forward processing, the output is a matrix of the same shape: \(\mathbb{R}^{n \times d}\). The dimensionality is unchanged, but the representations have been refined through six rounds of contextual mixing.
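The pieces assemble into a block in a few lines. This sketch uses single-head attention with no output projection, and omits biases and the learned scale and shift in layer norm, purely for brevity:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # gamma = 1, beta = 0 for brevity
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(X, W_Q, W_K, W_V):
    d_k = W_Q.shape[1]
    A = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k))
    return A @ (X @ W_V)

def encoder_block(X, W_Q, W_K, W_V, W1, W2):
    # sublayer 1: self-attention, wrapped in residual + layer norm
    X = layer_norm(X + self_attention(X, W_Q, W_K, W_V))
    # sublayer 2: position-wise feed-forward, wrapped the same way
    return layer_norm(X + np.maximum(0.0, X @ W1) @ W2)

rng = np.random.default_rng(0)
n, d, d_ff = 5, 16, 32
p = lambda *shape: rng.normal(size=shape) / np.sqrt(shape[0])
params = (p(d, d), p(d, d), p(d, d), p(d, d_ff), p(d_ff, d))

X = rng.normal(size=(n, d))
for _ in range(6):                 # stack N = 6 blocks; the shape never changes
    X = encoder_block(X, *params)
```

Because every block maps \(\mathbb{R}^{n \times d}\) to \(\mathbb{R}^{n \times d}\), stacking is just repeated application.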
The Decoder
The encoder can see the entire input at once. The decoder cannot. It must generate the output sequence one token at a time, conditioning each prediction on what it has produced so far and on what the encoder learned about the input. This asymmetry (full visibility on the input side, sequential generation on the output side) is what shapes the decoder's structure.
The decoder is a stack of \(N\) identical blocks, each with three sublayers:
1. Masked multi-head self-attention
2. Multi-head cross-attention
3. Position-wise feed-forward network
Each sublayer is wrapped with a residual connection and layer normalization, just as in the encoder.
Masked self-attention. During training, the decoder sees the entire target sequence at once for parallelism. But each position should only attend to earlier positions, not future ones. Otherwise the model could simply copy the answer instead of learning to predict it. The mask sets all entries above the diagonal of the score matrix \(S\) to \(-\infty\) before softmax. After softmax, these entries become zero. Token \(i\) can only attend to tokens \(1\) through \(i\).
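The mask in code. With uniform (zero) scores, token \(i\) spreads its weight evenly over positions 1 through \(i\) and gives exactly zero weight to the future:

```python
import numpy as np

n = 4
S = np.zeros((n, n))                               # stand-in score matrix
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
S = np.where(mask, -np.inf, S)                     # future positions get -inf

A = np.exp(S - S.max(axis=-1, keepdims=True))      # row-wise softmax;
A /= A.sum(axis=-1, keepdims=True)                 # exp(-inf) = 0, so future weight is 0
```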
Cross-attention. This is how the decoder reads the encoder's output. The queries come from the decoder's current representations. The keys and values come from the encoder's final output. Each decoder token can attend to every encoder token, selecting the source information relevant for producing the next output token. The mechanism is identical to self-attention, except that \(K\) and \(V\) originate from a different sequence than \(Q\).
Autoregressive generation. At inference time, the decoder produces tokens one at a time. It begins with a start-of-sequence token, generates a probability distribution over the vocabulary, selects the most likely next token (or samples from the distribution), appends it to the input, and repeats. Each step conditions on all previously generated tokens through the masked self-attention. Generation terminates when the model produces an end-of-sequence token or reaches a maximum length.
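The generation loop itself is simple; everything model-specific hides inside the next-token distribution. Here a hand-written stub over a 5-token vocabulary (an assumption for illustration, not a real decoder) stands in for the full stack:

```python
import numpy as np

BOS, EOS = 0, 1   # start- and end-of-sequence token ids in this toy vocabulary

def next_token_distribution(generated):
    # Stub: a real model would run masked self-attention, cross-attention,
    # the FFN, the output projection, and softmax over the generated prefix.
    probs = np.full(5, 0.1)
    probs[EOS if len(generated) >= 4 else len(generated) + 1] = 0.6
    return probs / probs.sum()

def greedy_decode(max_len=10):
    generated = [BOS]                        # begin with the start token
    while len(generated) < max_len:
        probs = next_token_distribution(generated)
        token = int(np.argmax(probs))        # greedy: take the most likely token
        generated.append(token)              # append and condition on it next step
        if token == EOS:
            break                            # stop at end-of-sequence
    return generated

out = greedy_decode()
```

Swapping `np.argmax` for sampling from `probs` turns greedy decoding into stochastic generation; the loop is otherwise unchanged.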
The Complete Architecture
Everything assembles into a single model:
1. The source sequence is embedded and summed with positional encoding.
2. The result passes through \(N\) encoder blocks.
3. The target sequence, shifted right by one position, is embedded and summed with positional encoding. "Shifted right" means a start-of-sequence token is prepended, so the token at each position is the one the model should have predicted at the previous step. If the target is "Je suis étudiant \(\langle\text{end}\rangle\)", the decoder receives "\(\langle\text{start}\rangle\) Je suis étudiant". The decoder never sees the token it is trying to predict (it always predicts one step ahead).
4. The result passes through \(N\) decoder blocks, with cross-attention connected to the encoder output.
5. A linear layer projects the decoder output to the vocabulary size.
6. Softmax produces a probability distribution over the next token.
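The "shifted right" target from step 3, in code. This is plain list manipulation on the example sentence:

```python
# Teacher forcing: the decoder input is the target shifted right by one.
target = ["Je", "suis", "étudiant", "<end>"]
decoder_input = ["<start>"] + target[:-1]

# At position t the decoder reads decoder_input[: t + 1] and must predict target[t],
# so it never sees the token it is currently trying to predict.
pairs = list(zip(decoder_input, target))
```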
The original transformer uses \(N = 6\), \(d = 512\), \(h = 8\), \(d_k = d_v = 64\), and \(d_{ff} = 2048\). The total parameter count is approximately 65 million.
The architecture has since split into three families. Encoder-only models such as BERT[4] use the encoder with bidirectional self-attention, trained with masked language modeling. They are effective for tasks that require understanding the complete input: classification, entity recognition, semantic similarity. Decoder-only models such as GPT[5] use the decoder with causal masking, trained to predict the next token. They are effective for generation: text completion, dialogue, code synthesis. The original encoder-decoder form persists in sequence-to-sequence tasks: translation, summarization, speech recognition.
The specific architecture varies, but the components are the same. Self-attention, feed-forward networks, residual connections, layer normalization, positional encoding. The transformer's lasting contribution is not any single configuration but a set of composable primitives that scale.
For visual explanations, 3Blue1Brown's videos on attention and how transformers use it are excellent.