Where Fine-Tuning Rewrites DistilBERT
What is Centered Kernel Alignment (CKA)?
In a previous post, a linear probe on a frozen DistilBERT already reached about 86% accuracy on IMDB sentiment, and fine-tuning added roughly six points on top. That left a question hanging. Fine-tuning updates all 66 million weights, but does it change the whole network or only part of it?
The probe measured how much. It could not say where. To localize the change we need a way to compare two models layer by layer. That tool is Centered Kernel Alignment, or CKA.
The short answer: fine-tuning barely touches the bottom of the network and rewrites the top. The embedding layer is essentially identical before and after. The final block is almost unrecognizable. And the layers that change most are largely the same layers where decodability improves most.
What CKA measures
Run the same set of reviews through two models. At a chosen layer, each model produces a cloud of activation vectors, one per review. CKA asks whether the geometry of those two clouds matches: do reviews that sit close together in one representation stay close in the other? It answers with a single number in \([0, 1]\). A value of 1 means the two representations are identical up to a rotation and a global rescaling. A value of 0 means they are unrelated.
The invariances are the whole point. CKA is invariant to two transformations that should not count as a real change: rotating the representation (an orthogonal transform) and rescaling it (isotropic scaling). It is deliberately not invariant to every invertible linear map. Kornblith and colleagues showed that any index with that stronger invariance cannot meaningfully compare representations of higher dimension than the number of examples, and fails to line up layers across networks trained from different seeds[1]. CKA gives up the extra invariance to gain the ability to compare.
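The two invariances, and the deliberate lack of a third, are easy to check numerically. The sketch below is illustrative only: it uses the feature-space form of linear CKA given later in the math section, and the synthetic matrix `X`, the random rotation `Q`, and the anisotropic stretch `A` are assumptions made up for the demonstration, not activations from either checkpoint.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (N, d) activation matrices whose rows are aligned by example."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature column
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
# Synthetic stand-in for activations, with a different variance along each direction.
X = rng.normal(size=(500, 16)) * np.linspace(0.5, 4.0, 16)

Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix (rotation/reflection)
A = np.diag(np.linspace(10.0, 0.1, 16))         # invertible, but stretches each axis differently

cka_rotated = linear_cka(X, X @ Q)      # orthogonal transform: stays at 1
cka_scaled = linear_cka(X, 3.7 * X)     # isotropic rescaling: stays at 1
cka_stretched = linear_cka(X, X @ A)    # general invertible map: drops below 1

print(cka_rotated, cka_scaled, cka_stretched)
```

The first two values come out at 1 up to floating-point error, while the third falls clearly below it, which is the behavior the paragraph describes: CKA treats an anisotropic stretch as a real change in geometry.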
One framing to keep straight. CKA measures whether the representational geometry changed. It does not measure whether the two layers compute the same function, nor task performance, nor whether the model actually uses the representation. It is a similarity of structure, not of mechanism.
The math
This section works through where CKA comes from. If you only want the results, skip to the setup and figures below; nothing later depends on following the derivation.
Before any formula, the objects CKA operates on. Both ingredients are matrices, but they are not the activation matrices themselves; they are similarity matrices built from them.
The Gram matrices \(K\) and \(L\). We are comparing two representations of the same reviews: here, a layer of the pretrained DistilBERT against the corresponding layer of the fine-tuned one. Take \(N\) reviews and run them through the first of the two (say, the pretrained model) at a fixed layer. Each review becomes one activation vector. Now form an \(N \times N\) matrix \(K\) whose entry \(K_{ij}\) is the similarity between the representations of review \(i\) and review \(j\). With a linear kernel that similarity is just the dot product of their two activation vectors. So \(K\) records the full pairwise-similarity structure of the representation: how every review relates to every other one. This is called the Gram matrix. The matrix \(L\) is the exact same construction applied to the second representation (the fine-tuned model). The key detail is that \(K\) and \(L\) are indexed by examples, not by neurons. They are both \(N \times N\) regardless of how wide each model's hidden layer is, which is what lets us compare two representations of different dimension on equal footing.
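As a small illustration of the point about indexing, here is a sketch that builds linear-kernel Gram matrices for two representations of different width. The helper name `linear_gram` and the random arrays standing in for activations are assumptions for the example; only the shapes matter.

```python
import numpy as np

def linear_gram(acts):
    """Return the (N, N) matrix of dot products between per-example activation vectors."""
    return acts @ acts.T

rng = np.random.default_rng(0)
n_reviews = 6
acts_a = rng.normal(size=(n_reviews, 768))  # hypothetical layer of width 768
acts_b = rng.normal(size=(n_reviews, 512))  # hypothetical layer of width 512

K = linear_gram(acts_a)
L = linear_gram(acts_b)

# Both matrices are indexed by examples, so their shapes match even though the widths differ.
print(K.shape, L.shape)
```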
HSIC. The Hilbert-Schmidt Independence Criterion measures how much two similarity structures agree[2]. Intuitively, \(\text{HSIC}(K, L)\) is large when pairs of reviews that look similar under the pretrained model (\(K\)) also look similar under the fine-tuned one (\(L\)), and small when the two notions of similarity are unrelated. Gretton and colleagues introduced it as a statistical test of independence: it is zero when the two sets of variables are independent and grows as their structures line up. Concretely, it centers both matrices and correlates them,
\[ \text{HSIC}(K, L) = \frac{1}{(N-1)^2}\,\operatorname{tr}(K H L H), \qquad H = I_N - \tfrac{1}{N}\mathbf{1}\mathbf{1}^\top, \]
where the centering matrix \(H\) subtracts row and column means (this is the "centered" in centered kernel alignment), and the trace adds up how much the two centered similarity matrices overlap. So HSIC is essentially an unnormalized correlation between the two pairwise-similarity patterns.
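The HSIC formula translates almost symbol for symbol into NumPy. This is a minimal sketch of the empirical estimator written above, not the code behind the post's figures; the synthetic `X`, `Y`, and `Z` are assumptions chosen so that one pair shares structure and the other does not.

```python
import numpy as np

def hsic(K, L):
    """Empirical HSIC between two (N, N) Gram matrices: tr(K H L H) / (N - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
n, d = 200, 8
X = rng.normal(size=(n, d))
Y = X + 0.1 * rng.normal(size=(n, d))  # nearly the same pairwise structure as X
Z = rng.normal(size=(n, d))            # drawn independently of X

K = X @ X.T
hsic_related = hsic(K, Y @ Y.T)
hsic_unrelated = hsic(K, Z @ Z.T)

print(hsic_related, hsic_unrelated)
```

On a finite sample the unrelated pair does not give exactly zero, only a value much smaller than the related pair.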
HSIC has one inconvenient property: it is not invariant to isotropic scaling, so multiplying one representation by a constant changes the number. We fix that by normalizing, which gives CKA:
\[ \text{CKA}(K, L) = \frac{\text{HSIC}(K, L)}{\sqrt{\text{HSIC}(K, K)\,\text{HSIC}(L, L)}}. \]
The denominator divides out each representation's "self-similarity," turning HSIC into a correlation-like quantity bounded in \([0, 1]\). This normalized index is exactly the centered kernel alignment of Cortes and colleagues[3]. With a linear kernel it collapses to a compact expression directly in the feature matrices \(X, Y \in \mathbb{R}^{N \times d}\), after centering each column:
\[ \text{CKA}_{\text{lin}}(X, Y) = \frac{\lVert X^\top Y \rVert_F^2}{\lVert X^\top X \rVert_F \, \lVert Y^\top Y \rVert_F}. \]
We use the linear kernel. Kornblith and colleagues report that it agrees with the RBF kernel across most experiments[1], and this form never builds the \(N \times N\) Gram matrices: it works with \(d \times d\) products instead, which is cheaper whenever \(N > d\). That is our regime (2,000 reviews, 768 dimensions).
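The Gram-matrix definition and the feature-space shortcut should give the same number, because with \(K = XX^\top\) and \(L = YY^\top\) the trace \(\operatorname{tr}(KHLH)\) equals \(\lVert X^\top Y \rVert_F^2\) once the columns are centered. The sketch below checks that numerically. The function names and the synthetic `X` and `Y` (deliberately of different widths) are assumptions for illustration.

```python
import numpy as np

def hsic(K, L):
    """Empirical HSIC between two (N, N) Gram matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def cka_from_grams(K, L):
    """CKA as normalized HSIC, computed from Gram matrices."""
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

def linear_cka(X, Y):
    """Linear CKA computed directly from feature matrices, without any (N, N) matrix."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 32))
# A noisy nonlinear function of X, with a different width, as a stand-in second representation.
Y = np.tanh(X @ rng.normal(size=(32, 24))) + 0.5 * rng.normal(size=(300, 24))

gram_form = cka_from_grams(X @ X.T, Y @ Y.T)
feature_form = linear_cka(X, Y)

print(gram_form, feature_form)
```

The two values agree to floating-point precision; the feature form only ever multiplies matrices whose sizes depend on the widths, not on the number of reviews squared.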
We report two views of the comparison. The diagonal compares each layer to itself across the two checkpoints,
\[ s_\ell = \text{CKA}\big(h_\ell^{\text{pre}},\, h_\ell^{\text{ft}}\big), \]
so \(s_\ell\) near 1 means fine-tuning left layer \(\ell\) intact, and a lower value means that layer was rewritten. The full matrix compares every pretrained layer against every fine-tuned layer,
\[ S_{\ell,\ell'} = \text{CKA}\big(h_\ell^{\text{pre}},\, h_{\ell'}^{\text{ft}}\big), \]
which tells us whether information stayed at the same depth or moved somewhere else in the stack.
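Both views come from one double loop over layers. The following is a sketch under stated assumptions: `cka_matrix` is a hypothetical helper, and the per-layer activations are synthetic (the bottom five layers are copied between "checkpoints" and the top two are redrawn) so that the expected pattern is known in advance. It is not the post's data.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (N, d) activation matrices whose rows are aligned by example."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def cka_matrix(pre_layers, ft_layers):
    """S[l, m] = CKA(pretrained layer l, fine-tuned layer m)."""
    S = np.zeros((len(pre_layers), len(ft_layers)))
    for l, h_pre in enumerate(pre_layers):
        for m, h_ft in enumerate(ft_layers):
            S[l, m] = linear_cka(h_pre, h_ft)
    return S

rng = np.random.default_rng(0)
n, d, n_layers = 400, 32, 7
pre_layers = [rng.normal(size=(n, d)) for _ in range(n_layers)]
# Synthetic "fine-tuned" stack: layers 0-4 unchanged, layers 5-6 replaced by fresh draws.
ft_layers = [h.copy() for h in pre_layers[:5]] + [rng.normal(size=(n, d)) for _ in range(2)]

S = cka_matrix(pre_layers, ft_layers)
s_diag = np.diag(S)  # the per-layer curve: each layer compared with itself across checkpoints

print(np.round(s_diag, 2))
```

The diagonal is just `np.diag` of the full matrix, so computing the matrix once gives both views. Note that the replaced layers score small but not exactly zero: linear CKA has a nonzero floor on finite samples.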
Setting up the comparison
The setup mirrors the probe post so the two read together. DistilBERT[4] exposes seven readable layers: layer 0 is the embedding lookup, layers 1 through 6 are the outputs of six transformer blocks. We read all seven and, as before, mean-pool each layer's token vectors into one vector per review.
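For concreteness, here is one way the mean-pooling step could look. It assumes each layer's hidden states arrive as an array of shape (batch, tokens, dim) alongside a 0/1 attention mask, and that padding tokens are excluded from the mean; the post does not say whether padding was masked, so treat that as an assumption of this sketch. `masked_mean_pool` is a hypothetical helper name.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors per example, ignoring positions where the mask is 0.

    hidden:         (batch, tokens, dim) hidden states for one layer
    attention_mask: (batch, tokens) with 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against an all-padding row
    return summed / counts

# Toy batch: the first example has one padding position holding junk values.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, attention_mask)
print(pooled)  # one vector per example; the padded junk does not leak into the mean
```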
We compare two checkpoints of the same architecture so that layer indices correspond one to one: a pretrained distilbert-base-uncased (checkpoint A) and a version fine-tuned on IMDB sentiment[5] (checkpoint B). Both are frozen; we only read activations.
Alignment is mandatory. CKA compares the two clouds example by example, so both models must see the identical reviews in the identical order. Feed them different inputs and the number is meaningless. We draw a balanced subset of 2,000 IMDB reviews[6], tokenize it once, and run that exact batch through both checkpoints.
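A short sketch of why the ordering matters. The sampling helper `balanced_indices`, the fixed seed, and the synthetic labels and activations are all assumptions for illustration; the post does not describe how its subset was drawn beyond it being balanced.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (N, d) activation matrices whose rows are aligned by example."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def balanced_indices(labels, n_per_class, seed=0):
    """Pick n_per_class examples from each class, in one fixed shuffled order."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picks = [
        rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
        for c in np.unique(labels)
    ]
    return rng.permutation(np.concatenate(picks))

labels = np.array([0] * 500 + [1] * 500)  # toy stand-in for sentiment labels
idx = balanced_indices(labels, n_per_class=100, seed=0)

rng = np.random.default_rng(1)
acts = rng.normal(size=(1000, 32))        # toy stand-in for one layer's pooled activations

aligned = linear_cka(acts[idx], acts[idx])                      # same reviews, same order
misaligned = linear_cka(acts[idx], acts[rng.permutation(idx)])  # same reviews, shuffled order

print(aligned, misaligned)
```

Even though the second comparison uses the very same set of examples, shuffling the row order drops the score from 1 to a small value. Computing the index list once and reusing it for both checkpoints avoids that failure.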
One note on the probe numbers later in this post: they are computed on this smaller 2,000-review set, so the absolute accuracies sit a little below the full-dataset probe in the previous post. The quantity that matters here is the per-layer gap fine-tuning opens up, not the absolute level.
The plot
This is the diagonal CKA, layer by layer.
The curve starts pinned at the top. The embedding layer has \(s_0 = 1.00\): mean-pooled token embeddings are essentially untouched by fine-tuning. Through blocks 1 to 4 the similarity erodes gently, to 0.97, then 0.90, then 0.84, then 0.78. Then it falls off a cliff. Layer 5 drops to 0.27 and the final block to 0.16. The top of the fine-tuned network is a substantially new representation, not a tweak of the pretrained one.
The cross-layer matrix
So far we have only compared matching layers. The full matrix checks whether the fine-tuned top layer still looks like any pretrained layer, even one lower in the stack.
The upper-left block is uniformly bright: pretrained layers 0 through 4 stay similar to fine-tuned layers 0 through 4, with the strongest matches on or just next to the diagonal. The last two columns are dark across every row. The fine-tuned top layers do not resemble any pretrained layer, at any depth. So fine-tuning did not relocate information up or down the stack. It built something new at the top while leaving the foundation in place.
CKA next to the probe
CKA says where the representation changed. The linear probe says whether the change helped the task. Putting them in the same table is the point of running both.
| Layer | CKA \(s_\ell\) | Pretrained acc | Fine-tuned acc | Δ acc | Layer status |
|---|---|---|---|---|---|
| 0 | 0.999 | 0.773 | 0.787 | +0.013 | untouched |
| 1 | 0.965 | 0.765 | 0.775 | +0.010 | refined |
| 2 | 0.901 | 0.762 | 0.795 | +0.033 | refined |
| 3 | 0.835 | 0.768 | 0.812 | +0.043 | rewritten |
| 4 | 0.779 | 0.780 | 0.860 | +0.080 | rewritten |
| 5 | 0.267 | 0.815 | 0.878 | +0.063 | rewritten |
| 6 | 0.158 | 0.798 | 0.887 | +0.088 | rewritten |
Read the CKA and Δ acc columns together. Where CKA stays near 1 (layers 0 and 1), fine-tuning adds almost nothing to decodability, about a point. Where CKA is lowest (layers 4 through 6), the probe gains the most, six to nine points. Layer 4 is the transitional case: its CKA is still 0.78, yet it already gains eight points. Change and improvement co-locate at the top of the network. The cliff in the CKA curve and the steepest probe gains land on nearly the same layers.
This is the cleanest version of a two-question diagnostic. "Did the layer change?" is the CKA question. "Did the change help?" is the probe question. The layers worth caring about are the ones that answer yes to both.
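The two-question diagnostic can be written down directly against the table above. In this sketch the CKA and Δ acc values are copied from the table, but the two cutoffs are hypothetical numbers picked only to illustrate the idea, and the rank correlation is an extra summary that the post itself does not report.

```python
# Per-layer values copied from the table (layers 0 through 6).
cka = [0.999, 0.965, 0.901, 0.835, 0.779, 0.267, 0.158]
delta_acc = [0.013, 0.010, 0.033, 0.043, 0.080, 0.063, 0.088]

# Hypothetical cutoffs, chosen for illustration only.
CKA_CHANGED_BELOW = 0.80   # "did the layer change?"
GAIN_AT_LEAST = 0.05       # "did the change help?"

yes_to_both = [
    layer
    for layer, (s, gain) in enumerate(zip(cka, delta_acc))
    if s < CKA_CHANGED_BELOW and gain >= GAIN_AT_LEAST
]

def ranks(values):
    """Rank positions starting at 1 (assumes no ties, which holds for these values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

change = [1 - s for s in cka]     # how far each layer moved
rho = spearman(change, delta_acc)

print(yes_to_both, round(rho, 3))
```

With these cutoffs the layers that answer yes to both are 4, 5, and 6. With only seven layers, the rank correlation is suggestive rather than a statistical claim.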
The pattern matches what others have found for BERT-family models: fine-tuning is concentrated in the upper layers, while lower layers, which carry general lexical and syntactic structure, are largely preserved[7][8].
What CKA does not tell you
CKA is a similarity of representations, not of computation. A low value tells you the geometry moved, but not why, and not whether the move was good. The probe is what supplies that second judgment; CKA alone would leave the top-layer collapse ambiguous.
Two more caveats. Linear CKA is one kernel choice; an RBF kernel can give different numbers, though in practice they tend to agree[1]. And CKA depends on the input distribution and the pooling: these results describe mean-pooled representations on IMDB reviews, and a different probe set or pooling could shift the curve. Earlier similarity indices such as SVCCA were part of the motivation for CKA: they struggled on the sanity check mentioned earlier, lining up corresponding layers across networks trained from different seeds[9].
Why this matters
If fine-tuning only rewrites the top of the network, you do not need to pay for gradients on the bottom. Freezing the lower layers, or using a low-rank adapter that only nudges the upper ones, should recover most of the benefit at a fraction of the cost. The probe told us how much signal was already present; CKA tells us where to spend the updates to improve it.
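To put a rough number on "a fraction of the cost", the sketch below counts how many encoder weights would still receive updates if the embeddings and the lower blocks were frozen. The layer sizes are the standard distilbert-base-uncased configuration, stated here as an assumption, and the choice to freeze four blocks is hypothetical. The classification head is left out, and trainable-parameter count is only a proxy for compute, not a measurement.

```python
# Assumed sizes: the standard distilbert-base-uncased configuration.
VOCAB, MAX_POS, DIM, FFN_DIM, N_BLOCKS = 30522, 512, 768, 3072, 6

def linear_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out

LAYER_NORM = 2 * DIM  # scale and shift vectors

embedding_params = VOCAB * DIM + MAX_POS * DIM + LAYER_NORM
block_params = (
    4 * linear_params(DIM, DIM)      # query, key, value, and output projections
    + linear_params(DIM, FFN_DIM)    # feed-forward expansion
    + linear_params(FFN_DIM, DIM)    # feed-forward contraction
    + 2 * LAYER_NORM                 # two layer norms per block
)
total_params = embedding_params + N_BLOCKS * block_params

def trainable_fraction(n_frozen_blocks):
    """Share of encoder weights still trainable when embeddings and the bottom blocks are frozen."""
    trainable = (N_BLOCKS - n_frozen_blocks) * block_params
    return trainable / total_params

# Hypothetical choice: freeze the embeddings and blocks 1-4, train only blocks 5 and 6.
frac_top_two = trainable_fraction(n_frozen_blocks=4)

print(total_params, round(frac_top_two, 3))
```

Under these assumptions, training only the top two blocks updates a little over a fifth of the encoder's weights. Whether that recovers most of the accuracy gain is the hypothesis the paragraph raises, not something this arithmetic shows.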
Run the probe to see how much. Run CKA to see where.