What a Frozen DistilBERT Already Knows
What are Linear Probes?
Imagine you have a pretrained language model and want to use it for a new task, like classifying movie reviews as positive or negative. The standard recipe is to fine-tune: continue training the model on your dataset until it performs the task. In a previous post, fine-tuning DistilBERT on the full IMDB dataset was expected to land around 91-93% test accuracy after three epochs. I use 92% as a representative reference point here.
That number leaves something unsaid. Fine-tuning updates the weights of a 66-million-parameter model, which is a heavy intervention. How much of that performance requires those updates, and how much is already linearly readable from the frozen model?
Most of it is already linearly decodable. If we freeze DistilBERT and train only a logistic regression on top of its activations, we reach 85.9% on the same test set. Fine-tuning closes roughly a six-point gap on top. The tool that measures this is called a linear probe, and the rest of this post is about what it is and what it tells us.
What pretraining quietly leaves behind
To interpret the probe, we first need to understand why a pretrained model knows anything about sentiment in the first place. DistilBERT was trained with a combination of masked language modeling, distillation from BERT, and a cosine loss that aligns hidden states with the teacher model. Along the way, it picks up something else for free.
To predict a masked word, the model has to learn which words appear in which contexts. The words terrible, awful, and boring appear in similar sentences, so the model assigns them similar embedding vectors. The same is true for masterpiece, brilliant, and stunning. Words sort themselves in vector space by meaning, and sentiment, which is mostly a property of word identity, ends up represented in the geometry of the embedding table.
So when we later fine-tune the model for sentiment classification, we are not teaching it sentiment from scratch. We are nudging an existing structure into a form that lines up with our labels. The natural question is how much signal is already available before that nudging. A linear probe is the way to measure it.
The probe
A linear probe is the simplest classifier we can place on top of a frozen model. We pick a layer, run a forward pass for each input, read off the activations at that layer, and train a logistic regression on those activations to predict our labels. The model's weights are never updated.
In equations, let \(h_\ell(x) \in \mathbb{R}^{768}\) be the activation vector that DistilBERT produces at layer \(\ell\) when given input \(x\), reduced to one vector per input (the pooling step is described under Setting up the experiment). The probe predicts
\[ \hat{y} = \sigma\!\left( w^\top h_\ell(x) + b \right), \]
where \(w \in \mathbb{R}^{768}\) and \(b\) are the parameters learned during probe training. DistilBERT produces \(h_\ell(x)\), but its weights stay fixed. The only thing training changes is the linear boundary placed on top of those frozen activations. The probe has 769 learned parameters; DistilBERT has 66 million frozen ones. The probe sees only the activations, never the model weights.
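The probe equation can be made concrete with a minimal sketch. It assumes toy 3-dimensional vectors standing in for the 768-dimensional frozen activations, and it uses plain gradient descent as a stand-in for the convex solver used in the actual experiment. The names `probe_predict` and `fit_probe` and the hyperparameters are illustrative, not taken from the experiment's code.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def probe_predict(w, b, h):
    """Return sigma(w^T h + b) for one frozen activation vector h."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def fit_probe(H, y, l2=1e-2, lr=0.5, steps=500):
    """Fit (w, b) by gradient descent on L2-regularized log loss.

    H holds activation vectors that are treated as fixed inputs;
    only w and b are updated, mirroring the frozen-model setup.
    """
    d, n = len(H[0]), len(H)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for h, target in zip(H, y):
            err = probe_predict(w, b, h) - target
            for j in range(d):
                grad_w[j] += err * h[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy stand-in for frozen activations: 4 vectors in R^3 instead of R^768.
H = [[1.0, 0.2, 0.0], [0.9, -0.1, 0.3], [-1.0, 0.1, 0.2], [-0.8, 0.0, -0.1]]
y = [1, 1, 0, 0]
w, b = fit_probe(H, y)

# d weights plus one bias; with d = 768 this gives the 769 in the text.
n_params = len(w) + 1
```

The model that produced `H` never appears in the training loop, which is the sense in which the probe sees only activations.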
If the probe achieves high accuracy, the labels are linearly decodable from the activations. Geometrically, there exists a hyperplane in 768-dimensional space that separates many positive reviews from many negative reviews. The representation already contains sentiment-relevant directions before task-specific fine-tuning.
The choice of a linear classifier matters. A deep classifier could perform its own computation on the activations and recover information the model itself never expressed. A linear classifier cannot. It only sees what a single weighted sum can express. So a successful linear probe is a statement about the model's representation, not about the cleverness of the probe.
This connects to an idea called the linear representation hypothesis: high-level concepts are encoded as approximately linear directions in activation space. Park, Choe, and Veitch made it precise in 2024[1]. Linear probes are the natural way to test it. The technique itself goes back to Alain and Bengio in 2016[0], who used it to understand what convolutional networks learn at different depths.
One caveat upfront. A linear probe measures whether information is decodable from a representation. It does not measure whether the model uses that information. The model could carry sentiment in its activations and never read it. We return to this near the end.
Setting up the experiment
For the probe to be informative, we need to be careful about three things: which layers we read, how we pool the activations, and which model we are probing.
DistilBERT[2] processes a sentence into a sequence of token vectors and exposes seven layers we can read from. Layer 0 is the embedding output: each token has been replaced by its word embedding plus a position embedding, and no attention has mixed information between tokens yet. Layers 1 through 6 are the outputs of six transformer blocks stacked on top, each applying attention and a feed-forward network. We probe all seven, so we can see how linear decodability evolves with depth.
Each layer produces one vector per token, but our probe needs one vector per sentence, so we pool. The default choice in the BERT family is to use the activation at the first position, called the \([\text{CLS}]\) token. We avoid this for two reasons. First, DistilBERT was not trained with BERT's next-sentence prediction objective, so \([\text{CLS}]\) was not given that sentence-level auxiliary role during pretraining. Second, at layer 0 the \([\text{CLS}]\) activation is literally a constant (the \([\text{CLS}]\) word embedding plus position-0 embedding, with no dependence on the input), and a probe on a constant cannot learn anything.
Instead, we average across token positions. If a sentence has \(T\) tokens and \(h_\ell^{(t)}(x)\) is the activation at position \(t\) of layer \(\ell\), the pooled representation is
\[ \bar{h}_\ell(x) = \frac{1}{T} \sum_{t=1}^{T} h_\ell^{(t)}(x). \]
This is a bag-of-words view: the order of tokens is collapsed, but their content is averaged. For a task like sentiment, which depends mostly on which words appear, it is a sensible default.
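The pooling formula is short enough to sketch directly. The version below works on plain Python lists and adds an optional padding mask, which is an assumption on my part: the formula above only sums over the \(T\) real tokens, and a mask is one common way to get there when sequences are padded to a shared length. The name `mean_pool` is illustrative.

```python
def mean_pool(token_vectors, mask=None):
    """Average per-token vectors into one sentence vector.

    token_vectors: list of per-token activation vectors at one layer.
    mask: optional list of 0/1 flags (assumed convention: 1 = real token,
          0 = padding). Padding positions are left out of the average.
    """
    if mask is None:
        mask = [1] * len(token_vectors)
    kept = [v for v, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("cannot pool an empty sequence")
    T = len(kept)
    dim = len(kept[0])
    return [sum(v[j] for v in kept) / T for j in range(dim)]

# Three real tokens and one padding position, in 2 dimensions.
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
pooled = mean_pool(tokens, mask=[1, 1, 1, 0])

# Shuffling the real tokens leaves the pooled vector unchanged.
pooled_shuffled = mean_pool([tokens[2], tokens[0], tokens[1]])
```

The second call illustrates the bag-of-words property: the average does not depend on the order in which the token vectors arrive.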
The dataset matches the fine-tuning post: 25,000 IMDB reviews[3] for training and 25,000 for testing, perfectly balanced. The probe is logistic regression with L2 regularization, fit with a standard convex solver[4]. The one knob, the regularization strength, is chosen once on a held-out validation slice and reused for every probe. No layer gets to tune its own regularizer.
To understand what pretraining contributes, we need something to compare it against.
A control: the same model with random weights
A pretrained DistilBERT contains two distinct sources of learned information: the embedding table, which assigns each word a vector, and the stack of transformer blocks, which turns those vectors into context-sensitive representations. We want to compare the effect of each.
The natural comparison is a randomly initialized DistilBERT: same architecture, weights drawn from the default initialization, never trained. The gap between the pretrained model and this control estimates what learned weights add on top of the architecture and tokenizer.
What accuracy should we expect from this random model? The naive guess is 50%. The actual answer is much higher, and the reason matters.
At layer 0, mean-pooled random embeddings behave like a random projection of token counts. The Johnson-Lindenstrauss lemma says that a random projection into enough dimensions approximately preserves the pairwise distances between a finite set of points, so lexical separability can survive this projection. The later random transformer blocks are not just linear projections: they include attention, nonlinearities, layer normalization, and residual connections. Even so, they can leave enough lexical signal for a linear probe to recover.
The random-init baseline is therefore not measuring "the model knowing nothing." It measures how much sentiment is already separable from token identity after random embedding and random mixing. The gap to the pretrained model is a useful estimate of what pretraining contributes on top of the architecture.
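The claim that mean-pooled random embeddings act like a random projection of token counts can be checked on a toy example. The sketch below assumes a bare embedding lookup with no position embeddings or layer normalization, and uses made-up sizes and token ids. Under those assumptions the two computations give the same vector up to floating-point error.

```python
import random

rng = random.Random(0)
vocab_size, dim = 50, 8  # toy sizes; the real model uses 768 dimensions

# Random, untrained embedding table: one Gaussian vector per token id.
E = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def mean_pooled_embedding(token_ids):
    """Look up each token's vector and average over positions."""
    T = len(token_ids)
    return [sum(E[t][j] for t in token_ids) / T for j in range(dim)]

def projected_counts(token_ids):
    """Build normalized token counts, then multiply by the table E."""
    T = len(token_ids)
    counts = [0] * vocab_size
    for t in token_ids:
        counts[t] += 1
    return [
        sum(counts[v] / T * E[v][j] for v in range(vocab_size))
        for j in range(dim)
    ]

review = [3, 17, 3, 42, 8, 17, 3]  # hypothetical token ids
a = mean_pooled_embedding(review)
b = projected_counts(review)
max_diff = max(abs(x - y) for x, y in zip(a, b))
```

Since the pooled vector is a linear function of the count vector, any separation between reviews that exists in count space has a chance of surviving in the projected space, which is what the probe then picks up.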
To make the comparison tight, we fit each probe five times on bootstrap resamples of the training data and report mean and standard deviation. With 25,000 examples, the standard deviation comes out to roughly 0.002, invisible at the scale of the plot. That is the outcome we want.
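The resampling procedure can be sketched as follows. Everything here is a toy stand-in: the data are one-dimensional Gaussians rather than DistilBERT activations, the "probe" is a midpoint threshold rather than logistic regression, and `bootstrap_accuracy` and `fit_and_score` are hypothetical names. The sketch also assumes the sample standard deviation; the text does not say which estimator was used.

```python
import random
import statistics

def bootstrap_accuracy(fit_and_score, n_train, n_resamples=5, seed=0):
    """Fit on bootstrap resamples of the training indices and summarize."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        # Sample n_train indices with replacement.
        idx = [rng.randrange(n_train) for _ in range(n_train)]
        scores.append(fit_and_score(idx))
    return statistics.mean(scores), statistics.stdev(scores)

def make_toy_data(n, rng):
    """One feature per example: mean +1 for label 1, mean -1 for label 0."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        data.append((rng.gauss(1.0 if label else -1.0, 1.0), label))
    return data

data_rng = random.Random(1)
train = make_toy_data(500, data_rng)
test = make_toy_data(500, data_rng)

def fit_and_score(idx):
    """Toy classifier: threshold halfway between the two class means."""
    sample = [train[i] for i in idx]
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    threshold = (statistics.mean(pos) + statistics.mean(neg)) / 2
    correct = sum((x > threshold) == bool(y) for x, y in test)
    return correct / len(test)

mean_acc, std_acc = bootstrap_accuracy(fit_and_score, n_train=len(train))
```

The test set stays fixed across resamples, so the reported spread reflects only variation in the training sample.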
The plot
This is the result.
Read it from top to bottom along the right edge. A representative fine-tuned reference sits at 92% (the plum dashed line). The pretrained DistilBERT, probed at its final layer, reaches 85.9%. The same probe on a random-init DistilBERT reaches 75.1%. Chance, the slate dotted floor, is 50%.
The pretrained curve rises gently from 82.2% at layer 0 to 85.9% at layer 6. The random-init curve declines from 77.4% at layer 0 to 75.1% at layer 6. Both curves are essentially straight, with no sharp transitions. The story the plot tells is gradual, not dramatic.
The next step is to compare the gaps between these reference levels.
What the gaps show
These gaps are comparisons, not a causal budget. The embedding comparison uses layer 0. The transformer comparison follows the pretrained curve from layer 0 to the top.
| Reference level | Accuracy | Gap vs. previous row (points) |
|---|---|---|
| Chance | 50.0% | |
| Random embeddings (no transformer) | 77.4% | +27.4 |
| Pretrained embeddings (no transformer) | 82.2% | +4.8 |
| Pretrained transformer, top layer | 85.9% | +3.7 |
| Fine-tuned (gradient updates on all weights) | ~92.0% | +6.1 |
The largest gap is the first one: token identity, projected through random embeddings and averaged over a review, already gets 27 points above chance. Pretrained embeddings add about 5 points. The pretrained transformer stack adds about 4 more. A typical fine-tuned run sits about 6 points above the frozen top-layer probe.
The random transformer is not a positive step here. Six untrained blocks move accuracy from 77.4% to 75.1%. The short version: IMDB sentiment is strongly lexical; pretraining refines that signal; fine-tuning improves it further.
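The gap arithmetic is easy to reproduce from the accuracies reported in this post. The snippet below simply subtracts consecutive reference levels; the variable names are illustrative.

```python
# Accuracies (percent) as reported in the table and the plot.
levels = [
    ("Chance", 50.0),
    ("Random embeddings (no transformer)", 77.4),
    ("Pretrained embeddings (no transformer)", 82.2),
    ("Pretrained transformer, top layer", 85.9),
    ("Fine-tuned (representative)", 92.0),
]

# Gap of each level over the one before it, in percentage points.
gaps = [round(cur[1] - prev[1], 1) for prev, cur in zip(levels, levels[1:])]

# Random-init curve: layer 0 versus layer 6.
random_stack_change = round(75.1 - 77.4, 1)

for (name, acc), gap in zip(levels[1:], gaps):
    print(f"{name}: {acc:.1f}% ({gap:+.1f})")
print(f"Random transformer stack: {random_stack_change:+.1f}")
```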
Why the curves look the way they do
Layer 0 is strong because sentiment is lexical. Words like terrible and brilliant appear in different contexts, so pretrained embeddings already place them in useful regions of space. Averaging those embeddings over a review gives the probe a strong feature.
The pretrained curve climbs near the top because upper transformer layers can integrate context that pure word averaging misses. Tenney, Das, and Pavlick observed a related pattern in BERT in 2019[5]: lexical features at the bottom, syntactic features in the middle, semantic features near the top.
The random-init curve declines because random depth is not free. Attention, layer normalization, nonlinearities, and residual mixing can blur a useful lexical signal even if much of it survives. Pretraining turns that stack from a small cost into a benefit.
What probes do not tell you
A probe is a measurement tool, not a causal one. It finds information that is decodable from activations. It does not show that the model uses that information when producing an output. Belinkov's 2021 survey is the standard reference on this distinction[6].
To make a causal claim, you intervene on the activations and watch the output change. If erasing the probe direction changes the model's behavior, the model uses it. If not, the direction was decodable but unused.
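One simple way to erase a direction is to project it out of the activation. The sketch below is an assumption about how such an intervention might look, not a description of any experiment run for this post: it removes the component of \(h\) along the probe weight vector \(w\), so the probe's pre-activation \(w^\top h\) becomes zero. The function name `erase_direction` is hypothetical.

```python
def erase_direction(h, w):
    """Remove the component of h along w: h - (h.w / w.w) * w."""
    ww = sum(wi * wi for wi in w)
    if ww == 0.0:
        raise ValueError("w must be a nonzero direction")
    coef = sum(hi * wi for hi, wi in zip(h, w)) / ww
    return [hi - coef * wi for hi, wi in zip(h, w)]

# Toy 3-dimensional example.
w = [2.0, 0.0, 1.0]   # stand-in for a learned probe direction
h = [1.0, 3.0, 3.0]   # stand-in for one activation vector
h_erased = erase_direction(h, w)

# After erasure, the activation has no component along w.
residual = sum(hi * wi for hi, wi in zip(h_erased, w))
```

In a real intervention the edited activation would then be passed through the remaining layers, and the change in the model's output is what carries the causal evidence.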
Two smaller caveats remain. A probe can find dataset correlates rather than the intended concept. And not every feature is linear: Engels and colleagues showed in 2024 that some concepts are encoded as multi-dimensional features, such as circular arrangements, rather than single directions[7]. For this experiment, the numbers still measure what they claim: how much IMDB sentiment is linearly decodable at each layer.
Why this matters
Without a probe, it is easy to attribute the whole 92% result to fine-tuning because fine-tuning is the visible step. The probe shows that a linear readout of the frozen model already reaches about 86%. The remaining gap matters, but it is smaller than the signal already present before task-specific training.
The practical rule is simple: before fine-tuning, run the probe. If it is already high, ask whether fine-tuning is worth the extra compute. If it is near chance, the frozen representation does not linearly expose what the task needs, so fine-tuning has more to do. A probe is not a discovery. It is a measurement you make before committing to gradient updates.