Fine-Tuning a Sentiment Classifier in 50 Lines

You have 50,000 movie reviews. Half are positive, half negative. You need a classifier that determines sentiment. Training from scratch takes days and costs thousands. Fine-tuning a pretrained model takes three epochs on a single Nvidia T4 GPU in under thirty minutes. That's for a small model (fine-tuning a 70B parameter model takes orders of magnitude more compute and time).

This is transfer learning. You take knowledge learned on one task and apply it to another. Someone paid for pretraining on billions of tokens. You take their model and continue training on your specific task. Fine-tuning is how you do it. The model already understands language. You teach it what positive and negative sentiment look like.

Terminology matters. Fine-tuning is not post-training. Post-training turns base models into assistants through instruction tuning and RLHF. Fine-tuning specializes pretrained models for specific downstream tasks.

Let's build it.

The Setup

We're using DistilBERT[0], a smaller and faster version of BERT[1] that retains 97% of its language understanding while being 40% smaller and 60% faster. The dataset is Stanford's IMDB reviews[2]: 25,000 training examples, 25,000 test examples, perfectly balanced between positive and negative sentiment. You could split the training set into train and validation for hyperparameter tuning, or use k-fold cross-validation[3] for more robust evaluation. We keep it simple here.
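
If you do want a held-out validation set, the datasets library makes the split a one-liner. A sketch using the ds object loaded in the Data Preparation section below (the rest of this walkthrough skips it and trains on the full 25,000 examples):

split = ds["train"].train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = split["train"], split["test"]  # 22,500 train / 2,500 validation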

First, imports and hyperparameters:

import torch

from tqdm.auto import tqdm
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import AdamW

MODEL_NAME = "distilbert-base-uncased"
BATCH_SIZE = 16
EPOCHS = 3
MAX_LEN = 256
LR = 2e-5

A batch size of 16 fits comfortably on an Nvidia T4 GPU with 16 GB of VRAM. A learning rate of 2e-5 works well for this task. Go too high and you destroy the pretrained weights; too low and training crawls. Both are hyperparameters you tune based on your task, dataset size, and how much the target domain differs from the pretraining data.

Data Preparation

Load the IMDB dataset and tokenize it using HuggingFace's transformers library[4]. The tokenizer converts text into input IDs that the model understands. Truncation caps every sequence at 256 tokens; padding each review to that same length gives every example an identical shape, so the default DataLoader collation can stack them into batches.

ds = load_dataset("stanfordnlp/imdb")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tok(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=MAX_LEN
    )

ds = ds.map(tok, batched=True, remove_columns=["text"])
ds = ds.rename_column("label", "labels")
ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

train_loader = DataLoader(ds["train"], batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(ds["test"], batch_size=BATCH_SIZE)

The map function processes the entire dataset in batches, faster than looping through examples one at a time. We rename label to labels because that's what HuggingFace's models expect. Setting the format to torch returns PyTorch tensors instead of lists.
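
To build intuition for what the model actually receives, you can run the tokenizer on a toy sentence by hand. This is a side check, not part of the training script, and the exact tokens depend on the vocabulary:

sample = tokenizer("this movie was great.", truncation=True, max_length=MAX_LEN)
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))
# roughly: ['[CLS]', 'this', 'movie', 'was', 'great', '.', '[SEP]']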

Model and Optimizer

Load the pretrained model and add a classification head. DistilBERT's encoder layers output contextualized representations; the classification head is a small feedforward network, randomly initialized, that maps these representations to two classes: positive and negative. The library warns that some weights are newly initialized when you load the model this way. That's expected: those are exactly the weights we're about to train.

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, 
    num_labels=2
).to(device)

optim = AdamW(model.parameters(), lr=LR)

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optim,
    T_max=EPOCHS,
    eta_min=0.0
)

AdamW[5] is Adam with decoupled weight decay. It's the standard optimizer for transformers. The cosine annealing scheduler[6] gradually reduces the learning rate from 2e-5 to near zero over three epochs. This helps the model converge without overshooting at the end of training.
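
For intuition about what the scheduler does, you can evaluate the standard cosine annealing formula by hand. This snippet is purely illustrative and mirrors the values scheduler.get_last_lr() reports at each epoch boundary:

import math

# eta_min is 0.0 here, so the closed form simplifies to LR * (1 + cos(pi * t / T_max)) / 2
for t in range(EPOCHS + 1):
    lr = LR * (1 + math.cos(math.pi * t / EPOCHS)) / 2
    print(f"lr after {t} scheduler step(s): {lr:.2e}")
# lr after 0 scheduler step(s): 2.00e-05
# lr after 1 scheduler step(s): 1.50e-05
# lr after 2 scheduler step(s): 5.00e-06
# lr after 3 scheduler step(s): 0.00e+00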

Validation Function

Before training, define how we measure performance. Accuracy is sufficient here: what percentage of reviews did we classify correctly?

def validate(model, val_loader, device):
    model.eval()
    correct = 0
    total = 0
    
    with torch.no_grad():
        for batch in val_loader:
            labels = batch["labels"].to(device)
            inputs = {k: v.to(device) for k, v in batch.items() if k != "labels"}
            logits = model(**inputs).logits
            preds = logits.argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    
    return correct / total

Setting model.eval() switches the model to inference mode, which disables dropout (DistilBERT has no batch normalization, but eval mode would also freeze those statistics in models that do). The torch.no_grad() context prevents gradient computation, saving memory. We pass inputs through the model, take the argmax of the logits to get predictions, and count how many match the true labels.
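
One optional sanity check before training: run validate on the untrained model. The classification head starts from random weights, so accuracy should land near 50%, chance level on a balanced two-class test set:

baseline = validate(model, test_loader, device)
print(f"accuracy before fine-tuning: {baseline:.4f}")  # expect roughly 0.50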

Training Loop

Three epochs. Each epoch runs through all 25,000 training examples, computes loss, backpropagates gradients, and updates weights. After each epoch, validate on the test set.

for epoch in tqdm(range(1, EPOCHS + 1), desc="Epochs"):
    model.train()
    total_loss = 0.0

    batch_pbar = tqdm(train_loader, desc=f"Train (epoch {epoch}/{EPOCHS})", leave=False)
    for batch in batch_pbar:
        batch = {k: v.to(device) for k, v in batch.items()}
        out = model(**batch)
        loss = out.loss

        optim.zero_grad(set_to_none=True)
        loss.backward()
        optim.step()

        total_loss += loss.item()
        batch_pbar.set_postfix(loss=f"{loss.item():.4f}")

    scheduler.step()

    avg_loss = total_loss / len(train_loader)
    test_acc = validate(model, test_loader, device)
    lr_now = scheduler.get_last_lr()[0]

    tqdm.write(
        f"Epoch {epoch}/{EPOCHS} | "
        f"loss={avg_loss:.4f} | "
        f"test_acc={test_acc:.4f} | "
        f"lr={lr_now:.6g}"
    )

Each batch flows through the model, producing logits and a loss. HuggingFace models compute cross-entropy loss internally when you pass labels. We zero gradients with set_to_none=True, which replaces the gradient tensors with None rather than zero-filling them; it's the default in PyTorch 2.x and slightly faster. Then backprop, optimizer step, repeat.
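
If you want to convince yourself that the internal loss really is cross-entropy over the two logits, you can recompute it manually inside the loop. A debugging sketch, not something to keep in the final script:

import torch.nn.functional as F

manual_loss = F.cross_entropy(out.logits, batch["labels"])
assert torch.allclose(manual_loss, out.loss)  # should match up to floating-point noise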

After all batches, the scheduler reduces the learning rate. We validate, print metrics, and move to the next epoch.

What You Get

After three epochs, expect around 91-93% test accuracy depending on random seed and initialization. The first epoch typically hits 88-91%, then small incremental improvements. Training time on a T4 GPU is roughly 30 minutes.
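
From here the model is ready for inference and can be saved for later. A minimal sketch; the review text and output directory are placeholders:

model.eval()
review = "A clever premise wasted on a script that goes nowhere."
enc = tokenizer(review, truncation=True, max_length=MAX_LEN, return_tensors="pt").to(device)
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1).item()
print("positive" if pred == 1 else "negative")  # IMDB uses 1 = positive, 0 = negative

model.save_pretrained("imdb-distilbert")
tokenizer.save_pretrained("imdb-distilbert")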

Why This Matters

Ten years ago, this would have required a PhD, a research lab, and weeks of compute. Now it's fifty lines and thirty minutes. Transfer learning collapsed the barrier between having an idea and testing if it works. The hard part isn't getting 91%+ accuracy. It's knowing whether that solves your problem or just automates the wrong thing faster.