Why does the validation loss get 'lost' for a while when validating on the same dataset?

I have a model that I've been slowly hyperparameter-tuning on a large dataset (thousands of subjects), and it seems to learn to generalize to unseen data just fine.

But for fun, today I tried training on just 1 or 10 subjects and setting the validation set to be the same as the training set (the same 1 or 10 subjects), just to see if the model could memorize the data. Obviously, the training and validation losses should then be the same. However, that's not what I see. The two losses are identical before training, and for the first few epochs they both decrease, but they diverge. Then, around 10 epochs in, the validation loss reverses and quickly shoots up to its maximum value. Sometime around the first or second time the learning rate is halved (~100 epochs in), the validation loss begins to decrease again, eventually matching the training loss after a few hundred more epochs.

I can’t for the life of me figure out why this is occurring. I use the same loop for train/validation (with the optimizer.zero_grad(), loss.backward(), and optimizer.step() behind a conditional).
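For concreteness, the shared loop is roughly this shape (heavily simplified, with placeholder names; the `model.train(is_train)` toggle is the part I suspect matters):

```python
import torch

def run_epoch(model, loader, criterion, optimizer, is_train: bool):
    # One loop for both phases; only the optimizer calls sit behind the conditional.
    model.train(is_train)  # switches BatchNorm/Dropout between train and eval behaviour
    total_loss, n = 0.0, 0
    with torch.set_grad_enabled(is_train):
        for x, y in loader:
            out = model(x)
            loss = criterion(out, y)
            if is_train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * x.size(0)
            n += x.size(0)
    return total_loss / n
```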

Any pointers?

I suspect this is due to either BatchNorm or Dropout, with BatchNorm being the more likely culprit. Maybe it takes a very long time for the running stats to converge, or maybe the layers are changing so quickly that the running stats can't keep up.
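As a sanity check on that hypothesis, here is a tiny self-contained sketch of the effect I have in mind: with the default momentum of 0.1, the running stats are an exponential moving average that lags well behind the batch stats, so the eval-mode model is effectively a different model early in training. (The larger momentum and the `track_running_stats=False` experiment at the end are just things I plan to try, not confirmed fixes.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(8)           # running_mean starts at 0, running_var at 1
x = torch.randn(32, 8) * 5 + 3   # features far from those initial running stats

bn.train()
_ = bn(x)                        # running stats move only 10% toward the batch stats
print(bn.running_mean[:3])       # still close to 0 after one batch

bn.eval()
out_eval = bn(x)                 # normalized with the stale running stats
bn.train()
out_train = bn(x)                # normalized with the batch stats
print((out_eval - out_train).abs().max())  # large gap -> eval effectively sees a different model

# Two things to try: a larger momentum so the running stats move faster,
# or track_running_stats=False so eval uses batch stats as well.
bn_fast = nn.BatchNorm1d(8, momentum=0.5)
bn_batch_stats = nn.BatchNorm1d(8, track_running_stats=False)
```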