Why is PyTorch faster after the first training run (starting loss also lower)?

I notice that PyTorch trains slowly the first time I train the model, but the next time the speed improves significantly. This makes me think that PyTorch does some caching after the first training run to speed up inference or data loading.

One important thing I notice is that the starting loss of a later training run is much lower than the starting loss of the first run. So I wonder if PyTorch does something with the weight initialization?

I hope somebody can explain this phenomenon to me so I don’t have to worry about my training code.

When you say “the next time”, do you mean the next epoch? That would explain exactly what you describe (faster computation, lower loss)…

After the first epoch, if your dataset fits in memory, the loading times will be much faster on Linux because of the operating system’s file cache.

However, if you are talking about launching a new experiment with the same script, there is no justification for the much lower loss or faster loading times that I can think of…

Yeah, I’m talking about launching a new experiment with the same script. It’s weird because I have observed this phenomenon many times. Sometimes the starting loss is not lower, but sometimes it is. The decrease is so significant that I don’t think random weight initialization is the cause. The only thing I can think of is that PyTorch is saving some state after the last training session and loading it into the new one. But that seems impossible.
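
For reference, one way I could rule out initialization luck would be to seed every RNG at the top of the script, so two runs of the same script start from identical weights. A minimal sketch (the model itself is a placeholder):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 0):
    # Fix every RNG that affects weight initialization and shuffling.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(0)
# model = MyModel()  # placeholder: identical initial weights on every run
```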

I don’t know much about PyTorch’s internal behaviour, but I’ve never observed this. Even if you specifically load the previous model in the code, it doesn’t explain the big change in initial performance.

One thing that could explain it is that your dataset has both very easy and very hard samples, and the dataset is shuffled at each training session. In some runs the starting batches could consist mostly of hard samples, and in other runs mostly of easy samples… Do you use shuffle in your DataLoader? If so, could you try several experiments without shuffle (just stop them after a few batches)? That would test this hypothesis, e.g. with something like the sketch below.
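
A minimal sketch of that check, using a toy dataset and model as stand-ins for your own:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins; in practice use your own dataset, model and loss.
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# shuffle=False gives every experiment the same fixed batch order, so
# differences in the starting loss can no longer come from which samples
# happen to land in the first batches.
loader = DataLoader(dataset, batch_size=32, shuffle=False)

for i, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets)
    print(f"batch {i}: loss {loss.item():.4f}")
    if i == 4:  # the first few batches are enough for the comparison
        break
```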


Looks like you’ve got it right. I think it is because the loss function I use is a squared-error loss, so if the first batches are hard batches the loss will be extremely high, and vice versa for easy batches. Also, my data is not balanced, so that may cause the oscillation in the starting loss.
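
As an illustration of how much a squared-error loss amplifies this (purely synthetic numbers, not from my actual data):

```python
import torch
from torch import nn

criterion = nn.MSELoss()

# "Easy" batch: predictions close to the targets; "hard" batch: far off.
easy_pred, easy_target = torch.full((32, 1), 0.1), torch.zeros(32, 1)
hard_pred, hard_target = torch.full((32, 1), 5.0), torch.zeros(32, 1)

print(criterion(easy_pred, easy_target).item())  # 0.01
print(criterion(hard_pred, hard_target).item())  # 25.0
```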


In my case, what happens is that PyTorch only uses one core at first, then starts using all the cores I have.
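
If it helps, you can check (and pin) the number of CPU threads PyTorch uses for intra-op parallelism; a short sketch, where the value 4 is just an example:

```python
import torch

# How many threads PyTorch currently uses for intra-op parallelism.
print(torch.get_num_threads())

# Optionally pin it to a fixed value so CPU usage is the same on every run.
torch.set_num_threads(4)
print(torch.get_num_threads())
```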