Resuming training leads to loss peak on different machine

Hi there,
I am training an image classification network. When I train it for, say, 10 epochs, the train loss smoothly decreases as I would expect. However, when I save that model and load it on a different machine (but same hardware and software versions) and resume training for another 5 epochs, I get a massive peak in my train loss. If I load the same model on the same machine as the first training, the loss continues to decrease just smoothly.
Mysteriously, the resulting networks (always) perform significantly better on validation and test data when stopped/resumed on a different machine! (Compared to letting it train for the same number of epochs on one machine). This can be seen in the image below where I trained 50 epochs and resumed for another 25 epochs on another machine. Training loss returns to approx where it was but validation quality increases a lot!
Any ideas what could cause this? (And how to do this on purpose)

Details:

  • I set torch.backends.cudnn.benchmark = False, called torch.use_deterministic_algorithms(True), and set the env var CUBLAS_WORKSPACE_CONFIG=:4096:8 (see the snippet after this list)
  • I did not set seeds, but I do not think that missing seeding explains different behavior on different machines
  • PyTorch versions are identical (same Docker image)
  • Resuming on the same machine invokes the same code in a fresh Docker container, just as when resuming on the other machine
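
For completeness, the determinism settings from the first bullet look roughly like this (just a minimal sketch of the relevant lines):

```python
import os
import torch

# Determinism settings used during training; CUBLAS_WORKSPACE_CONFIG
# must be set before the first CUDA/cuBLAS kernels are launched.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)
```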

I would recommend checking the data loading, as these issues are often caused by different dataset splits etc.
If that’s not the case, use a static input (e.g. torch.ones) and compare the output of the model on both systems after loading the state_dict and calling model.eval(). If these outputs differ, something seems to fail while loading the state_dict.
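
In code, that check could look roughly like this (MyModel and the checkpoint path are placeholders, and the input shape is just an example for an image model):

```python
import torch

# Restore the checkpoint and run a constant input through the model;
# the printed value should match exactly on both machines.
model = MyModel()  # hypothetical model class
model.load_state_dict(torch.load("checkpoint.pt", map_location="cpu"))
model.eval()

with torch.no_grad():
    out = model(torch.ones(1, 3, 224, 224))  # static input, e.g. torch.ones

print(out.sum().item())  # compare this value across both systems
```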


Wow, that was spot on! Thank you very much!

TL;DR: I have managed to make myself dependent on the order of files given by glob, and this order changes between machines.

I first thought, c’mon, of course I checked my Dataset and Sampler classes, they are fine. But when I printed out every single sample that was drawn from it, I realized there were differences between machines. It took me quite a while to find out that the issue was indeed in the dataset splitting. I am creating a list of all images on disk, shuffle them with a seed and split them into train validation and test. Which worked perfectly fine until I started doing trainings on different machines. Apparently, glob returns files the same order on the same machine and in a different order on different machines. Re-doing the data split in the middle of training of course leads to far too good validation performance, as some of the validation samples have been in the training set before.