Results vary strongly when changing torch.manual_seed() for a small system

Hi, I am trying to reproduce a NN in Python, using PyTorch. In my code I set np.random.seed(3) and torch.manual_seed(3) (the NumPy seed matches the one used for the NN without PyTorch; the torch seed can be whatever value).
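
For completeness, the seeding in my script boils down to something like this (`set_seed` is just a helper name I made up; the cuDNN flags are only relevant for bitwise reproducibility on GPU):

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed the RNGs a typical PyTorch run touches."""
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy (data shuffling, augmentation, ...)
    torch.manual_seed(seed)  # torch CPU RNG; recent versions also seed CUDA
    # For bitwise reproducibility on GPU you may additionally need:
    # torch.backends.cudnn.deterministic = True
    # torch.backends.cudnn.benchmark = False

set_seed(3)
```
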
I know the results differ with different initial seeds because the optimizer starts from a different point, but the differences are huge. If I choose torch.manual_seed(1), the train and test accuracies are 65% and 34%, respectively. If I choose torch.manual_seed(3), they are instead 97% and 68%. My dataset consists of a training set of only 209 images and a test set of 50 images. I am using SGD instead of gradient descent. Is there any reason to think that gradient descent would be less sensitive to the initial conditions than SGD, i.e. that different seeds would give similar accuracies?
I would expect strong differences if I were working with a huge dataset such as CIFAR-10, but not with a small dataset like the one described. Am I wrong in the logic I follow?

The loss curve of GD is usually less noisy, and the loss should decrease at each step.
SGD, on the other hand, gives you more noise, but the final accuracy is often better than with GD. One might claim that the noisy updates add some regularization, but I'm not sure what the current theory is.
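
To make that concrete: the two regimes differ only in the batch size. A minimal sketch with random stand-in data at roughly your dataset size (your model and hyperparameters would of course differ):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data, roughly your shapes (209 training images)
X = torch.randn(209, 3, 64, 64)
y = torch.randint(0, 2, (209,))
dataset = TensorDataset(X, y)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 2))
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# batch_size=len(dataset) -> full-batch GD: one smooth step per epoch
# batch_size=32           -> minibatch SGD: noisy steps, often better generalization
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(10):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```
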

If different random seeds give such different results, your overall training is unstable, which is a bad sign.
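
A quick way to quantify that instability is to sweep a few seeds and look at the spread (a sketch; `train_and_evaluate` is a hypothetical placeholder for your actual training routine):

```python
import numpy as np
import torch

def train_and_evaluate(seed: int) -> float:
    """Hypothetical placeholder: seed everything, train, return test accuracy."""
    torch.manual_seed(seed)
    np.random.seed(seed)
    # ... build the model, train, evaluate on the test set ...
    return 0.0

accuracies = [train_and_evaluate(seed) for seed in range(5)]
print(f"test accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
```
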

I would expect it the other way around. In my experience, the more data you have, the less likely you are to get trapped in a local minimum.
E.g., if you are dealing with the XOR problem, you might easily get stuck in a local minimum, and the seed might decide whether your model trains fine or not (see the sketch below).
If you are training on ImageNet, I doubt that the seeds will make a huge difference. The training success would most likely be determined by the overall training setup, model architecture, augmentation, etc.
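
Here is a quick sketch of the XOR case (a tiny tanh MLP with only two hidden units; across seeds, some runs typically converge while others stall at a loss near 0.25):

```python
import torch

# XOR inputs and targets
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

for seed in range(5):
    torch.manual_seed(seed)
    model = torch.nn.Sequential(
        torch.nn.Linear(2, 2), torch.nn.Tanh(), torch.nn.Linear(2, 1)
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
    criterion = torch.nn.MSELoss()
    for _ in range(2000):
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
    # A stalled run sits near loss 0.25 (predicting ~0.5 for every input)
    print(f"seed {seed}: final loss {loss.item():.4f}")
```
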

Thanks for your answer. I still need to develop intuition in this area. In computational chemistry we also use GD and SGD, and the sensitivity to initial conditions increases with the number of atoms, due to the presence of more degrees of freedom. So my line of thinking was that a binary classification of cat vs. non-cat would be less affected than a ten-class classification such as CIFAR-10. But I see your point, and the role of the amount of data.