Hi,
I am running the script below (which sets the manual seed to 1 for both CPU and GPU), but it does not give me reproducible results on the GPU (on the CPU it works fine). Is there a known issue, or am I missing something?
torch.manual_seed(args.seed)
if args.cuda:
    torch.cuda.manual_seed(args.seed)
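For reference, a fuller seeding block might look like the following (the random and numpy lines are an assumption beyond my actual script, for code paths that draw from those generators):

import random
import numpy as np
import torch

random.seed(args.seed)        # Python's built-in RNG
np.random.seed(args.seed)     # NumPy RNG
torch.manual_seed(args.seed)  # torch CPU RNG
if args.cuda:
    torch.cuda.manual_seed_all(args.seed)  # seed every visible GPU, not just the current one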
I am also currently facing a similar reproducibility issue with PyTorch.
As suggested, I tried disabling cuDNN, but that slows execution down by 5-10x.
Edit:
With cuDNN disabled, I am still not able to reproduce the results. I found out that the issue is caused by the torchvision transforms. When I disable the following lines in the code,
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
then I am able to reproduce the same results. Setting the seed using random.seed() doesn't work either. Is this behavior expected, or am I doing something wrong?
I tried that workaround in the simple case of the cifar10 tutorial: http://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
after adding only the flip transform line:
transforms.RandomHorizontalFlip(),
I also inserted a seed initialization:
torch.manual_seed(0)
and I even disabled shuffling.
I am still not able to reproduce the results. Any other ideas?
It seems that transforms.RandomCrop() and transforms.RandomHorizontalFlip() use the "plain" Python seed, not the torch one. I fixed the reproducibility issue with:
import random
random.seed(0)
Reproducibility is preserved with shuffling enabled, but not with num_workers > 0.
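A sketch of how the workers can be seeded, assuming a recent PyTorch where DataLoader accepts the worker_init_fn and generator arguments (seed_worker is my own helper name, and trainset stands in for the dataset above):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # derive each worker's seed from the base seed of the generator below
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

trainloader = DataLoader(trainset, batch_size=4, shuffle=True,
                         num_workers=2, worker_init_fn=seed_worker, generator=g)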
I got the same problem.
My model was trained with 10-fold cross-validation. After each fold finished training, I tried to save all the random states that may contribute to reproducibility (a sketch of what I save is shown below).
Then a new model was initialized to train on the next fold. In the end there are 10 results, one per fold.
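A minimal sketch of the save/restore helpers (save_states and load_states are my own names; they cover the Python, NumPy, torch CPU, and CUDA generators):

import random
import numpy as np
import torch

def save_states(path):
    torch.save({
        'python': random.getstate(),
        'numpy': np.random.get_state(),
        'torch': torch.get_rng_state(),
        'cuda': torch.cuda.get_rng_state_all(),  # one entry per GPU
    }, path)

def load_states(path):
    states = torch.load(path)
    random.setstate(states['python'])
    np.random.set_state(states['numpy'])
    torch.set_rng_state(states['torch'])
    torch.cuda.set_rng_state_all(states['cuda'])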
I can reproduce the same results if I start the program from the very beginning (from the first fold).
But when I tried to reproduce the 7th-fold result by only loading the 6th state saved during the previous training, the result was different from the one obtained during that training.
I tried disabling cuDNN and setting the deterministic flags, but I still did not manage to reproduce it.
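For concreteness, the settings in question are presumably the usual cuDNN determinism flags:

# either disable cuDNN entirely:
torch.backends.cudnn.enabled = False
# or keep it but force deterministic algorithms:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # disable algorithm auto-tuning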
However, the results obtained after loading the 6th state are consistent across runs,
and by loading the states on the CPU I can reproduce the same result as a run started from the very beginning.
You might want to check the reproducibility section of the doc.
In particular, there are a few operations that are inherently non-deterministic, so you won't be able to get reproducible results if you use them.
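On recent PyTorch versions (roughly 1.8 onward) you can also ask for an error whenever such an operation is hit, which makes the offending ops easy to find:

import torch

torch.use_deterministic_algorithms(True)  # non-deterministic ops now raise a RuntimeError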
I'm facing a weird situation where, on consecutive runs of a script across multiple GPUs, the results are reproduced, but running the same script a few days later now gives different scores, which again seem reproducible in the short term.