Why does iterating over a deterministic dataloader change the seed state?

By switching eval_testset = False to eval_testset = True in the code below (i.e., toggling whether we iterate over a deterministic dataloader with shuffle=False and fixed transforms), the program produces different results even with a fixed seed:

import torch
import numpy as np
import random
import torchvision
import torchvision.transforms as transforms

def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=32,
                                         shuffle=False, num_workers=2)

set_seed(42)
eval_testset = True

for epoch in range(10):
    if epoch >= 2:
        break # for the purpose of saving time

    for i, data in enumerate(trainloader):
        inputs, labels = data
        # train the model ...
        if i == 0:
            print("eval_testset=%s, epoch=%d, iter=%d, loaded labels=%s"%(eval_testset, epoch, i, labels.tolist()))

    if eval_testset:
        with torch.no_grad():
            for data in testloader:
                images, labels = data
                # test the model ...

Output with eval_testset = False:

eval_testset=False, epoch=0, iter=0, loaded labels=[6, 0, 4, 1, 2, 7, 9, 4, 7, 8, 4, 5, 6, 0, 4, 2, 0, 1, 6, 1, 4, 3, 2, 3, 2, 4, 0, 7, 5, 1, 8, 6]
eval_testset=False, epoch=1, iter=0, loaded labels=[8, 7, 1, 6, 2, 3, 0, 7, 7, 2, 9, 8, 8, 7, 5, 9, 0, 6, 9, 6, 7, 5, 9, 3, 7, 8, 9, 4, 4, 4, 0, 7]

Output with eval_testset = True:

eval_testset=True, epoch=0, iter=0, loaded labels=[6, 0, 4, 1, 2, 7, 9, 4, 7, 8, 4, 5, 6, 0, 4, 2, 0, 1, 6, 1, 4, 3, 2, 3, 2, 4, 0, 7, 5, 1, 8, 6]
eval_testset=True, epoch=1, iter=0, loaded labels=[9, 5, 7, 2, 2, 3, 3, 6, 2, 4, 2, 4, 1, 4, 1, 1, 6, 9, 7, 6, 9, 4, 8, 1, 3, 4, 7, 9, 3, 9, 3, 9]

I am wondering how a deterministic dataloader can change the seed state of the code.
(Environment: tested on PyTorch 1.13.1 and 2.0.1.)

Hi Linglan!

I can reproduce your issue (running pytorch version 2.2.1 on both the
cpu and the gpu).

I’ve narrowed it down to the operation that creates an iterator (in my case a
_SingleProcessDataLoaderIter) from a DataLoader. Here is a script
that shows that merely creating the iterator advances the state of the
pseudo-random-number generator (while iterating over the iterator does
not seem to advance the rng any further):

import torch
print (torch.__version__)

print ('reset seed ...')
_ = torch.manual_seed (2024)
print ('first eight rands ...')
print ('torch.rand (8):', torch.rand (8))

print ('reset seed ...')
_ = torch.manual_seed (2024)
print ('create tensors, Dataset, and Dataloader ...')

inp = torch.arange (4.0)
trg = torch.arange (4)
ds = torch.utils.data.TensorDataset (inp, trg)
dl = torch.utils.data.DataLoader (ds, batch_size = 2)

print ('no rands consumed, still the same first eight rands ...')
print ('torch.rand (8):', torch.rand (8))

print ('reset seed ...')
_ = torch.manual_seed (2024)
print ('create iterator from DataLoader ...')

itdl = iter (dl)

print ('two rands consumed, rands start with third of the first eight ...')
print ('torch.rand (8):', torch.rand (8))

print ('reset seed ...')
_ = torch.manual_seed (2024)
print ('extract two batches from DataLoader iterator ...')

print ('next (itdl) =', next (itdl))
print ('next (itdl) =', next (itdl))

print ('no rands consumed, get first eight rands again ...')
print ('torch.rand (8):', torch.rand (8))

And here is its output:

2.2.1
reset seed ...
first eight rands ...
torch.rand (8): tensor([0.5317, 0.8313, 0.9718, 0.1193, 0.1669, 0.3495, 0.2150, 0.6201])
reset seed ...
create tensors, Dataset, and Dataloader ...
no rands consumed, still the same first eight rands ...
torch.rand (8): tensor([0.5317, 0.8313, 0.9718, 0.1193, 0.1669, 0.3495, 0.2150, 0.6201])
reset seed ...
create iterator from DataLoader ...
two rands consumed, rands start with third of the first eight ...
torch.rand (8): tensor([0.9718, 0.1193, 0.1669, 0.3495, 0.2150, 0.6201, 0.4849, 0.7492])
reset seed ...
extract two batches from DataLoader iterator ...
next (itdl) = [tensor([0., 1.]), tensor([0, 1])]
next (itdl) = [tensor([2., 3.]), tensor([2, 3])]
no rands consumed, get first eight rands again ...
torch.rand (8): tensor([0.5317, 0.8313, 0.9718, 0.1193, 0.1669, 0.3495, 0.2150, 0.6201])

I can’t say that this behavior actually violates anything in the pytorch
documentation, but, even so, I would say that it’s sufficiently odd and
unexpected to count as a bug (and is seemingly completely unnecessary).

It might make sense for you to file a github issue about this.

Best.

K. Frank


Creating the base_seed should be responsible for this behavior as seen here.
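
For reference, here is a minimal sketch (not the actual library code) of the effect: when no generator= argument is passed to the DataLoader, the base seed is drawn as a single 64-bit value from the global default generator, which appears to account for the two-value offset seen in K. Frank’s output above.

import torch

torch.manual_seed(2024)
print(torch.rand(8))            # the "first eight rands"

torch.manual_seed(2024)
# roughly what creating the DataLoader iterator does internally:
base_seed = torch.empty((), dtype=torch.int64).random_().item()
print(torch.rand(8))            # should now start with the third of the first eight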


Thank you for your reply!

Thanks for your reply!

Hi @ptrblck!

This indeed looks like the cause.

It’s hardly a big deal, but still, why consume some random numbers
unnecessarily?

It appears that _base_seed exists strictly for multiprocessing, and _BaseDataLoaderIter
already knows whether or not it will be used for multiprocessing (for example,
through its _num_workers property), so the base-seed creation could presumably be
skipped in the single-process case.

Or, probably better to my mind, why not push the initialization of _base_seed
down into _MultiProcessingDataLoaderIter, which contains the (only?) code
that actually uses _base_seed?

I can’t say that this is really a major issue, but it was unexpected enough to
catch at least one pytorch user by surprise.

As an aside, in Linglan’s case, he would still see his issue even if the problem
were “fixed,” because he is explicitly asking for multiprocessing with
num_workers = 2. And even though there is not actually any randomness in
Linglan’s testloader, I don’t see any practical way for _BaseDataLoaderIter
(or _MultiProcessingDataLoaderIter) to know that, because randomness could be
hiding in some transform or collate_fn code.
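
For what it’s worth, one user-side mitigation (just a sketch, not an official fix) is to give the test DataLoader its own torch.Generator. The base seed is then drawn from that private generator instead of from the global default generator, so creating the test iterator no longer perturbs the RNG stream that the shuffled trainloader (and the rest of the program) depends on:

import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)

# dedicated generator for the test loader; 12345 is an arbitrary fixed value
test_gen = torch.Generator()
test_gen.manual_seed(12345)

testloader = torch.utils.data.DataLoader(testset, batch_size=32,
                                         shuffle=False, num_workers=2,
                                         generator=test_gen)

Alternatively, one could save torch.get_rng_state() just before the evaluation loop and restore it with torch.set_rng_state() immediately afterwards.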

Best.

K. Frank

I agree with your point about avoiding seeding code when it’s not really needed, as it can indeed create unexpected issues.
I also agree with your suggestion to create an issue on GitHub to discuss this.

Thank you both again for the valuable discussion. I have (re)opened an issue on GitHub to discuss it.
