Different behavior for torch.manual_seed() when used globally vs. when used in a DataLoader

# torch.manual_seed(seed)  # seeding "globally" (commented out here)

tr_data_setup = DS(features, labels.reshape(-1,1))  # DS is my Dataset subclass (defined below)
tr_dataloader = DL(tr_data_setup, batch_size=64, shuffle=True, generator=torch.manual_seed(seed))  # DL = DataLoader

xx, yy = next(iter(tr_dataloader))

Why do I get two different results even though my seed is the same?

How are you creating the input data? Are you randomly sampling it or creating it in a deterministic manner?
Could you post a code snippet to reproduce this issue and explain your use case a bit more?

The data is deterministic. Features and Labels are taken from a csv file. As for the use case, I’m working on seed averaging a bunch of models on different splits of the data and would like the training results to be reproducible. I first realized something might be amiss since nn.BCELoss() would trigger a CUDA error if I didn’t explicitly set the generator.
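For context, the kind of pipeline I have in mind looks roughly like this (just a simplified sketch; run_experiment and the seed values are placeholders, not my actual code):

import torch

def run_experiment(seed):
    # placeholder: seed everything a single run depends on
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # ... split the data, build the model, train, return metrics ...

# "seed averaging": repeat the same pipeline over several seeds/splits
results = [run_experiment(seed) for seed in (0, 1, 2, 3, 4)]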

The snippet below is taken from a test performed on “mnist_train_small.csv” provided in Colab which gave differing results as well.

import torch
import pandas as pd
from torch.utils.data import Dataset, DataLoader as DL

class DS(Dataset):
    def __init__(self, X=None, y=None, mode="train"):
        self.mode = mode
        self.X = X
        if mode == "train":
            self.y = y

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        if self.mode == "train":
            return torch.FloatTensor(self.X[idx]), torch.LongTensor(self.y[idx])
        else:
            return torch.FloatTensor(self.X[idx])

data = pd.read_csv("/content/sample_data/mnist_train_small.csv")

print(data.head(5)) # First 5 labels 5, 7, 9, 5, 2

X = data.iloc[:, 1:].copy().values
y = data.iloc[:, 0].copy().values
dl_setup = DS(X, y.reshape(-1,1))

tY = []
for i in range(5):
    tx, ty = dl_setup[i]  # equivalent to dl_setup.__getitem__(i)
    tY.append(ty)

print(tY) # 5, 7, 9, 5, 2

# Case 1: no shuffling -- batch order follows the csv
dl = DL(dl_setup, batch_size=16, shuffle=False)

xx, yy = next(iter(dl))

print(yy[:5]) #5, 7, 9, 5, 2

# Case 2: seed set "globally"
torch.manual_seed(0)
dl = DL(dl_setup, batch_size=16, shuffle=True)

xx, yy = next(iter(dl))

print(yy[:5]) #3, 7, 7, 8, 3

# Case 3: seed passed via the generator argument
dl = DL(dl_setup, batch_size=16, shuffle=True, generator=torch.manual_seed(0))

xx, yy = next(iter(dl))

print(yy[:5]) #0, 3, 0, 7, 9

Hi @ptrblck. I’ve provided the snippets as asked.

Hey Prashanth, let me try to explain the concepts of Dataset and DataLoader in PyTorch as I understand them.
A Dataset is the whole set of data points (samples) you want to iterate through during training; it defines the logic for accessing the sample at some index idx. A DataLoader is a helper, let's say from the smart folks at PyTorch, that iterates through the data in batches of some batch_size and does a lot of work internally to make it fast and efficient.

When you call next on an iterator in Python you are basically looping through it: each next produces the next item from the iterable, the same way for sample in samples: does.
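
For example, a tiny toy snippet (separate from your data, just to show what next does):

import torch
from torch.utils.data import TensorDataset, DataLoader

ds = TensorDataset(torch.arange(8))
dl = DataLoader(ds, batch_size=4)

it = iter(dl)          # build an iterator over the dataloader
print(next(it))        # first batch:  [tensor([0, 1, 2, 3])]
print(next(it))        # second batch: [tensor([4, 5, 6, 7])]

# next(iter(dl)) therefore only ever shows you the very first batch,
# the same thing you'd get by breaking out of a for loop immediately:
for batch in dl:
    print(batch)       # [tensor([0, 1, 2, 3])]
    break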

So when iterating through a DataLoader, it gives you the next batch on every iteration, since the idea is that the model should see all of the data during an epoch. What's more, shuffle=True makes sure the data is reshuffled before every epoch, so the batches come in a different order in the second epoch even if you set the seed manually.

As for reproducibility, a DataLoader instantiated a second time with the same seed (say, for the next experiment) will give you the same sequence of samples as the first instance did.

Consider these examples.
Setup:

import torch
from torch.utils.data import TensorDataset, DataLoader
t = torch.arange(100)
ds = TensorDataset(t)

First I iterate through dataloader for 2 “epochs”:

dl = DataLoader(ds, batch_size=16, shuffle=True, generator=torch.manual_seed(0))
# iterate through dataloader first time, print first 3 yy's
print("first run through dl")
for i, yy in enumerate(dl):
    if i < 3:
        print(yy[0][:5])

# iterate through dataloader second time, print first 3 yy's
print("second run through dl")
for i, yy in enumerate(dl):
    if i < 3:
        print(yy[0][:5])

It prints this:

first run through dl
tensor([33, 70, 17, 63, 71])
tensor([90, 64, 11, 30, 91])
tensor([43, 31, 92, 94, 19])
second run through dl
tensor([15,  9, 50, 34, 51])
tensor([82, 70, 73, 13, 57])
tensor([89, 23, 36, 55, 84])

In the next snippet I instantiate dataloader two times:

dl = DataLoader(ds, batch_size=16, shuffle=True, generator=torch.manual_seed(0))
# iterate through dataloader first time, print first 3 yy's
print("first run through dl")
for i, yy in enumerate(dl):
    if i < 3:
        print(yy[0][:5])

# instantiate for next experiment
dl = DataLoader(ds, batch_size=16, shuffle=True, generator=torch.manual_seed(0))
# iterate through dataloader second time, print first 3 yy's
print("secondrun through dl")
for i, yy in enumerate(dl):
    if i < 3:
        print(yy[0][:5])

and it prints this:

first run through dl
tensor([33, 70, 17, 63, 71])
tensor([90, 64, 11, 30, 91])
tensor([43, 31, 92, 94, 19])
second run through dl
tensor([33, 70, 17, 63, 71])
tensor([90, 64, 11, 30, 91])
tensor([43, 31, 92, 94, 19])

It is a lengthy post, I know :) But I hope it helps!

Thank you for the reply. I've run your code snippets below:

import torch
from torch.utils.data import TensorDataset, DataLoader

t = torch.arange(100)
ds = TensorDataset(t)

# Setting the seed explicitly via the DataLoader's generator argument
dl = DataLoader(ds, batch_size=16, shuffle=True, generator=torch.manual_seed(0))

print("first run through dl")
for i, yy in enumerate(dl):
    if i < 3:
        print(yy[0][:5])

print("second run through dl")
for i, yy in enumerate(dl):
    if i < 3:
        print(yy[0][:5])

which prints:

first run through dl
tensor([33, 70, 17, 63, 71])
tensor([90, 64, 11, 30, 91])
tensor([43, 31, 92, 94, 19])
second run through dl
tensor([15,  9, 50, 34, 51])
tensor([82, 70, 73, 13, 57])
tensor([89, 23, 36, 55, 84])

# Setting the seed "globally"
torch.manual_seed(0)
dl = DataLoader(ds, batch_size=16, shuffle=True)
print("first run through dl")
for i, yy in enumerate(dl):
    if i < 3:
        print(yy[0][:5])

print("second run through dl")
for i, yy in enumerate(dl):
    if i < 3:
        print(yy[0][:5])

which prints:

first run through dl
tensor([63, 70, 43, 75, 77])
tensor([78,  6, 23, 66, 44])
tensor([31, 84, 24, 73, 54])
second run through dl
tensor([71, 54, 40, 70, 80])
tensor([29, 90, 96, 56, 89])
tensor([92, 53, 41, 60, 78])

Both results are reproducible. If I understand correctly, the two cases give different batches because, even though the same seed is involved, the samples are drawn from different generator instances, and each of those instances is itself reproducible.
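
Digging into torch/utils/data/sampler.py a little, my reading (which may be off, and the implementation may change between PyTorch versions) is that RandomSampler does roughly the following when shuffle=True, which would explain why the two cases pick different permutations even though both start from seed 0:

import torch

torch.manual_seed(0)

# Case A: no generator passed -- the sampler builds its OWN generator and
# seeds it with a value drawn from the default (global) RNG, roughly:
seed = int(torch.empty((), dtype=torch.int64).random_().item())
own_gen = torch.Generator()
own_gen.manual_seed(seed)
print(torch.randperm(10, generator=own_gen)[:5])

# Case B: generator=torch.manual_seed(0) -- the sampler shuffles with the
# default generator directly, so the permutation is different:
default_gen = torch.manual_seed(0)
print(torch.randperm(10, generator=default_gen)[:5])

Both paths are deterministic; they just consume the RNG differently.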

Edit:
Played around a bit. That explains it. Thank you.

torch.manual_seed(0)
x = torch.randn(5)
print(x)
print(torch.randn(5))
print(torch.randn(5))
# Reproducible as a sequence, but successive calls return different tensors:

tensor([ 1.5410, -0.2934, -2.1788,  0.5684, -1.0845])
tensor([-1.3986,  0.4033,  0.8380, -0.7193, -0.4033])
tensor([-0.5966,  0.1820, -0.8567,  1.1006, -1.0712])

torch.manual_seed(0)
x = torch.randn(5)
print(x)

torch.manual_seed(0)
print(torch.randn(5))

torch.manual_seed(0)
print(torch.randn(5))
# Reproducible: re-seeding before every call returns the same tensor each time

tensor([ 1.5410, -0.2934, -2.1788,  0.5684, -1.0845])
tensor([ 1.5410, -0.2934, -2.1788,  0.5684, -1.0845])
tensor([ 1.5410, -0.2934, -2.1788,  0.5684, -1.0845])
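
For my actual experiments I'll probably pass a dedicated torch.Generator to the DataLoader instead of reusing the default one returned by torch.manual_seed, something like this sketch:

import torch
from torch.utils.data import TensorDataset, DataLoader

seed = 0
ds = TensorDataset(torch.arange(100))

# a dedicated generator keeps the shuffling order independent of any other
# torch.rand*/randn calls that consume the default RNG during training
g = torch.Generator()
g.manual_seed(seed)
dl = DataLoader(ds, batch_size=16, shuffle=True, generator=g)
print(next(iter(dl))[0][:5])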