Loading same data but getting different results

I’m trying to split my training data into batches manually so that I can access each batch by index, rather than relying on a single DataLoader to do the batching for me (in which case I can’t index into the individual batches). So I tried the following:

import numpy as np
from torch.utils.data import DataLoader, Subset
from torchvision import datasets

train_data = datasets.ANY(root='data', transform=T_train, download=True)
BS = 200
num_batches = len(train_data) // BS
sequence = list(range(len(train_data)))
np.random.shuffle(sequence)  # Shuffle the sample indices
subsets = [Subset(train_data, sequence[i * BS: (i + 1) * BS]) for i in range(num_batches)]
train_loader = [DataLoader(sub, batch_size=BS) for sub in subsets]  # One single-batch loader per subset of BS samples
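For context, here is a minimal runnable sketch of this approach, with a synthetic TensorDataset standing in for datasets.ANY (the dataset, image shape, and sizes here are placeholders, not the original data):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Synthetic stand-in for datasets.ANY: 1000 fake 3x8x8 "images" with labels.
train_data = TensorDataset(torch.randn(1000, 3, 8, 8), torch.randint(0, 10, (1000,)))
BS = 200
num_batches = len(train_data) // BS
sequence = list(range(len(train_data)))
np.random.shuffle(sequence)  # Shuffle the sample indices
subsets = [Subset(train_data, sequence[i * BS: (i + 1) * BS]) for i in range(num_batches)]
train_loader = [DataLoader(sub, batch_size=BS) for sub in subsets]

# Each loader yields exactly one batch, so batch i is reachable by indexing:
x, y = next(iter(train_loader[2]))
print(x.shape)  # torch.Size([200, 3, 8, 8])
```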

Which works during training just fine.

However, when I attempted another way to manually split the training data I got different end results, even with all the same parameters and the following settings:

device = torch.device('cuda')
torch.manual_seed(0)
np.random.seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.cuda.empty_cache()

This time, the only change was how I split the training data:

train_data = list(datasets.ANY(root='data', transform=T_train, download=True))  # Cast into a list
BS = 200
num_batches = len(train_data) // BS
np.random.shuffle(train_data)  # To shuffle the training data
train_loader = [DataLoader(train_data[i*BS: (i+1)*BS], batch_size=BS) for i in range(num_batches)]

But this gives me different results than the first approach, even though (I believe) both approaches are identical ways of manually splitting the training data into batches. I even tried not shuffling at all and loading the data just as it is, but I still got different results (85.2% vs. 81.98% accuracy). I even manually checked that the batches produced by the two methods contain the same images, in the same order.

Not only that, when I load the training data the conventional way as follows:

BS = 200
train_loader = DataLoader(train_data, batch_size=BS, shuffle=True)

I get even more drastically different results!

Can somebody please explain to me why these differences arise, and how to fix them?

You have confirmed that “the loaded images from the batches match” between the first two methods. What happens afterward? Do you set your seed (e.g. torch.manual_seed(0)) before training?

Hey, yeah — the order of the images within each batch is the same across both approaches. And before training, I’ve set the following:

device = torch.device('cuda')
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
np.random.seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.cuda.empty_cache()

Moreover, I would like to share an update that might shed some light:

The T_train transformation contains some random transformations (horizontal flip, crop). With those enabled, training with the first train_loader took 24.79 s/it, while the second train_loader took 10.88 s/it (even though both perform exactly the same number of parameter updates/steps). When I removed the random transformations from T_train, the first train_loader dropped to 16.99 s/it, while the second stayed at 10.87 s/it. So the second train_loader takes the same time with or without the random transformations. I then brought the random transformations back and visualized the image outputs from the second train_loader to verify whether they were being applied, and indeed they were! So this is really confusing, and I’m not quite sure why the two loaders give different results.
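One way to pin down where the time difference comes from is a tiny toy dataset whose “transform” counts its own calls (everything below is a hypothetical stand-in, not datasets.ANY or T_train): it shows that a DataLoader over the Dataset re-runs __getitem__ every epoch, while list(dataset) runs it once up front and later epochs reuse the stored tensors.

```python
import torch
from torch.utils.data import DataLoader, Dataset

calls = {"n": 0}

class TinySet(Dataset):
    """Toy dataset whose __getitem__ counts calls, standing in for a random transform."""
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        if idx >= self.n:
            raise IndexError  # lets list(ds) terminate via the sequence protocol
        calls["n"] += 1       # this is where a random transform would run
        return torch.zeros(1), 0

ds = TinySet(4)

# Lazy: a DataLoader over the Dataset calls __getitem__ on every epoch.
loader = DataLoader(ds, batch_size=2)
for _ in range(3):
    for batch in loader:
        pass
print(calls["n"])  # 12 -> the "transform" ran in all three epochs

# Eager: list(ds) calls __getitem__ once per sample, at list-creation time.
calls["n"] = 0
cached = list(ds)
loader2 = DataLoader(cached, batch_size=2)
for _ in range(3):
    for batch in loader2:
        pass
print(calls["n"])  # 4 -> the "transform" ran only once, when the list was built
```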

I am unsure why your second train_loader still applies the random transformations after you remove them. Perhaps you need to clear all the variables and re-run everything?

Aside from that, the other pitfall may be nondeterministic algorithms:
https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
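From that note, a short sketch of the relevant switches (torch.use_deterministic_algorithms is available in PyTorch 1.8+; the CUBLAS_WORKSPACE_CONFIG environment variable applies on CUDA 10.2+ and must be set before any CUDA work):

```python
import os
import torch

# Required for deterministic cuBLAS on CUDA >= 10.2, per the linked note.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)
torch.use_deterministic_algorithms(True)  # raise an error on nondeterministic ops
torch.backends.cudnn.benchmark = False

print(torch.are_deterministic_algorithms_enabled())  # True
```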