How to split your training data into indexable batches?

I’m trying to manually split my training data into individual batches in a way that lets me access any desired batch by indexing. I can’t rely on DataLoader for the batch splitting, since a DataLoader isn’t indexable. I’ve tried some approaches to achieve this, as per this link, but I’ve been getting some weird behavior.

So what is the proper way to implement this?

@Omar_AlSuwaidi, you should be able to use a simple list comprehension to achieve the indexing:

import math
import torch

X = torch.rand(1000, 10, 4)                      # dummy data: 1000 samples of shape (10, 4)
batch_size = 64
num_batches = math.ceil(X.size(0) / batch_size)  # ceil so the leftover samples form a final batch
X_list = [X[batch_size * y : batch_size * (y + 1)] for y in range(num_batches)]
print(X_list[0].size())

torch.Size([64, 10, 4])
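Note that because of math.ceil, the last batch is smaller whenever batch_size does not divide the number of samples; here it holds the remaining 1000 - 15 * 64 = 40 samples:

print(len(X_list))        # 16
print(X_list[-1].size())  # torch.Size([40, 10, 4])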


Hey, thanks for the response. Unfortunately, my training data isn’t a torch.Tensor; it comes from a torchvision datasets object. If you view the link in the question, you’ll get a better idea of what I’m talking about.

@Omar_AlSuwaidi
I don’t think there is any difference at the data level, because I did a 1:1 comparison:

import random

import matplotlib.pyplot as plt
import numpy as np
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

def train_func(x):
    # helper: load an image from path x as a tensor and report its size
    x = torch.Tensor(plt.imread(x))
    print(x.size())
    return x


train_data = datasets.Caltech101(root='data', download=True,
                                 transform=transforms.Compose([
                                     transforms.Resize((128, 128)),
                                     transforms.Grayscale(),
                                     transforms.ToTensor(),
                                 ]))
# Method 1: build each batch as a Subset of the dataset
BS = 4
num_batches = len(train_data) // BS
print("Num batches is {0}".format(num_batches))
sequence = list(range(len(train_data)))
np.random.shuffle(sequence)  # shuffle the training data
subsets = [Subset(train_data, sequence[i * BS: (i + 1) * BS]) for i in range(num_batches)]
train_loader = [DataLoader(sub, batch_size=BS) for sub in subsets]  # one DataLoader per batch of BS samples

# Method 2: cast the dataset to a plain list and slice batches out of it
BS = 4
num_batches = len(train_data) // BS
print("Num batches is {0}".format(num_batches))
train_data = list(train_data)  # materialize every (image, label) pair
np.random.shuffle(train_data)  # shuffle the training data
train_loader1 = [DataLoader(train_data[i * BS: (i + 1) * BS], batch_size=BS) for i in range(num_batches)]
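With either construction, a batch is now accessible by index. A quick sketch of pulling out the first batch (the shapes assume the Caltech101 pipeline above):

x, y = next(iter(train_loader[0]))  # first (images, labels) pair of batch 0
print(x.size())                     # torch.Size([4, 1, 128, 128]): BS grayscale 128x128 images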

Comparison Code

for i, loader in enumerate(train_loader):
    loader1 = list(train_loader1[i])  # materialize the matching batch from method 2
    for j, (x, y) in enumerate(loader):
        if np.sum((x != loader1[j][0]).detach().numpy().reshape(-1)) != 0:
            print("Mismatch at Loader {0} Data Index {1}".format(i, j))
print("Completed")

The batches are identical. Please share the entire script, as the error is probably in the way the models are getting initialized and trained. You need to ensure that there is a 1:1 match at model initialization and training as well.
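For example, re-seeding immediately before each model is constructed guarantees identical initial weights; a minimal sketch:

import torch
import torch.nn as nn

torch.manual_seed(42)  # re-seed right before building model A
model_a = nn.Linear(10, 2)
torch.manual_seed(42)  # same seed right before building model B
model_b = nn.Linear(10, 2)
print(torch.equal(model_a.weight, model_b.weight))  # True: identical initialization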


Hey, thanks for checking back. Well, yeah, that’s the whole point: dividing train_data into batches manually using the two methods gives drastically different results during training (the training procedure is quite simple; it’s exactly as shown in the link for both methods)!

The first method uses the Subset class to divide train_data into batches, while the second method casts train_data directly into a list and then slices batches out of it. While both are indeed identical at the data level (the order of the images in each batch is the same), training any model with the same weight initialization and random seeds produces very different outcomes (method 1 always gives better results for some reason).

If you read the “UPDATE:” in the link, you might find the behavior mentioned there quite interesting; even though the images from both methods get transformed using T_train, something weird seems to be going on with method 2. I believe the information below “UPDATE:” is the key to this whole discussion, but I’m not sure exactly what.

@Omar_AlSuwaidi In that case, can you provide the T_train transform? I feel this is a case of different random seeds; there are two additional seeds that you need to set, apart from the ones you have already set.


Yeah sure!

T_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, 4),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

But if it were due to different random seeds, one wouldn’t expect such a drastic difference in accuracy. Moreover, the time taken during training is very different between the two (the second method is always much faster, regardless of what’s in T_train); also, method 2’s performance gets worse when you add in the random H-flips and random crops.
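In fact, here’s a minimal sketch that might explain both observations. The ToyDataset below is a hypothetical stand-in for the real data, not code from the link: casting a dataset to a list runs __getitem__ (and hence the random transforms) exactly once per sample, so every epoch reuses the same frozen tensors, while a Subset re-runs the transforms on each access:

import torch
from torch.utils.data import Dataset, Subset
from torchvision import transforms

class ToyDataset(Dataset):  # hypothetical stand-in for the real dataset
    def __init__(self, transform):
        self.data = torch.rand(8, 3, 32, 32)
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.transform(self.data[idx]), 0  # dummy label

ds = ToyDataset(transforms.RandomHorizontalFlip(p=0.5))

frozen = list(ds)            # method 2: __getitem__ (and the random flip) runs once, right here
lazy = Subset(ds, range(8))  # method 1: __getitem__ runs again on every access

print(torch.equal(frozen[0][0], frozen[0][0]))  # always True: the same cached tensor is reused
print(torch.equal(lazy[0][0], lazy[0][0]))      # False roughly half the time: a fresh flip per access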