I’m trying to manually split my training data into individual batches so that I can access any batch by index. I can’t rely on DataLoader to do the batch splitting, since a DataLoader is not indexable. I’ve tried some approaches to achieve this as per this link, but I’ve been getting some weird behavior.
So what is the proper way to implement this?
@Omar_AlSuwaidi, you should be able to use a simple list to get indexable batches:
import math
import torch
import torch.nn as nn
X = torch.rand(1000,10, 4)
batch_size = 64
num_batches = math.ceil(X.size()[0]/batch_size)
X_list = [X[batch_size*y:batch_size*(y+1),:,:] for y in range(num_batches)]
print(X_list[0].size())  # torch.Size([64, 10, 4])
Hey, thanks for the response. Unfortunately, my training data is not a torch.Tensor; it comes from a datasets object. If you look at the link in the question, you’ll get a better idea of what I’m talking about.
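For a dataset object rather than a tensor, one possible approach is a small wrapper that exposes batches by index. The sketch below is only an illustration: `IndexableBatches` is a hypothetical helper (not part of PyTorch), and the toy list of pairs stands in for any map-style dataset that implements `__len__` and `__getitem__`.

```python
import math
import random


class IndexableBatches:
    """Hypothetical helper: view a map-style dataset as an indexable sequence of batches."""

    def __init__(self, dataset, batch_size, shuffle=False, seed=0):
        self.dataset = dataset
        self.batch_size = batch_size
        self.order = list(range(len(dataset)))
        if shuffle:
            # Shuffle the sample indices once, with a local RNG for reproducibility.
            random.Random(seed).shuffle(self.order)

    def __len__(self):
        # The last batch may be smaller than batch_size.
        return math.ceil(len(self.order) / self.batch_size)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        idx = self.order[i * self.batch_size:(i + 1) * self.batch_size]
        # Samples are fetched lazily, so any dataset transform runs on every access.
        return [self.dataset[j] for j in idx]


# Toy stand-in for (image, label) pairs; a real dataset object would go here.
toy_dataset = [(float(k), k % 2) for k in range(10)]
batches = IndexableBatches(toy_dataset, batch_size=4)
print(len(batches))
print(batches[2])
```

Each batch comes back as a plain list of samples; stacking them into tensors (for example with a collate function) is left out to keep the sketch free of dependencies.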
@Omar_AlSuwaidi, I don’t think there is any difference at the data level, because I did a 1:1 comparison:
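import random
import numpy as np
import torch
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Fix the seeds so both methods see the same random state
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)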
def train_func(x):
    x = torch.Tensor(plt.imread(x))
    print(x.size())
    return x

train_data = datasets.Caltech101(root='data', transform=transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.Grayscale(),
    transforms.ToTensor(),
]), download=True)
# Method 1: build each batch from a Subset of shuffled indices
BS = 4
num_batches = len(train_data) // BS
print("Num batches is {0}".format(num_batches))
sequence = list(range(len(train_data)))
np.random.shuffle(sequence) # To shuffle the training data
subsets = [Subset(train_data, sequence[i * BS: (i + 1) * BS]) for i in range(num_batches)]
train_loader = [DataLoader(sub, batch_size=BS) for sub in subsets] # Create multiple batches, each with BS number of samples
# Method 2: slice train_data directly (assumes train_data has already been cast to a list)
BS = 4
num_batches = len(train_data) // BS
print("Num batches is {0}".format(num_batches))
np.random.shuffle(train_data) # To shuffle the training data
train_loader1 = [DataLoader(train_data[i*BS: (i+1)*BS], batch_size=BS) for i in range(num_batches)]
Comparison Code
for i, loader in enumerate(train_loader):
    loader1 = list(train_loader1[i])
    for j, (x, y) in enumerate(loader):
        if np.sum((x != loader1[j][0]).detach().numpy().reshape(-1)) != 0:
            print("Mismatch at Loader {0} Data Index {1}".format(i, j))
print("Completed")
The comparison reports no mismatches, so the batches are identical. Please share the entire script, as the error is probably in the way the models are initialized and trained. You need to ensure a 1:1 match in model initialization and training as well.
Hey, thanks for checking back. Well yeah, that’s the whole point: dividing train_data into batches manually with the two methods gives drastically different results during training (the training procedure is quite simple, and it’s exactly as shown in the link for both methods)!
The first method uses the Subset class to divide train_data into batches, while the second method casts train_data directly into a list and then indexes multiple batches out of it. While both are indeed the same at the data level (the order of the images in each batch is identical), training any model with the same weight initialization and random seeds gives very different results (method 1 always gives better results for some reason).
If you read the “UPDATE:” in the link, you might find the behavior mentioned there quite interesting: even though the images from both methods get transformed using T_train, it seems that something weird is going on with method 2. I believe the information below “UPDATE:” is the key to this whole discussion, but I’m not sure what exactly.
@Omar_AlSuwaidi, in that case, can you provide the T_train function? I feel this is a case of different random seeds. There are two additional seeds that you need to set, apart from the ones you have already set.
Yeah sure!
T_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, 4),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
But if this were due to different random seeds, one shouldn’t expect such a drastic difference in accuracy. Moreover, the training time is very different between the two (the second method always seems to be much faster, regardless of what’s in T_train; also, method 2’s performance gets worse when you add the random H-flips and random crops).
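One possible explanation, which the thread does not confirm, is that casting a dataset to a list applies the transform once per sample and stores the result. Random augmentations would then be frozen for all later epochs, and no transform cost would be paid during training. The toy sketch below illustrates that mechanism without PyTorch: `ToyDataset` and `random_flip` are hypothetical stand-ins for a dataset that applies its transform inside `__getitem__` and for a random augmentation.

```python
import random


class ToyDataset:
    """Stand-in for a dataset whose transform runs inside __getitem__."""

    def __init__(self, values, transform):
        self.values = values
        self.transform = transform
        self.calls = 0  # counts how often the transform is applied

    def __len__(self):
        return len(self.values)

    def __getitem__(self, i):
        if not 0 <= i < len(self.values):
            raise IndexError(i)
        self.calls += 1
        return self.transform(self.values[i])


def random_flip(x):
    # Toy analogue of a random augmentation: negate with probability 0.5.
    return -x if random.random() < 0.5 else x


random.seed(0)
lazy = ToyDataset(list(range(1, 9)), random_flip)

# "Method 2" analogue: materialize the dataset; the transform runs once per sample here.
frozen = list(lazy)
calls_after_list = lazy.calls

# Reading the list again never re-runs the transform, so every epoch is identical.
epochs_frozen = [list(frozen) for _ in range(3)]

# "Method 1" analogue: index the dataset lazily; the transform runs on every access.
epochs_lazy = [[lazy[i] for i in range(len(lazy))] for _ in range(3)]

print("transform calls while building the list:", calls_after_list)
print("transform calls after three lazy epochs:", lazy.calls)
print("frozen epochs:", epochs_frozen)
print("lazy epochs:  ", epochs_lazy)
```

If this is what happens in method 2, it would be consistent with both observations above: training is faster because the transform is never re-applied, and random flips or crops stop acting as augmentation because each image keeps a single fixed variant.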