Iterating through two dataloaders with different-sized datasets

Continuing the discussion from Memory error when trying to train with two different dataloaders in parallel:

Hello, I am trying to do the same thing. I am not getting any memory error, but training is very slow: one epoch takes around 2 hours. The larger dataset has 25000 images and the smaller dataset has 5000 images. The images in both datasets have shape 512×512×3 and I am using a batch size of 32. I am training on three 1080 Ti GPUs, and the machine has 512 GB of memory. Please help me if there is a more efficient way to iterate through the two dataloaders.

@ptrblck Please guide me if you have any suggestions. Thank you

What you can do is make the dataloaders the same length, i.e. adjust the batch sizes so that both dataloaders yield the same number of batches.

For example, for 25000 images you can use a batch size of 25 and for 5000 images a batch size of 5, so both dataloaders will have the same length (1000 batches).
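
In case a sketch helps, here is that idea in code (the names large_ds and small_ds are placeholders for your two datasets, not from your post):

from torch.utils.data import DataLoader

num_batches = 1000                        # common number of batches per epoch
large_bs = len(large_ds) // num_batches   # 25000 // 1000 = 25
small_bs = len(small_ds) // num_batches   # 5000 // 1000 = 5

large_loader = DataLoader(large_ds, batch_size=large_bs, shuffle=True)
small_loader = DataLoader(small_ds, batch_size=small_bs, shuffle=True)

# Both loaders now yield 1000 batches, so zip() does not drop data from the larger one
for (x1, y1), (x2, y2) in zip(large_loader, small_loader):
    ...  # forward/backward pass goes here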


I tried this, but the speed is the same as before. Previously I did something like this:
for index, data in enumerate(zip(dataloader1, cycle(dataloader2))):
dataloader2 is the dataloader for the smaller dataset, so to prevent it from being exhausted before dataloader1, I use the cycle function from itertools, which keeps repeating dataloader2 until dataloader1 is exhausted.
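
For reference, that pattern looks like the sketch below, together with an alternative that avoids a drawback of cycle (it caches the batches from the first pass over dataloader2, so its shuffling is only applied once and the cached batches stay in memory). This is just a sketch reusing your dataloader1/dataloader2 names:

from itertools import cycle

# zip stops when dataloader1 (the larger one) is exhausted;
# cycle keeps repeating batches from dataloader2 until then.
for index, data in enumerate(zip(dataloader1, cycle(dataloader2))):
    (x1, y1), (x2, y2) = data

# Alternative: restart the small loader manually, so it is reshuffled on every pass
small_iter = iter(dataloader2)
for x1, y1 in dataloader1:
    try:
        x2, y2 = next(small_iter)
    except StopIteration:
        small_iter = iter(dataloader2)   # fresh iterator, fresh shuffle
        x2, y2 = next(small_iter)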

Can you share your whole code?

from torch.utils.data import DataLoader

# One loader pair per dataset (cls and reg tasks)
data_loader1_train = DataLoader(train_cls, batch_size=48, shuffle=False, sampler=sampler, num_workers=8, pin_memory=True)
data_loader1_valid = DataLoader(valid_cls, batch_size=48, shuffle=True, num_workers=8, pin_memory=True)
data_loader2_train = DataLoader(train_reg, batch_size=8, shuffle=True, num_workers=4, pin_memory=True)
data_loader2_valid = DataLoader(valid_reg, batch_size=8, shuffle=True, num_workers=4, pin_memory=True)

for epoch in range(epochs):
    model.train()
    print(model.training)
    for index, data in enumerate(zip(data_loader1_train, data_loader2_train)):
        # Unpack one batch from each loader and move it to the GPU
        x1, y1, x2, y2 = data[0][0], data[0][1], data[1][0], data[1][1]
        x1, y1, x2, y2 = x1.to(device), y1.to(device), x2.to(device), y2.to(device)
        x1, x2 = x1.double(), x2.double()
        # print(x1.shape, y1.shape, x2.shape, y2.shape)

        # The model has two heads; each batch uses one of them
        pred1, _ = model(x1)
        _, pred2 = model(x2)

        loss_1 = loss1(pred1, y1).double()
        loss_2 = loss2(pred2, y2).double()
        total_loss = loss_1 + loss_2
        print(f"Loss1:{loss_1} Loss2:{loss_2} TotalLoss:{total_loss}")

        total_loss.backward()
        optimizer.step()
        optimizer.zero_grad()


Since you have 25000 images, I can suggest an easy speed-up.
When you read the images you probably use Image.open from PIL or cv2.imread.
Instead, store the decoded images as numpy arrays on disk together with their corresponding labels.
This way the time per epoch is reduced, because the dataset won't have to read and decode the images again for every epoch; it will directly load the precomputed numpy arrays.
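
A rough sketch of this precompute-once idea (using .npz files rather than a CSV, since a CSV of raw pixel values would be large and slow to parse; image_paths and labels are placeholder names, not from your code):

import numpy as np
from PIL import Image
from torch.utils.data import Dataset

def cache_images(image_paths, labels, out_path="cached_train.npz"):
    # Decode every image once and store the raw arrays plus labels on disk
    images = np.stack([np.asarray(Image.open(p).convert("RGB")) for p in image_paths])
    np.savez(out_path, images=images, labels=np.asarray(labels))

class CachedDataset(Dataset):
    def __init__(self, npz_path, transform=None):
        data = np.load(npz_path)   # loads precomputed arrays, no per-epoch decoding
        self.images = data["images"]
        self.labels = data["labels"]
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img, label = self.images[idx], self.labels[idx]
        if self.transform is not None:
            img = self.transform(img)
        return img, label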

Till then I’ll try to debug your code!

Cheers!