Combine two dataloaders

sparshgarg23 · July 8, 2022, 8:15am

Given two datasets of length 8000 and 1480 and their corresponding train and validation loaders, I would like to create a new dataloader that allows me to iterate through both of them.

I tried to concatenate the datasets as shown below:

from torch.utils.data import Dataset, DataLoader

class custom_dataset(Dataset):
  def __init__(self,*data_sets):
    self.datasets=data_sets
  def __getitem__(self,i):
    return tuple(d[i] for d in self.datasets)
  def __len__(self):
    return min(len(d) for d in self.datasets)

new_dataset=custom_dataset(dataset,dataset1)

new_train_loader=DataLoader(new_dataset,batch_size=16,sampler=train_sampler,num_workers=2,drop_last=True)
new_val_loader=DataLoader(new_dataset,batch_size=16,sampler=val_sampler,num_workers=2,drop_last=True)

print("Training samples {},Val Samples {}".format(len(new_train_loader),len(new_val_loader)))

Training samples 74,Val Samples 18 (the same counts as for dataset 2 on its own)
Training samples 400,Val Samples 100 (for dataset 1 on its own)

Since we are concatenating the datasets, shouldn't the final number of training samples be 474 and not 74?

Also, when I print the length of the dataset it comes out to be 1480. If we concatenate two datasets with 8000 and 1480 samples, shouldn't the final dataset have length 9480 and not 1480?

Although I am able to visualize samples from both datasets, I suspect that the majority of the samples in the first dataset have not been included.
Any suggestions on what is wrong with my code?

ptrblck · July 8, 2022, 8:23am

You are not concatenating the datasets, but are returning tuples from both while using the min(len()) of both datasets.
If you want to concatenate both datasets (i.e. iterate through them sequentially), use ConcatDataset.
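
A minimal sketch of the difference, assuming toy TensorDataset stand-ins with the sizes from the question (the tensors and labels here are made up for illustration and are not the original data):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Hypothetical stand-ins for the two datasets (8000 and 1480 samples).
dataset = TensorDataset(torch.randn(8000, 3), torch.zeros(8000, dtype=torch.long))
dataset1 = TensorDataset(torch.randn(1480, 3), torch.ones(1480, dtype=torch.long))

# ConcatDataset chains the datasets: indices 0..7999 map to `dataset`,
# indices 8000..9479 map to `dataset1`.
concat = ConcatDataset([dataset, dataset1])
print(len(concat))  # 9480

loader = DataLoader(concat, batch_size=16, shuffle=True, drop_last=True)
print(len(loader))  # 592 batches (9480 // 16)

# Each item is a single (sample, label) pair from one of the two datasets.
sample, label = concat[8000]
print(label.item())  # 1, i.e. the first sample of dataset1
```

Note that each item comes from exactly one dataset, so this does not pair samples from the two sources.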

sparshgarg23 · July 8, 2022, 9:15am

Thanks for the suggestion, I will look into it.
I was able to make it work, but it ends up returning only one image, which belongs to either dataset 1 or dataset 2.

Is there a way to combine my previous code with the principles of ConcatDataset so that I can iterate through it sequentially and have it return tuples from both datasets?
That is, each item of the final dataset would be
{image_1,label_1,image_2,label_2}

ptrblck · July 8, 2022, 3:59pm

I don’t understand what the difference to your current implementation would be as you are currently already returning tuples and are thus not concatenating the datasets.
Since both datasets might have a different length, you are using the min of them for the iteration, which also sounds right.
E.g. assuming the first dataset has 2 samples, while the second one has 4, how should the tuple creation look in this case if you are not using the min length?

sparshgarg23 · July 8, 2022, 4:20pm

Although the tuple creation is correct, when I start training only the first 74 of the 400 batches from dataset 1 are used. The rest are ignored.

As a result, during training the training error decreases while the validation error increases.
I would like to use all the training samples from both datasets and pass them through the model.
The reason I need to iterate through both datasets is to compute the feature similarity between samples from dataset 1 and dataset 2.

I tried using iter as shown in the implementation below:

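# Number of batches in one full pass over each loader.
len_source=len(train_loader_1)
len_target=len(val_loader)
iter_source=iter(train_loader_1)
iter_target=iter(val_loader)

# Each step consumes one batch per loader, so `epochs` here counts iterations.
for iter_num in range(epochs):
  optimizer.zero_grad()
  # Restart an iterator once it has yielded all of its batches.
  if iter_num%len_source==0:
    iter_source=iter(train_loader_1)
  if iter_num%len_target==0:
    iter_target=iter(val_loader)
  # Python 3 iterators are advanced with the built-in next().
  data_source=next(iter_source)
  data_target=next(iter_target)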
  input_source,input_label=data_source
  input_tgt,label_tgt=data_target
  input_source=input_source.to(device)
  input_label=input_label.to(device)
  input_tgt=input_tgt.to(device)
  label_tgt=label_tgt.to(device)
  src_feature,src_output=model(input_source)
  tgt_feature,tgt_output=model(input_tgt)

This ends up working well, but I am not sure whether all the batches are being used or just one batch per epoch.
As I understand it, if epochs is set to 10, then calling next on each iterator will only consume 10 batches from each dataset.

ptrblck · July 8, 2022, 9:16pm

How would you like to draw samples from both datasets if their lengths are unequal?
Currently you are limiting the length to the min length of both datasets as already explained, which is one possibility. If you want to repeat the smaller dataset until all samples from the larger one are used, you could use a modulo operation in the indexing of the smaller dataset.
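
One possible loop-level variant of repeating the smaller dataset is to restart its iterator whenever it runs out. This is only a sketch under assumptions: the toy datasets, batch size, and the names source_loader and target_loader are hypothetical and not taken from the code above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 20 "source" samples and 8 "target" samples.
source = TensorDataset(torch.arange(20).view(20, 1))
target = TensorDataset(torch.arange(8).view(8, 1))

source_loader = DataLoader(source, batch_size=4)  # 5 batches per pass
target_loader = DataLoader(target, batch_size=4)  # 2 batches per pass

target_iter = iter(target_loader)
steps = 0
target_restarts = 0

# One full pass over the larger loader; the smaller one is restarted on demand.
for (src_batch,) in source_loader:
    try:
        (tgt_batch,) = next(target_iter)
    except StopIteration:
        # The smaller loader is exhausted: start a new pass over it.
        target_iter = iter(target_loader)
        target_restarts += 1
        (tgt_batch,) = next(target_iter)
    steps += 1

print(steps, target_restarts)  # 5 2
```

With this pattern every batch of the larger loader is seen once per pass, and the number of optimizer steps is tied to len(source_loader) rather than to the number of epochs.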

sparshgarg23 · July 9, 2022, 2:07pm

If you want to repeat the smaller dataset until all samples from the larger one are used, you could use a modulo operation in the indexing of the smaller dataset.

Could you let me know how to implement the modulo operation for indexing the smaller dataset?
Should I use it in the training loop or in the custom dataset class?

ptrblck · July 10, 2022, 12:44am

You can use it in the __getitem__:

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.small_dataset = torch.arange(10).view(10, 1)
        self.large_dataset = torch.arange(20).view(20, 1)
        
    def __getitem__(self, index):
        a = self.small_dataset[index % len(self.small_dataset)]
        b = self.large_dataset[index]
        return a, b
    
    def __len__(self):
        return len(self.large_dataset)

dataset = MyDataset()
print(len(dataset))
# 20

loader = DataLoader(dataset, batch_size=2)

for idx, (a, b) in enumerate(loader):
    print("iter {}\na {}\nb {}".format(idx, a, b))

jS5t3r · September 21, 2023, 6:05pm

Can I somehow merge a and b in the class MyDataset?

a = (data_a, targets_a)
b = (data_b, targets_b)

res = (data_a + data_b, targets_a + targets_b)

ptrblck · September 21, 2023, 6:52pm

Assuming data_x and targets_x are tensors, you can concatenate them via torch.cat and pass the new tensors to the custom Dataset.
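
A minimal sketch of this, assuming all samples share the same feature shape. The tensor sizes are made up, and TensorDataset is used here as a stand-in for the custom Dataset:

```python
import torch
from torch.utils.data import TensorDataset

# Hypothetical data: 10 samples in a, 20 samples in b, 3 features each.
data_a, targets_a = torch.randn(10, 3), torch.zeros(10, dtype=torch.long)
data_b, targets_b = torch.randn(20, 3), torch.ones(20, dtype=torch.long)

# Concatenate along the sample dimension (dim=0).
data = torch.cat([data_a, data_b], dim=0)
targets = torch.cat([targets_a, targets_b], dim=0)

merged = TensorDataset(data, targets)
print(len(merged))  # 30
print(data.shape)   # torch.Size([30, 3])
```

Note that `+` on tensors performs element-wise addition rather than concatenation, which is why torch.cat is needed here.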

jS5t3r · September 21, 2023, 6:58pm

I understand. I want to have exactly 50:50 from both datasets.

I did it in the for loop.
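
The loop itself was not posted. One way such a 50:50 split could look, purely as an assumed sketch with hypothetical toy datasets and names, is to draw half a batch from each loader and concatenate the halves:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical datasets; labels 0 and 1 mark which dataset a sample came from.
ds_a = TensorDataset(torch.randn(40, 3), torch.zeros(40, dtype=torch.long))
ds_b = TensorDataset(torch.randn(24, 3), torch.ones(24, dtype=torch.long))

batch_size = 8
half = batch_size // 2
loader_a = DataLoader(ds_a, batch_size=half, shuffle=True, drop_last=True)
loader_b = DataLoader(ds_b, batch_size=half, shuffle=True, drop_last=True)

num_batches = 0
# zip stops at the shorter loader, so every batch is exactly half a, half b.
for (x_a, y_a), (x_b, y_b) in zip(loader_a, loader_b):
    x = torch.cat([x_a, x_b], dim=0)
    y = torch.cat([y_a, y_b], dim=0)
    num_batches += 1

print(num_batches)  # 6
print(x.shape)      # torch.Size([8, 3])
```

Because zip ends with the shorter loader, part of the larger dataset goes unused in each pass unless the smaller loader is restarted or indexed with a modulo as shown earlier in the thread.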