I need to iterate simultaneously over multiple datasets (say, 2 datasets), keeping the elements of each of them isolated: each batch must contain only elements of one dataset, and at each step I want to work with one batch from each dataset. I think torch.utils.data.TensorDataset can be the right tool for this, for example:
```python
import torch
from torch.utils.data import DataLoader

dataset = torch.utils.data.TensorDataset(dataset1, dataset2)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
for index, (xb1, xb2) in enumerate(dataloader):
    ...
```
where xb1 refers to the input data and target associated with one of the 2 datasets.
My first question is: have I understood the use of torch.utils.data.TensorDataset correctly? Does this approach solve my problem?
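To make the question concrete, here is a runnable version of the snippet above with dummy tensors standing in for my two datasets (the shapes are made up for illustration):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Dummy stand-ins for the two datasets (made-up shapes, same length)
dataset1 = torch.randn(1000, 10)
dataset2 = torch.randn(1000, 5)

dataset = TensorDataset(dataset1, dataset2)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

for index, (xb1, xb2) in enumerate(dataloader):
    # xb1 contains only rows of dataset1, xb2 only rows of dataset2,
    # but both batches are drawn at the same (shuffled) indices
    print(index, xb1.shape, xb2.shape)
    break
```

Note that this requires the two tensors to have the same length, and each step yields both tensors indexed at the same positions.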
My second question is: how do I pass a sampler to the DataLoader in this situation? Can I, for example, define 2 index tensors Idx1 and Idx2 and give DataLoader an option like sampler=(Idx1, Idx2)?
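For reference, this is how I would use a single index tensor with one DataLoader (via torch.utils.data.SubsetRandomSampler); what I don't know is how to combine two of them in one loader:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, SubsetRandomSampler

# A single dataset with made-up shapes, just to show the sampler usage
data = torch.randn(100, 4)
targets = torch.randn(100, 1)
ds = TensorDataset(data, targets)

# Draw only from the first 50 indices, in random order
Idx1 = torch.arange(0, 50)
loader = DataLoader(ds, batch_size=16, sampler=SubsetRandomSampler(Idx1))

for xb, yb in loader:
    ...
```

(With a sampler, shuffle=True must not be passed, since the two options are mutually exclusive.)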
EDIT:
An alternative approach could be to create a DataLoader for each dataset, each one with its own sampler, and use zip() to iterate simultaneously over the 2 datasets. Is there a cleaner solution than that? (Also because I read (source) that cycle() and zip() might create a memory leak, especially when using image datasets!)
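For completeness, the zip() alternative I have in mind looks like this (a sketch with dummy tensors; Idx1 and Idx2 are my own index tensors for the two datasets):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, SubsetRandomSampler

# Two independent datasets of different lengths (made-up shapes)
ds1 = TensorDataset(torch.randn(200, 10), torch.randn(200, 1))
ds2 = TensorDataset(torch.randn(300, 10), torch.randn(300, 1))

Idx1 = torch.arange(200)
Idx2 = torch.arange(300)

loader1 = DataLoader(ds1, batch_size=32, sampler=SubsetRandomSampler(Idx1))
loader2 = DataLoader(ds2, batch_size=32, sampler=SubsetRandomSampler(Idx2))

# Each step yields one batch from each dataset, fully isolated;
# zip() stops when the shorter loader is exhausted
for (xb1, yb1), (xb2, yb2) in zip(loader1, loader2):
    ...
```

This keeps the batches isolated per dataset, but iteration ends with the shorter loader, which is part of why I am asking whether a cleaner solution exists.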