DataLoader: access two items at the same time

Hello,

I am looping over a DataLoader, but I want to access two items at the same time.
Example:

from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

dataset1 = ImageFolder(input_folder, transform=transform)
train_loader_a = DataLoader(dataset=dataset1, batch_size=batch_size, shuffle=train, drop_last=True, num_workers=num_workers)

dataset2 = ImageFolder(input_folder, transform=transform)
train_loader_b = DataLoader(dataset=dataset2, batch_size=batch_size, shuffle=train, drop_last=True, num_workers=num_workers)


for it, ((images_a, _), (images_b, _)) in enumerate(zip(train_loader_a, train_loader_b)):
    images_a, images_b = Variable(images_a.cuda()), Variable(images_b.cuda())

Here, in every iteration I am loading one tensor from train_loader_a and one tensor from train_loader_b.
What I want instead is to load two consecutive tensors from each DataLoader within the same iteration. For example:
train_loader_a has 5 images of names:

a_1
a_2
a_3
a_4
a_5

train_loader_b has 5 images with names:

b_1
b_2
b_3
b_4
b_5

for it = 0, I want to load a_1, a_2 and b_1, b_2
for it = 1, I want to load a_2, a_3 and b_2, b_3
for it = 2, I want to load a_3, a_4 and b_3, b_4

and so on

Is there a way to do so? I tried providing the indices like train_loader_a[it+1], but I received an error: TypeError: 'DataLoader' object does not support indexing

Are you shuffling the Datasets? If so, try to set shuffle=False and run it again.
However, I’m a bit skeptical whether your approach will always work, as you might want to use multiprocessing in your DataLoaders, which could lead to race conditions between the workers.

I would rather create a new Dataset, initialize both of your datasets in it, and deterministically return two samples in __getitem__.
The code would look like this:


from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, dataset1, dataset2):
        self.dataset1 = dataset1  # datasets should be sorted!
        self.dataset2 = dataset2

    def __getitem__(self, index):
        # return the sample at the same index from both datasets
        x1 = self.dataset1[index]
        x2 = self.dataset2[index]
        return x1, x2

    def __len__(self):
        return len(self.dataset1)  # assuming both datasets have the same length

dataset = MyDataset(dataset1, dataset2)
loader = DataLoader(dataset, batch_size=...)

Both your datasets created from ImageFolder should return sorted images.
If that’s not the case, you would have to pass the image folder paths to MyDataset, sort the image paths for both folders, and lazily load the images in __getitem__.
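
In case that isn’t guaranteed, a minimal sketch of that idea could look like this (SortedPairDataset, folder_a, and folder_b are made-up names; it assumes both folders contain only image files and equally many of them):

import os
from PIL import Image
from torch.utils.data import Dataset

class SortedPairDataset(Dataset):
    def __init__(self, folder_a, folder_b, transform=None):
        # sort the file paths so that index i refers to the same position in both folders
        self.paths_a = sorted(os.path.join(folder_a, f) for f in os.listdir(folder_a))
        self.paths_b = sorted(os.path.join(folder_b, f) for f in os.listdir(folder_b))
        assert len(self.paths_a) == len(self.paths_b)
        self.transform = transform

    def __getitem__(self, index):
        # lazily load the images only when this sample is requested
        img_a = Image.open(self.paths_a[index]).convert('RGB')
        img_b = Image.open(self.paths_b[index]).convert('RGB')
        if self.transform is not None:
            img_a = self.transform(img_a)
            img_b = self.transform(img_b)
        return img_a, img_b

    def __len__(self):
        return len(self.paths_a)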

I think this is not what I want.
What you have written will return one tensor from each dataset.

I want to return two consecutive tensors from the same dataset at the same time

so I already have two different loaders for both datasets

If you set the batch_size=2, you would get a1, b1, a2, b2. Would that work or do I still misunderstand you?

Ah okay, then without implementing your approach, if I set batch_size=2 in the beginning, I can get images_a[0] and images_a[1].

I tried it and it worked

Thanks a lot for your help

And of course, in order to ensure that the frames are consecutive, I have to make sure the list is sorted and shuffling is disabled, which I am already doing.
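
For reference, that access pattern would look roughly like this sketch (reusing train_loader_a and train_loader_b from the first post, but with batch_size=2 and shuffle=False; ImageFolder batches come as (images, labels) pairs, so the labels are unpacked away):

for it, ((images_a, _), (images_b, _)) in enumerate(zip(train_loader_a, train_loader_b)):
    # images_a and images_b each have shape [2, C, H, W]
    frame_a_0, frame_a_1 = images_a[0], images_a[1]  # two consecutive images from dataset A
    frame_b_0, frame_b_1 = images_b[0], images_b[1]  # two consecutive images from dataset B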

That could work, but I’m still concerned about the multiprocessing part, i.e. race conditions, which might mix up your order even though the paths are sorted.

Sorry, I don’t get what you mean.

I am not an expert with torch, so I don’t understand what you mean about race conditions or multiprocessing.

If you set num_workers > 0 in your DataLoader, multiple workers will be created and load your batches of data in the background.
E.g. for multiple workers you will get batch1 from worker1, batch2 from worker2 etc., then it will start with worker1 again.

I’m not sure if the order is enforced or if the next ready worker just puts its data into the queue.
In the latter case, the order might be broken, e.g. if worker2 finishes before worker1.
This would yield your data like:

a1, b1
a2, b2
a4, b4
a3, b3
...

Would this be troublesome or can you handle this case?

I have num_workers=4.
So you mean that, for example, I will receive

a1 a2, b1 b2
a2 a3, b2 b3

and then afterwards I can receive
a5 a6, b5 b6?

Apparently I’m mistaken and the order is enforced. At least I couldn’t get wrong ordering.
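
One way to check this yourself is to load a dataset of plain indices with several workers and compare the output against the sequential order, e.g. with a small sketch like this (IndexDataset is just a made-up helper for the test):

import torch
from torch.utils.data import Dataset, DataLoader

class IndexDataset(Dataset):
    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        return index

    def __len__(self):
        return self.n

if __name__ == '__main__':
    loader = DataLoader(IndexDataset(1000), batch_size=10, shuffle=False, num_workers=4)
    # concatenate all batches and compare against the sequential order
    seen = torch.cat([batch for batch in loader])
    print(torch.equal(seen, torch.arange(1000)))  # prints True if the order is preserved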

Hi @ptrblck,
I am referring to the same task, but a different matter, in another issue. Could you please have a look at it? Maybe you can help.

Hi folks, I was looking at this issue because I too want to combine multiple dataloaders:

So if from dataloaderA I get samples a_1, a_2, ..., a_batch_size,
and from dataloaderB I get samples b_1, b_2, ..., b_batch_size,

I want to have a dataloaderAB, where I get ((a_1, a_2, ..., a_batch_size), (b_1, b_2, ..., b_batch_size)) when I iterate over dataloaderAB.

The issue that prevents me from using the previously suggested solution of making a dataset that yields a tuple of items is that each dataloader uses a different collate function.

Any thoughts?

Assuming you want to load a single batch from each DataLoader, wouldn’t the original approach using zip work?
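
For example, something along these lines (a minimal sketch; ToyDataset, collate_a, and collate_b are placeholders for your real datasets and collate functions):

from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Placeholder for your real datasets; just returns integers."""
    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        return index

    def __len__(self):
        return self.n

def collate_a(samples):
    # stands in for dataloaderA's collate function
    return list(samples)

def collate_b(samples):
    # stands in for dataloaderB's collate function
    return tuple(samples)

loader_a = DataLoader(ToyDataset(10), batch_size=4, collate_fn=collate_a)
loader_b = DataLoader(ToyDataset(10), batch_size=4, collate_fn=collate_b)

# each iteration yields one batch from each loader, built with its own collate_fn;
# zip stops at the shorter of the two loaders
for batch_a, batch_b in zip(loader_a, loader_b):
    print(batch_a, batch_b)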