So, let’s say I load images from two folders, concatenate them, and pass the combined dataset to a DataLoader. Dataset1 has 100 images and Dataset2 has 100 images as well, and my batch size is 7. Ideally, I want roughly 14 batches from Dataset1 followed by roughly 14 batches from Dataset2 (with 100 images and a batch size of 7 it’s really 14 full batches plus a partial one per dataset), so about 28 batches in total.
That is, the first half of the batches should contain images from Dataset1 only and the second half images from Dataset2 only. Will ConcatDataset make sure that is the case?
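For context, here is roughly what my setup looks like (the folder paths and the use of ImageFolder are just placeholders for however the images are actually loaded):

from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms

# Hypothetical folders; any Dataset of images would do.
transform = transforms.ToTensor()
dataset1 = datasets.ImageFolder("folder1", transform=transform)
dataset2 = datasets.ImageFolder("folder2", transform=transform)

# Concatenate and hand the combined dataset to a single DataLoader.
combined = ConcatDataset([dataset1, dataset2])
loader = DataLoader(combined, batch_size=7, shuffle=True)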
It does not work like that.
Each dataset has a length, and ConcatDataset simply produces a dataset of length len1 + len2.
With shuffle=True, the DataLoader’s sampler draws indices at random from the whole combined range and builds each batch from them. If an index falls in the range of Dataset1, ConcatDataset calls dataset1.__getitem__ (after offsetting the index), and likewise for Dataset2.
Statistically speaking, you can expect each batch to be roughly half and half, since the indices come from a uniform distribution and len1 == len2, but nothing guarantees it.
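A quick way to see this for yourself, using tiny labeled tensors as hypothetical stand-ins for your two image folders:

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Tag every sample with its source: 0.0 for dataset1, 1.0 for dataset2.
dataset1 = TensorDataset(torch.zeros(100, 3))
dataset2 = TensorDataset(torch.ones(100, 3))

loader = DataLoader(ConcatDataset([dataset1, dataset2]),
                    batch_size=7, shuffle=True)

batch, = next(iter(loader))
print(batch[:, 0])  # almost always a mix of 0s and 1s, not a single source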
Thank you, that really explains the bad behavior of my neural net. So the obvious follow-up question is: is there a way to get my desired behavior? I was thinking of concatenating DataLoaders instead, since the DataLoader for each Dataset would yield batches containing images from that particular dataset only. After concatenating, I wouldn’t have to worry about a mix-up. Is anything like that possible in PyTorch?
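Something in this spirit, just to illustrate the idea (chain here is only a stand-in for whatever the proper mechanism would be, not something I have tried):

import itertools
from torch.utils.data import DataLoader

loader1 = DataLoader(dataset1, batch_size=7, shuffle=True)
loader2 = DataLoader(dataset2, batch_size=7, shuffle=True)

# Iterate the loaders back to back: every batch comes from exactly one dataset.
for batch in itertools.chain(loader1, loader2):
    ...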
Not as far as I know, but you can create a new dataset class that holds both datasets as members.
Then have __getitem__ return the two samples stacked along a new dimension, and reshape them back inside the batch, something like:
import torch
from torch.utils.data import Dataset

class BigDataset(Dataset):
    def __init__(self, dataset1, dataset2):
        # The same idx indexes both datasets, so lengths must match.
        assert len(dataset1) == len(dataset2)
        self.dataset1 = dataset1
        self.dataset2 = dataset2

    def __len__(self):
        return len(self.dataset1)

    def __getitem__(self, idx):
        # Stack one sample from each dataset along a new leading dim.
        return torch.stack([self.dataset1[idx], self.dataset2[idx]])

for batch in dataloader:
    # (B, 2, C, H, W) -> (2 * B, C, H, W)
    batch = batch.view(-1, *batch.shape[2:])
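Here dataloader is assumed to be an ordinary DataLoader wrapping BigDataset(dataset1, dataset2). With batch_size=7, each raw batch has shape (7, 2, C, H, W); the view flattens the pair dimension into the batch dimension, giving 14 samples, 7 from each dataset. That way every batch is guaranteed to be exactly half and half, rather than only on average.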
Wait, why should the lengths of both datasets be the same? Also, I hope you don’t mind, but I’m finding this code hard to follow. Could you walk me through it so I get a good idea of what you are trying to do?