Train simultaneously on two datasets

Hello,

I need to train using samples from two different datasets, so I initialize two DataLoaders:

train_loader_A = torch.utils.data.DataLoader(
             datasets.ImageFolder(traindir_A),
             batch_size=args.batch_size, shuffle=True,
             num_workers=args.workers, pin_memory=True)

train_loader_B = torch.utils.data.DataLoader(
             datasets.ImageFolder(traindir_B),
             batch_size=args.batch_size, shuffle=True,
             num_workers=args.workers, pin_memory=True)

What is the best way to draw samples from both iterators, so that I can use something like this:

for i, (input, target) in enumerate(train_loader):
…

Thanks.

I’d recommend creating a new dataset and concatenating the images there, so the copy will be done inside the worker processes:

class ConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        return tuple(d[i] for d in self.datasets)

    def __len__(self):
        return min(len(d) for d in self.datasets)

train_loader = torch.utils.data.DataLoader(
             ConcatDataset(
                 datasets.ImageFolder(traindir_A),
                 datasets.ImageFolder(traindir_B)
             ),
             batch_size=args.batch_size, shuffle=True,
             num_workers=args.workers, pin_memory=True)

for i, (input, target) in enumerate(train_loader):
    ... 
Perfect! Thank you for your help.

However, __getindex__ should be __getitem__, correct?

Yes, that’s a typo. Sorry. I’ve changed the code.

Any idea how to tell, on the merged data, whether a sample is coming from dataset A or dataset B in this case? I am trying to train a GAN-like network which needs to sample from two data sources, and therefore needs that label (A or B).

Sorry to wake up an old thread.
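
Since the ConcatDataset above returns one sample from each dataset per index, the position in the returned tuple already tells you the source. If you need an explicit A/B label attached to each sample, here is a minimal sketch (TaggedConcatDataset is a made-up name, not part of the thread):

import torch

class TaggedConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        # returns ((sample_from_A, 0), (sample_from_B, 1), ...)
        return tuple((d[i], src) for src, d in enumerate(self.datasets))

    def __len__(self):
        return min(len(d) for d in self.datasets)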

I’ve read the code of ConcatDataset. One question: how can we control the ratio of data coming from a specific dataset?

For example, dataset A contains 100 samples and dataset B contains 10000. How can we get 1:1 data from both A and B in one mini-batch? Does anybody have any idea? Thank you.

To add to platero’s reply, suppose for example that datasetA contains 100 elements and datasetB contains 10000. My impression is that the data loader will (in one epoch) create shuffled indices 1…100 for datasetA and shuffled indices 1…100 for datasetB and create batches from each of those (since the len of ConcatDataset is the minimum of the lengths of A and B). However, datasetB also has elements 101…10000, so these will never be accessed. Am I correct in my intuition here? If so, this doesn’t seem like a reasonable solution when one dataset is much smaller than the other.

The current implementation (the built-in torch.utils.data.ConcatDataset) does not discard data; it samples uniformly from all the concatenated datasets. This means that each dataset will be sampled with probability len(dataset)/len(all_datasets).

To change this behavior, I think one could use WeightedRandomSampler, setting the appropriate weights.
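
For instance, a sketch of that suggestion (dataset_A and dataset_B stand for the two ImageFolder datasets; this uses the built-in torch.utils.data.ConcatDataset, whose length is the sum of the two):

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

combined = torch.utils.data.ConcatDataset([dataset_A, dataset_B])

# one weight per sample, inversely proportional to the size of its dataset,
# so A and B are drawn roughly 1:1 per mini-batch
weights = torch.cat([
    torch.full((len(dataset_A),), 1.0 / len(dataset_A)),
    torch.full((len(dataset_B),), 1.0 / len(dataset_B)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)

loader = DataLoader(combined, batch_size=args.batch_size, sampler=sampler)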

Hi Dear Adam,
I want to train my model simultaneously on two datasets, and I want to pick batches in the same order with shuffle=True, but targets1 and targets2 are not the same. For example:


train_dl1 = torch.utils.data.DataLoader(train_ds1, batch_size=16, 
                                       shuffle=True, num_workers=8)
train_dl2 = torch.utils.data.DataLoader(train_ds2, batch_size=16, 
                                       shuffle=True, num_workers=8)
inputs1, targets1 = next(iter(train_dl1))
inputs2, targets2 = next(iter(train_dl2))

targets1
tensor([ 1,  1,  0,  1,  0,  0,  1,  1])

targets2
tensor([ 0,  0,  0,  0,  0,  0,  0,  1])

I want to get targets1 and targets2 in the same order, with the same shuffle. Do you have any idea? Can you help me?
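
If sample i in train_ds1 corresponds to sample i in train_ds2 (and the datasets have the same length), one way is to drive both loaders with identically seeded samplers; a sketch:

import torch
from torch.utils.data import DataLoader, RandomSampler

# identically seeded generators produce identical shuffled index orders
g1 = torch.Generator().manual_seed(0)
g2 = torch.Generator().manual_seed(0)

train_dl1 = DataLoader(train_ds1, batch_size=16, num_workers=8,
                       sampler=RandomSampler(train_ds1, generator=g1))
train_dl2 = DataLoader(train_ds2, batch_size=16, num_workers=8,
                       sampler=RandomSampler(train_ds2, generator=g2))

Note that this aligns the shuffled indices, not the targets themselves: if the two datasets are labeled differently, targets1 and targets2 can still differ.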

Hey! This really helps, but what if I have different samplers for the two data loaders? How do I train them simultaneously then?
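
One common pattern (a sketch; loader_A and loader_B stand for your two DataLoaders, each with its own sampler) is to iterate both in lockstep; zip stops when the shorter loader is exhausted:

for (inputs_A, targets_A), (inputs_B, targets_B) in zip(loader_A, loader_B):
    ...  # one training step using both batches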

Same question here; do you have any ideas? Thank you.

import os
import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms
from torch.utils.data import Dataset

class ConcatDataset(Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets
        self.data_files = os.listdir('../data/cat1/cat05')

    def __getitem__(self, i):
        return tuple(d[i] for d in self.datasets)

    def __len__(self):
        return min(len(d) for d in self.datasets)

# def loadImgs(des_dir='../data/', img_size=128, batchSize=4):

transform = transforms.Compose([
    transforms.Resize(128),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

dataset1 = dset.ImageFolder(root='../data/cat1/', transform=transform)
dataset2 = dset.ImageFolder(root='../data/cat2/', transform=transform)

dataloader = torch.utils.data.DataLoader(
    ConcatDataset(dataset1, dataset2),
    batch_size=4,  # how many samples per batch to load
    shuffle=True)

dataloadercat1 = torch.utils.data.DataLoader(
    dataset1,
    batch_size=4,
    shuffle=True)

# inputs1, targets1 = next(iter(dataset1))
# inputs2, targets2 = next(iter(dataset2))

for epoch in range(1):
    # each combined batch is a pair of (inputs, labels) tuples, one per dataset
    for i, ((inputs1, labels1), (inputs2, labels2)) in enumerate(dataloader):
        print("Batch all ", i, labels1, labels2)
    for i, (inputs, labels) in enumerate(dataloadercat1):
        print("Batch cat ", i)
print(ConcatDataset(dataset1, dataset2).datasets)

After implementing ConcatDataset on our dataset, only the smaller subfolder is considered during batch formation, so some images of the other category are missed out. Do you have any idea how to solve this problem?

When I applied this concept to my dataset, only the smaller dataset's size was considered at batch-formation time.
E.g., my data folder has two subfolders, cat1 and cat2, with 70 images in cat1 and 30 images in cat2. The DataLoader only considers 30 images per category in batch formation, so 40 images of cat1 are never used. Do you have a solution that considers the remaining images as well?
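
One possible workaround, as a sketch (not from this thread): let __len__ return the larger size and wrap the index modulo each dataset's own length, so all 70 cat1 images are visited and the 30 cat2 images repeat:

import torch

class ConcatDatasetWrapped(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        # wrap the index so smaller datasets repeat instead of truncating
        return tuple(d[i % len(d)] for d in self.datasets)

    def __len__(self):
        return max(len(d) for d in self.datasets)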

Hi there, I have managed to use two datasets by creating a custom dataset that takes in two root directories:

import glob
import numpy as np
import PIL.Image
from skimage import io
from torch.utils.data import Dataset

class dataset_maker(Dataset):
    def __init__(self, root_dir1, root_dir2, transform=None):
        self.root_dir1 = root_dir1
        self.root_dir2 = root_dir2
        self.filelist1 = glob.glob(root_dir1 + '*.png')
        self.filelist2 = glob.glob(root_dir2 + '*.png')
        self.transform = transform

    def __len__(self):
        # truncate to the shorter of the two file lists
        return min(len(self.filelist1), len(self.filelist2))

    def __getitem__(self, idx):
        # rescale 16-bit images into the 8-bit range
        sample1 = io.imread(self.filelist1[idx]) / 65535 * 255
        sample2 = io.imread(self.filelist2[idx]) / 65535 * 255
        sample1 = PIL.Image.fromarray(np.uint8(sample1))
        sample2 = PIL.Image.fromarray(np.uint8(sample2))
        if self.transform:
            sample1 = self.transform(sample1)
            sample2 = self.transform(sample2)
        return sample1, sample2

Then instantiate the combined dataset and make a DataLoader from it:
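
(The construction of combined_dataset was omitted from the original post; a hypothetical instantiation of the dataset_maker class above, with placeholder paths and transform:)

combined_dataset = dataset_maker('../data/hq/', '../data/lq/', transform=my_transform)  # paths and transform are placeholders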

dataloader = DataLoader(combined_dataset, batch_size=3, shuffle=True, num_workers=4)

Finally, I fetch the data in the training loop:

for epoch in range(10):
    running_loss=0.0
    
    #get the data
    for batch_num, (hq_batch,Lq_batch) in enumerate(dataloader):
        print(batch_num, hq_batch.shape, Lq_batch.shape)

The output is shown below:

0 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
1 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
2 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
3 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
4 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
5 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
6 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
7 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
8 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
9 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
10 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
11 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
12 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])

Hope this solves the problem!

How is this combined_dataset designed? Is it meant to be an instance of ConcatDataset? I got this error while trying to form the combined dataset using dset.ImageFolder: “TypeError: expected str, bytes or os.PathLike object, not ConcatDataset”

I’m not sure the for loop is set up correctly, since __getitem__ returns ((data point for A, label for A), (data point for B, label for B)).
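
A minimal sketch of a loop unpacking that matches that return structure:

for i, ((input_A, target_A), (input_B, target_B)) in enumerate(train_loader):
    ...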

How do you train your model?
You passed batch_size=3 to the DataLoader, but you got two batches of 3 samples each. How do you use them to train the model?

Did you find a solution for this, i.e., when the datasets are not the same size?
Another question: can we use ConcatDataset for more than two datasets?

Hi there,

Suppose I have two training datasets of different sizes and I am trying to train a network on both simultaneously. Can I do this? Also, I need to keep track of which dataset each image comes from, to find the loss after each iteration by the equation:

[equation was posted as an image and is not reproduced here]

where L0 & L1 are the lengths of the datasets and lambda is a balancing constant.

Thank you.