Load Data and Train simultaneously on two datasets

mostafaaminnaji · May 20, 2018, 1:46pm

Hello Friends
I want to train my models simultaneously on two datasets, and I want to pick batches in the same order with shuffle=True, but targets1 and targets2 are not the same. For example:


train_dl1 = torch.utils.data.DataLoader(train_ds1, batch_size=8, 
                                       shuffle=True, num_workers=8)
train_dl2 = torch.utils.data.DataLoader(train_ds2, batch_size=8, 
                                       shuffle=True, num_workers=8)
inputs1, targets1 = next(iter(train_dl1))
inputs2, targets2 = next(iter(train_dl2))

but

targets1
tensor([ 1,  1,  0,  1,  0,  0,  1,  1])

targets2
tensor([ 0,  0,  0,  0,  0,  0,  0,  1])

I want to get targets1 and targets2 in the same order, with the same shuffle. Do you have any idea, and can you help me?

justusschock · May 20, 2018, 2:01pm

I guess you could use ConcatDataset for this.

Sorry I’m currently typing from my mobile phone and thus I am unable to provide some sample code

mostafaaminnaji · May 20, 2018, 6:30pm

@justusschock Dear Justus, thanks for your response.
Yes, I know what “ConcatDataset” is, but it mixes all the datasets randomly.
Actually, I want to train two networks, each on its own dataset, and finally feed their results to another network (for ensemble learning), while keeping the shuffle.
In other words, I want “targets1” and “targets2” to be the same.
Thanks again for your answer, my friend.

train_dl1 = torch.utils.data.DataLoader(train_ds1, batch_size=8,
                                        shuffle=True, num_workers=8)
train_dl2 = torch.utils.data.DataLoader(train_ds2, batch_size=8,
                                        shuffle=True, num_workers=8)
inputs1, targets1 = next(iter(train_dl1))
inputs2, targets2 = next(iter(train_dl2))
targets1
tensor([ 1,  1,  0,  1,  0,  0,  1,  1])
targets2
tensor([ 0,  0,  0,  0,  0,  0,  0,  1])

ptrblck · May 20, 2018, 8:04pm

Are both datasets identical or different?
If they are identical and you would like to sample in a random but defined order, you could use a SubsetRandomSampler, where you can define your own indices, for both DataLoaders.
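
One detail worth noting: SubsetRandomSampler permutes the indices it is given, so passing the same index list to two samplers is not enough on its own. A minimal sketch, assuming both datasets have the same length and that each sampler gets its own generator started from the same seed (the toy TensorDatasets and the make_loader helper are illustrative, not from the thread):

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy stand-ins for train_ds1 / train_ds2: different inputs, and a unique id
# per sample as the "target" so that the sampling order is visible.
ids = torch.arange(32)
train_ds1 = TensorDataset(torch.randn(32, 3), ids)
train_ds2 = TensorDataset(torch.randn(32, 5), ids)

indices = list(range(len(train_ds1)))


def make_loader(dataset, seed):
    # Each loader owns a generator, but both start from the same seed,
    # so the two samplers draw the same permutation in every epoch
    # (as long as both loaders are iterated the same number of times).
    generator = torch.Generator().manual_seed(seed)
    sampler = SubsetRandomSampler(indices, generator=generator)
    return DataLoader(dataset, batch_size=8, sampler=sampler)


train_dl1 = make_loader(train_ds1, seed=0)
train_dl2 = make_loader(train_ds2, seed=0)

for (inputs1, targets1), (inputs2, targets2) in zip(train_dl1, train_dl2):
    assert torch.equal(targets1, targets2)  # same samples, same order
```

If one loader is ever iterated on its own, the two generators drift apart and the orders no longer match, which is why a single Dataset returning both samples is usually the more robust option.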

justusschock · May 20, 2018, 8:26pm

I’m sorry that I didn’t understand your requirements clearly.

Another approach could be to define your own Dataset whose __getitem__ method returns two tuples instead of one (one per dataset, each with data and label). This way you control how the data processing and indexing are done. I will try to provide some sample code with my mobile phone. I cannot guarantee it will work, but it should give you an idea of how to do it.

EDIT: Dataset example code:

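import os

from torch.utils.data import Dataset


class DoubleDataset(Dataset):

    def __init__(self, path_1, path_2, transforms=None):
        # Sorted so that index i refers to the same position in both folders
        # (os.listdir returns entries in arbitrary order)
        self.data_1 = sorted(os.path.join(path_1, x) for x in os.listdir(path_1) if x.endswith(".png"))
        self.data_2 = sorted(os.path.join(path_2, x) for x in os.listdir(path_2) if x.endswith(".png"))

        self.transforms = transforms

    def __getitem__(self, index):
        # custom_load_fn is a placeholder for your own loading function
        _data1 = custom_load_fn(self.data_1[index])
        _data2 = custom_load_fn(self.data_2[index])

        if self.transforms is not None:
            _data1 = self.transforms(_data1)
            _data2 = self.transforms(_data2)

        return _data1, _data2

    def __len__(self):
        # Limit the length to the smaller folder so every index is valid in both
        return min(len(self.data_1), len(self.data_2))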

This way you could train your networks like:

for data1, data2 in dataloader:
    pred1 = model1(data1[0])
    pred2 = model2(data2[0])

    optim1.zero_grad()
    loss1 = loss_fn(pred1, data1[1])
    loss1.backward()
    optim1.step()

    optim2.zero_grad()
    loss2 = loss_fn(pred2, data2[1])
    loss2.backward()
    optim2.step()

If you set shuffle=True, the DataLoader passes random indices to the dataset, and the dataset then returns the items of both datasets at the same index.
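
The same idea also works when the two datasets already exist as Dataset objects (such as train_ds1 and train_ds2 from the first post). A minimal sketch, assuming index i refers to the same sample in both; the PairedDataset name and the toy tensors are hypothetical:

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset


class PairedDataset(Dataset):
    """Hypothetical wrapper: returns the samples at the same index from two datasets."""

    def __init__(self, ds1, ds2):
        self.ds1 = ds1
        self.ds2 = ds2

    def __getitem__(self, index):
        # One random index from the DataLoader is used for both datasets
        return self.ds1[index], self.ds2[index]

    def __len__(self):
        return min(len(self.ds1), len(self.ds2))


# Toy data: a unique id per sample serves as the target
ids = torch.arange(20)
train_ds1 = TensorDataset(torch.randn(20, 3), ids)
train_ds2 = TensorDataset(torch.randn(20, 5), ids)

paired_dl = DataLoader(PairedDataset(train_ds1, train_ds2), batch_size=8, shuffle=True)

for (inputs1, targets1), (inputs2, targets2) in paired_dl:
    assert torch.equal(targets1, targets2)  # both halves come from the same indices
```

Only one DataLoader is involved here, so there is no shuffle state to keep in sync.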

qiminchen · April 24, 2019, 11:03pm

Hi,

Did you already solve this problem? If so, how? Thanks.

saba · June 23, 2020, 5:03am

Hi Ptrblck,

I want to use two different datasets simultaneously. I used

for ii, data in enumerate(trainloader):

to get my data, but now I want to get a batch from another dataloader at the same time, with the same batch size.
Is it correct to use:

images1, targets1 = next(iter(trainloader))
images2, targets2 = next(iter(trainloaderNeg))


ptrblck · June 23, 2020, 5:11am

The last two lines of code would recreate the iterator every time they run, so each call would start a new pass over the DataLoader. You should split the calls into:

data_iter1 = iter(trainloader)
data_iter2 = iter(trainloaderNeg)

images1, targets1 = next(data_iter1)
images2, targets2 = next(data_iter2)
...

Note that you would have to catch the StopIteration once the iterators are empty.
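
A minimal sketch of one way to handle this, assuming the second loader is shorter and should simply restart when it runs out (the toy loaders below stand in for trainloader and trainloaderNeg):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: 5 batches of positives, 3 batches of negatives
trainloader = DataLoader(
    TensorDataset(torch.randn(40, 1, 21, 21), torch.ones(40)), batch_size=8)
trainloaderNeg = DataLoader(
    TensorDataset(torch.randn(24, 1, 21, 21), torch.zeros(24)), batch_size=8)

data_iter2 = iter(trainloaderNeg)  # created once, outside the loop
num_steps = 0

for images1, targets1 in trainloader:
    try:
        images2, targets2 = next(data_iter2)
    except StopIteration:
        # The second loader is exhausted: start a new pass over it
        data_iter2 = iter(trainloaderNeg)
        images2, targets2 = next(data_iter2)

    # ... training step using both batches goes here ...
    num_steps += 1
```

If both loaders should stop together at the end of the shorter one, `for batch1, batch2 in zip(trainloader, trainloaderNeg)` is a simpler alternative.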

saba · June 23, 2020, 6:27am

Many thanks for your help.
Sorry, one more question: if I want to multiply two batches (a and b), each of size (64, 1, 21, 21), does torch.mul(a, b) work properly?

ptrblck · June 23, 2020, 6:46am

Which error are you getting and what are the shapes of a and b?

saba · June 24, 2020, 12:31am

It did not give me any error; I just want to know whether this way of multiplying is safe.
a is a batch of positive patches of size 21x21, and b is a batch of negative patches of the same size.

ptrblck · June 24, 2020, 4:23am

I’m not familiar with your use case, but

a = torch.randn(6, 1, 21, 21)
b = torch.randn(6, 1, 21, 21) 
torch.mul(a, b)

would perform an element-wise multiplication.
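
A quick check along these lines, using the shapes from the question, can confirm the behavior. The last part is an added caveat rather than something raised in the thread: when the shapes differ but are broadcastable, torch.mul does not raise an error, so a shape check is worthwhile.

```python
import torch

a = torch.randn(64, 1, 21, 21)  # batch of positive patches
b = torch.randn(64, 1, 21, 21)  # batch of negative patches

out = torch.mul(a, b)

# Same shape as the inputs, and identical to the * operator
assert out.shape == a.shape
assert torch.equal(out, a * b)

# Each output element is the product of the two elements at the same position
assert out[3, 0, 5, 7] == a[3, 0, 5, 7] * b[3, 0, 5, 7]

# Caveat: a tensor with a broadcastable shape is silently expanded, no error is raised
c = torch.randn(1, 1, 21, 21)
broadcast_out = torch.mul(a, c)  # c is reused for every sample in the batch
assert broadcast_out.shape == a.shape
```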

saba · July 3, 2020, 4:37am

Hi Ptrblck,

I hope you are well. For my GAN, I need to compute the sliced Wasserstein distance (SWD). I found this link (https://github.com/koshian2/swd-pytorch/blob/master/README.md).
Do you think the code is reliable to use?

I would appreciate it if you could suggest a link for the FID score too. I found another link, but I am not sure whether it is reliable.

Cheers
Saba

ptrblck · July 3, 2020, 7:04am

Unfortunately, I haven’t used this repository, but would recommend to just try it out, run some quick tests, and make sure your dummy examples return the expected outputs.

saba · July 5, 2020, 2:00am

Sorry, one more question about using the FID score for GANs. Do you think it is good enough to implement the metric with NumPy and compute the FID score?

Or is the approach only worthwhile when we use the pre-trained Inception model?

Ikram_Hattab · March 29, 2021, 9:10pm

Hi Guys,

I want to train my model on two datasets (RGB and thermal images), and I want to pick batches in the same order with shuffle=True.

I already have a function create_dataloader:

def create_dataloader(path, imgsz, batch_size, stride, opt, hyp=None, augment=False, cache=False, pad=0.0, rect=False,
                      rank=-1, world_size=1, workers=8, image_weights=False, quad=False, prefix='', shuffle=True):
    # Make sure only the first process in DDP process the dataset first, and the following others can use the cache
    with torch_distributed_zero_first(rank):
        dataset = LoadImagesAndLabels(path, imgsz, batch_size,
                                      augment=augment,  # augment images
                                      hyp=hyp,  # augmentation hyperparameters
                                      rect=rect,  # rectangular training
                                      cache_images=cache,
                                      single_cls=opt.single_cls,
                                      stride=int(stride),
                                      pad=pad,
                                      image_weights=image_weights,
                                      prefix=prefix,
                                      shuffle=shuffle)

    batch_size = min(batch_size, len(dataset))
    nw = min([os.cpu_count() // world_size, batch_size if batch_size > 1 else 0, workers])  # number of workers
    sampler = torch.utils.data.distributed.DistributedSampler(dataset) if rank != -1 else None
    loader = torch.utils.data.DataLoader if image_weights else InfiniteDataLoader
    # Use torch.utils.data.DataLoader() if dataset.properties will update during training else InfiniteDataLoader()
    dataloader = loader(dataset,
                        batch_size=batch_size,
                        num_workers=nw,
                        sampler=sampler,
                        pin_memory=True,
                        collate_fn=LoadImagesAndLabels.collate_fn4 if quad else LoadImagesAndLabels.collate_fn)
    return dataloader, dataset

I’m trying to create two data loaders, one for each dataset, like this:

# Trainloader
dataloader1, dataset1 = create_dataloader("../Flir", imgsz, batch_size, gs, opt,
                                          hyp=hyp, augment=True, cache=opt.cache_images, rect=opt.rect, rank=rank,
                                          world_size=opt.world_size, workers=opt.workers,
                                          image_weights=opt.image_weights, quad=opt.quad, prefix=colorstr('train: '), shuffle=True)

dataloader2, dataset2 = create_dataloader("../FlirRGB", imgsz, batch_size, gs, opt,
                                          hyp=hyp, augment=True, cache=opt.cache_images, rect=opt.rect, rank=rank,
                                          world_size=opt.world_size, workers=opt.workers,
                                          image_weights=opt.image_weights, quad=opt.quad, prefix=colorstr('train: '), shuffle=True)

But I haven’t found a working solution.
Can you help me, please?

ptrblck · March 30, 2021, 7:29am

I think the proper approach would be to write a custom Dataset (e.g. as given in this tutorial) and load the images from both folders simultaneously. This would make sure to get the desired image pairs in the same order without trying to use the seeds etc.
Let me know, if this approach would work.
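
A minimal sketch of such a Dataset, assuming the RGB and thermal images share file names across the two folders. The PairedFolderDataset class, its load_fn parameter, and the temporary folders in the demo are all hypothetical; real code would pass an actual image loader and also return the labels.

```python
import tempfile
from pathlib import Path

from torch.utils.data import Dataset


class PairedFolderDataset(Dataset):
    """Hypothetical sketch: pairs files from two folders by shared file name."""

    def __init__(self, root_a, root_b, suffix=".jpg", load_fn=None):
        self.root_a = Path(root_a)
        self.root_b = Path(root_b)
        names_a = {p.name for p in self.root_a.glob(f"*{suffix}")}
        names_b = {p.name for p in self.root_b.glob(f"*{suffix}")}
        # Keep only names present in both folders; sort for a fixed, reproducible order
        self.names = sorted(names_a & names_b)
        # load_fn is a placeholder for your image loader; by default the path is returned
        self.load_fn = load_fn if load_fn is not None else str

    def __getitem__(self, index):
        name = self.names[index]
        return self.load_fn(self.root_a / name), self.load_fn(self.root_b / name)

    def __len__(self):
        return len(self.names)


# Demo with empty placeholder files instead of real images
with tempfile.TemporaryDirectory() as tmp:
    thermal, rgb = Path(tmp, "thermal"), Path(tmp, "rgb")
    thermal.mkdir()
    rgb.mkdir()
    for name in ["b.jpg", "a.jpg", "c.jpg"]:
        (thermal / name).touch()
    for name in ["c.jpg", "a.jpg"]:
        (rgb / name).touch()

    pairs = PairedFolderDataset(thermal, rgb)
    pair_names = [tuple(Path(p).name for p in pairs[i]) for i in range(len(pairs))]
```

Because each index yields one matched pair, a single DataLoader with shuffle=True keeps the RGB and thermal images aligned without coordinating seeds.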

Ikram_Hattab · April 5, 2021, 10:22am

Not yet.
I have a problem with the datasets.

I split my dataset into train, val and test, and I have 3716 images for training.
But when I train on my dataset “FlirRGB” (!python train.py --img 640 --batch 10 --epochs 12 --data FlirRGB.yaml --weights yolov3.pt), the system finds only 2112 images (and 0 missing).

My images are already in jpg format.

How can I solve this problem, please?

ptrblck · April 6, 2021, 6:38am

I’m not sure how you’ve created the Dataset, but I assume you are using a custom implementation.
If that’s the case, note that the __len__ method of the Dataset returns the number of samples, which can be drawn from this dataset.
If you expect the train_dataset to have a length of 3716 while only 2112 samples are returned, could you check the output of print(len(train_dataset)) and see what the __len__ method is returning (i.e. how the length is calculated)?

Ikram_Hattab · April 11, 2021, 8:11pm

I sorted the contents of trainRGB.txt and it worked.