DataLoader and parallelism

I have created a class that extends Dataset to load images for a segmentation task, so there is one input (the image) and one output (the mask). Every time __getitem__ is called, this class performs the necessary data-augmentation operations on both the input and the output, and it works perfectly.
However, when I use this class with the PyTorch DataLoader, the input transformations no longer match the output transformations. My bet is that to perform the same operations I have to get/set the state of the random number generator, and that the DataLoader interferes with it.
How can I fix it?

Can you share the code? The DataLoader in PyTorch works like this: in your dataset you define __len__ and self.transform, and you create a class for each transformation with a __call__ method. The DataLoader takes the length of the dataset, creates batches of indices, calls __getitem__ to get the actual item (e.g. read the image), and applies all the transformations you specified to that item. Hope this clears things up a bit.
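To illustrate, a transformation implemented as a class with a __call__ method might look like this minimal sketch (the Rescale name and its logic are just an example, not a torchvision class; it operates on plain lists to stay self-contained):

```python
class Rescale:
    """Example callable transform: rescale pixel values to [0, 1]."""

    def __init__(self, max_value=255.0):
        self.max_value = max_value

    def __call__(self, sample):
        # called by the dataset as self.transform(sample)
        return [v / self.max_value for v in sample]


transform = Rescale()
print(transform([0, 255]))  # [0.0, 1.0]
```

Because the transform is an object, its parameters are fixed at construction time, and the dataset only has to call it inside __getitem__.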

Thank you for your answer.
Is there no way to prevent what is happening?
So I must give the DataLoader the transformations built as classes?

I think the “standard” way, i.e. the most common pattern, is to pass transform and target_transform parameters to your custom dataset, and apply those transformations in the __getitem__ method.

The implementation of some standard torchvision datasets can give you a better notion, but this would be the general idea:

from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, path, transform=None, target_transform=None):
        # discover_samples_from_path is a placeholder for your own
        # logic that builds a list of (image, target) samples
        self.samples = discover_samples_from_path(path)
        self.transform = transform
        self.target_transform = target_transform
        # do something here

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        img, target = self.samples[index]
        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target

From there, you would normally just instantiate your dataset, instantiate a dataloader with it, and start rocking. The transformations will be applied every time you get an instance from the dataset, without needing to pass any of them to the DataLoader.

my_dataset = CustomDataset('./path/to/data/', my_transforms['data'], my_transforms['target'])
my_dataloader = DataLoader(my_dataset, batch_size=2, num_workers=2, ...)

for i, (img, target) in enumerate(my_dataloader):
    # do something here
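Note that this pattern applies transform and target_transform independently, so two random transforms would draw different parameters for image and mask. For segmentation, one common workaround is a joint transform that draws its random parameters once and applies the same operation to both. A hedged sketch, using nested lists as stand-ins for image tensors (replace the flip with your actual tensor ops):

```python
import random

class RandomHorizontalFlipPair:
    """Flip image and mask together with probability p."""

    def __init__(self, p=0.5, rng=None):
        self.p = p
        # a local RNG instance, independent of the global random state
        self.rng = rng if rng is not None else random.Random()

    def __call__(self, img, target):
        if self.rng.random() < self.p:  # one draw decides for both
            img = [row[::-1] for row in img]
            target = [row[::-1] for row in target]
        return img, target


flip = RandomHorizontalFlipPair(p=1.0)
img, mask = flip([[1, 2]], [[3, 4]])
print(img, mask)  # [[2, 1]] [[4, 3]] — flipped together
```

In __getitem__ you would then call one paired transform instead of two separate ones, which sidesteps the state-matching problem entirely.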

Thanks a lot for your answer.
When I look at your code, I realize that it’s what I already do; the difference is that I call individual functions/methods for the transformations, not callable classes. So I guess that, by being instantiated in the class, the random state will be get/set locally and therefore not affected if the DataLoader does the same.
I am going to try it, thanks.

Thanks for your help.
A simple solution is to create a local instance of all the random classes used.
A more complete solution is to wrap all the transformations in classes, as you suggested.
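The get/set-state idea mentioned in the original question can also be sketched directly: capture the RNG state before transforming the input, restore it before transforming the target, so both transforms see identical random draws. A minimal sketch with Python's random module (paired_transform and the noisy lambda are illustrative names, not library APIs):

```python
import random

def paired_transform(img, target, transform, rng=random):
    """Apply `transform` to img and target with the same random draws."""
    state = rng.getstate()
    img = transform(img)
    rng.setstate(state)  # replay the exact same random sequence
    target = transform(target)
    return img, target


noisy = lambda xs: [x + random.random() for x in xs]
a, b = paired_transform([0.0, 0.0], [0.0, 0.0], noisy)
# a == b: both calls to `noisy` saw the same RNG state
```

The same save/restore trick works with torch.get_rng_state()/torch.set_rng_state() when the transforms draw from PyTorch's RNG, though the joint-transform approach is usually easier to maintain.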