Dynamic Dataloaders for on the fly modifications

kpprasa · April 28, 2020, 7:50pm

Hello,
I am working on a project where we are trying to modify the data every n epochs. I have two questions related to this:

1. Can we use a single DataLoader and Dataset to do this? For example, every 5 epochs, jitter the images in the training set.
2. Would it be better from a computational standpoint to perform these custom transforms in a modified Dataset class, or in the training loop itself after getting the natural images?

Thank you,

ptrblck · April 29, 2020, 3:26am

If your use case allows you to manipulate the dataset after a full epoch, I would recommend performing the manipulations directly on the data inside the Dataset and just recreating the DataLoader.
Instantiating a DataLoader should be cheap, so you shouldn’t see any slowdown.

Also, manipulating the data on the fly inside a DataLoader loop might not work if you are using multiple workers, so you would be forced to use num_workers=0 or some shared-memory approach.
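
A minimal sketch of the "modify the Dataset, then recreate the DataLoader" pattern, assuming an in-memory tensor dataset and additive Gaussian noise as a stand-in for "jitter". The class name JitterableDataset, its jitter method, and the noise scale are hypothetical and not from this thread:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class JitterableDataset(Dataset):
    """Hypothetical in-memory dataset whose images can be modified between epochs."""

    def __init__(self, images, labels):
        self.images = images.clone()
        self.labels = labels

    def jitter(self, scale=0.05):
        # Stand-in for "jitter": add small Gaussian noise and keep values in [0, 1].
        noise = scale * torch.randn_like(self.images)
        self.images = torch.clamp(self.images + noise, 0.0, 1.0)

    def __getitem__(self, index):
        return self.images[index], self.labels[index]

    def __len__(self):
        return len(self.images)


torch.manual_seed(0)
dataset = JitterableDataset(torch.rand(16, 3, 8, 8), torch.randint(0, 2, (16,)))
original = dataset.images.clone()

num_epochs, jitter_every = 10, 5
for epoch in range(num_epochs):
    if epoch > 0 and epoch % jitter_every == 0:
        dataset.jitter()  # modify the data in the main process, between epochs
    # Recreate the loader each epoch so it is built from the current dataset state.
    # num_workers=0 keeps the sketch runnable anywhere; it is not required here.
    loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)
    for images, labels in loader:
        pass  # the training step would go here
```

Because the change happens between epochs, before the new loader is created, it does not depend on how many workers the loader uses.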

kpprasa · April 29, 2020, 12:30pm

Ah great, thanks! Just as a followup, the data loading is what is distributed, right? So if I modify the data in a lazy format just prior to running the model on it, then there is no need to worry about the number of worker nodes if I don’t want to save the new data, right?

ptrblck · April 30, 2020, 2:27am

In a distributed setup using DistributedDataParallel, you would use a single process for each device.
Each process would then take care of its data loading.
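
As a rough illustration of each process handling its own data loading: a common way to do this with DistributedDataParallel is DistributedSampler, which gives every rank its own slice of the indices. The sketch below passes num_replicas and rank explicitly so it runs in a single process without a process group; in real training each rank would be a separate process. The toy dataset and world_size value are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))
world_size = 2  # assumed number of processes/devices

seen = {}
for rank in range(world_size):  # in real DDP, each rank runs in its own process
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)
    # Record which sample values this rank would load.
    seen[rank] = [int(x) for (batch,) in loader for x in batch]
```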

Could you give an example of your use case and how it could interact in a bad way?

kpprasa · April 30, 2020, 2:22pm

Ah thank you,
I think I miscommunicated. We are not moving to a truly distributed setup. I was more curious as to the distribution of the data loading when we specify num_workers amongst the workers. But I took a look at the source code and think I understand.

Your first post helped a lot and we have the dynamic dataloader working.
Thank You!

jasg · May 1, 2020, 5:56pm

I don’t understand why num_workers=0 improves performance.

kpprasa · May 1, 2020, 9:20pm

It’s not actually improving performance; it allows modification of the dataset behind a DataLoader while training. Usually, the data loading is done by multiple worker processes, each with its own copy of the dataset, so changes made in the main process wouldn’t be reflected across all workers. If num_workers=0, the data is loaded in the main process and we do not have this issue.
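
A small sketch of the num_workers=0 case: batches are fetched lazily in the main process, so a change made to the dataset in the middle of the loop shows up in the following batches. The OffsetDataset class and its offset attribute are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class OffsetDataset(Dataset):
    """Hypothetical dataset that adds a mutable offset to every sample."""

    def __init__(self, n):
        self.values = torch.arange(n, dtype=torch.float32)
        self.offset = 0.0

    def __getitem__(self, index):
        return self.values[index] + self.offset

    def __len__(self):
        return len(self.values)


dataset = OffsetDataset(6)
loader = DataLoader(dataset, batch_size=2, shuffle=False, num_workers=0)

batches = []
for i, batch in enumerate(loader):
    batches.append(batch.tolist())
    if i == 0:
        dataset.offset = 100.0  # modify the dataset in the middle of the epoch
```

Here the first batch is loaded with the old offset and the later ones with the new offset. With num_workers>0, each worker would work on its own copy of the dataset, so the same in-loop change would not reliably reach it.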

kpprasa · May 1, 2020, 9:33pm

Note for anyone trying to do this in the future:
If you keep the transforms in a separate member variable of the dataset, you can change that variable later and just modify the __getitem__ and __len__ methods to handle the expanded length and index range.

Here’s a simple example:

import torch
from torch.utils.data import Dataset


class Dynamic_dataset(Dataset):
    def __init__(self, data, targets):
        # `targets` is passed in here so that self.targets (used below) is defined.
        self.data = data
        self.targets = targets
        self.orig_train_size = len(self.data)
        # One perturbation per extra sample; assumed to be what the original
        # snippet referred to as self.perturbations.
        self.transforms = []

    def add_transforms(self, transforms):
        self.transforms = transforms

    def __getitem__(self, index):
        """
        Args:
            index (int): Index
        Returns:
            tuple: (image, label) where label is the class index of the target class.
        """
        if index >= self.orig_train_size:
            # Indices past the original data: original image plus a perturbation.
            base_index = index % self.orig_train_size
            pert_index = index - self.orig_train_size
            image = torch.as_tensor(self.data[base_index], dtype=torch.float32)
            label = self.targets[base_index]
            image = torch.clamp(image + self.transforms[pert_index], 0.0, 1.0)
        else:
            # Indices within the original data: return the sample unchanged.
            image = torch.as_tensor(self.data[index], dtype=torch.float32)
            label = self.targets[index]

        return image, label

    def __len__(self):
        return self.orig_train_size + len(self.transforms)

You can access the method by
train_loader.dataset.add_transforms(transforms) (once you define your train_loader…)
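
A stripped-down sketch of how the expanded length plays out, assuming the default sampler and num_workers=0: the default samplers ask the dataset for its length when they are used, so after add_transforms the same loader reports more batches. GrowingDataset is a hypothetical, simplified stand-in for the class above, and the toy tensors are made up.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class GrowingDataset(Dataset):
    """Simplified stand-in: extra indices map to an original sample plus a perturbation."""

    def __init__(self, data, targets):
        self.data, self.targets = data, targets
        self.transforms = []  # one perturbation tensor per extra sample

    def add_transforms(self, transforms):
        self.transforms = transforms

    def __getitem__(self, index):
        n = len(self.data)
        image = self.data[index % n]
        if index >= n:
            image = torch.clamp(image + self.transforms[index - n], 0.0, 1.0)
        return image, self.targets[index % n]

    def __len__(self):
        return len(self.data) + len(self.transforms)


data, targets = torch.rand(4, 3), torch.tensor([0, 1, 0, 1])
train_loader = DataLoader(GrowingDataset(data, targets), batch_size=2, num_workers=0)
batches_before = len(train_loader)

# Two extra samples: perturbed copies of samples 0 and 1.
train_loader.dataset.add_transforms([0.1 * torch.ones(3), -0.1 * torch.ones(3)])
batches_after = len(train_loader)
samples_seen = sum(len(labels) for _, labels in train_loader)
```

With a custom sampler that caches the dataset length, this would likely not hold, and recreating the DataLoader would be the safer route.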

jasg · May 5, 2020, 6:43pm

I don’t understand how I would adapt that to my code. Could you please help me?

import os

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# `root` (dataset directory) and `image` (target size) are defined elsewhere.
train_transform = transforms.Compose([
    transforms.RandomRotation(10),      # rotate +/- 10 degrees
    transforms.RandomHorizontalFlip(),  # reverse 50% of images
    transforms.Resize(image),           # resize shortest side to 224 pixels
    transforms.CenterCrop(image),       # crop longest side to 224 pixels at center
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

test_transform = transforms.Compose([
    transforms.Resize(image),
    transforms.CenterCrop(image),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

# Inverse of the normalization above, e.g. for displaying images.
inv_normalize = transforms.Normalize(
    mean=[-0.485/0.229, -0.456/0.224, -0.406/0.225],
    std=[1/0.229, 1/0.224, 1/0.225]
)

print("RGB")

train_data = datasets.ImageFolder(os.path.join(root, 'train_real'), transform=train_transform)
test_data = datasets.ImageFolder(os.path.join(root, 'validation'), transform=test_transform)

torch.manual_seed(42)

train_loader = DataLoader(train_data, batch_size=10, num_workers=2, pin_memory=False, shuffle=True)
test_loader = DataLoader(test_data, batch_size=10, num_workers=2, pin_memory=False, shuffle=True)

# get the labels (classes) of the dataset
class_names = train_data.classes

print(class_names)
print(f'Training images available: {len(train_data)}')
print(f'Testing images available:  {len(test_data)}')

kpprasa · May 7, 2020, 6:37pm

Do you need custom dynamic transformations? Most standard applications can just use the torchvision transforms, like the ones in your code above:
https://pytorch.org/docs/stable/torchvision/transforms.html
I am working on some research regarding adversarial robustness that required custom transformations on the fly.

If you tell us your use case, we can point you in the right direction.

Jaideep_Valani · January 10, 2021, 7:39pm

@ptrblck my requirement in this regard is to update the labels after each epoch: I consider some of the labels incorrect, so I want to correct them for the next epoch.
How can we do that?

ptrblck · January 11, 2021, 2:18am

If you are using persistent_workers=False in the DataLoader (the default setup), you should be able to manipulate the underlying Dataset after each epoch even with num_workers>0.
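
A minimal sketch of updating labels between epochs under that default setup. RelabelableDataset, labels_seen_per_epoch, and the relabeling rule are hypothetical placeholders; the point is only that the labels are changed in the main process after one pass over the loader ends and before the next one starts.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RelabelableDataset(Dataset):
    """Hypothetical dataset whose labels can be corrected between epochs."""

    def __init__(self, data, labels):
        self.data = data
        self.labels = labels.clone()

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)


def labels_seen_per_epoch(num_workers, num_epochs=2):
    dataset = RelabelableDataset(torch.rand(8, 4), torch.zeros(8, dtype=torch.long))
    loader = DataLoader(dataset, batch_size=4, shuffle=False,
                        num_workers=num_workers, persistent_workers=False)
    history = []
    for epoch in range(num_epochs):
        # Each pass over the loader creates a new iterator; with num_workers > 0
        # and persistent_workers=False, that also starts fresh workers.
        history.append(torch.cat([labels for _, labels in loader]).tolist())
        # Placeholder "correction": relabel every other sample for the next epoch.
        dataset.labels[::2] = 1
    return history
```

Calling labels_seen_per_epoch(num_workers=0) returns the original labels for the first epoch and the corrected ones for the second. Calling it with num_workers=2 exercises the multi-worker case described above; on platforms that spawn worker processes, that call should sit under an `if __name__ == "__main__":` guard.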

sqiangcao · July 25, 2022, 6:49am

@ptrblck, hi, thanks for your answer. I use data_loaders.sampler.__init__(data_loaders.dataset) to solve this problem in my case. Are there any potential bugs with this approach?