Passing dataset through random_split required for training to converge

Hi, I am running into a slightly odd problem when using a DataLoader (wrapped in a PyTorch Lightning DataModule). I’m trying to train a VGG network on the TinyImageNet dataset. I have reorganized the validation set to have the same directory structure as the training set. If I load my dataset like this:

def setup(self, stage=None):
    if stage == "fit" or stage is None:
        t = transforms.Compose(self.augment + self.normalize)
        self.df_train = datasets.ImageFolder(os.path.join(self.data_dir, 'train'), transform=t)
        t = transforms.Compose(self.normalize)
        self.df_val = datasets.ImageFolder(os.path.join(self.data_dir, 'val'), transform=t)

the training does not converge: the loss jumps to a high value during the first epoch and stays there, and the validation accuracy remains at chance level the entire time. However, if I change it to:

def setup(self, stage=None):
    if stage == "fit" or stage is None:
        t = transforms.Compose(self.augment + self.normalize)
        ds_full = datasets.ImageFolder(os.path.join(self.data_dir, 'train'), transform=t)

        ### Pass training dataset through random_split (this should be a no-op, no!?)
        self.df_train, _ = td.random_split(ds_full, [100000, 0])

        t = transforms.Compose(self.normalize)
        self.df_val = datasets.ImageFolder(os.path.join(self.data_dir, 'val'), transform=t)

(with no other changes to the code, hyperparameters, etc.), both the training and validation loss go down and the accuracies go up as I would expect. Versions that I am using:

>>> import torch
>>> torch.__version__
'1.9.1+cu102'
>>> import pytorch_lightning as pl
>>> pl.__version__
'1.4.9'

Does anyone have any idea what is going wrong here? If this is a genuine issue, I’m happy to file a bug report. I just want to rule out the possibility that I am doing something wrong.

I assume your ds_full contains 100000 samples, so you are essentially applying random_split to ds_full just to get df_train back.
In that case, random_split shuffles the indices, which can make a difference for model convergence if you are not shuffling in the DataLoader.
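As a quick illustration (a toy dataset, not your actual setup), the Subset returned by random_split holds a permuted index list even when the second split has length 0, so it is not a no-op:

import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset of 10 samples, just to show the index permutation
ds = TensorDataset(torch.arange(10))
subset, _ = random_split(ds, [10, 0])

print(subset.indices)  # e.g. [3, 7, 0, ...] -- a random permutation, not 0..9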

Yes, you are right, 100000 is the size of the dataset, and I did not think of setting shuffle=True in the DataLoader. That certainly explains it. Thank you very much!
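For anyone finding this later, the actual fix was to enable shuffling in the training DataLoader rather than relying on random_split. Roughly (a sketch, assuming a batch_size attribute on my DataModule):

from torch.utils.data import DataLoader

def train_dataloader(self):
    # Shuffling here is what I was missing; random_split only masked the problem
    return DataLoader(self.df_train, batch_size=self.batch_size, shuffle=True)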