Doubling the dataset size results in ~15x the training time

Hi forum,

I am experiencing a very peculiar issue and I was wondering if it has happened to anyone else.

I am working on an instance segmentation task using the DAVIS2017 dataset. I have pre-computed features offline with a pre-trained network and stored them in .pt files on disk. There are two variants of the dataset: the original one, and a mirrored one (i.e. data augmentation). I have pre-computed features for both.

Here is the odd thing: without changing anything except the path to the data I am loading (original vs. original + mirrored), training with the original dataset's pre-computed features takes about 200-300 seconds per epoch. If I load the augmented dataset instead (2x the size of the original), every epoch takes 4000-6000 seconds. I am using the same data loader and the same training loop; the only thing I change is the path to the files.

Here is part of my data loader:

import torch
from torch.utils.data import Dataset, DataLoader


class DAVIS2017Dataset(Dataset):
    def __init__(self, data_path, label_path, data_augmentation=False):
        self.data_augmentation = data_augmentation
        self.feature_paths = self._get_pre_computed_data(data_path)  # This just returns a list of paths to the features.
        self.label_paths = self._get_labels(label_path)
        assert len(self.feature_paths) == len(self.label_paths)
        self._n = len(self.feature_paths)

    def __len__(self):
        return self._n

    def __getitem__(self, idx):
        # Every sample is read from disk (torch.load) on each access.
        features = torch.load(self.feature_paths[idx])
        labels = torch.load(self.label_paths[idx])
        return {'features': features, 'labels': labels}


def get_DAVIS2017_loader(data_path, label_path, data_augmentation=False):
    dataset = DAVIS2017Dataset(data_path, label_path, data_augmentation)
    # num_workers is left at its default of 0, so all loading happens in the main process.
    loader = DataLoader(dataset, batch_size=16, pin_memory=True)
    return loader

My training loop is standard; I call .to(device) only after I get the batch from the data loader.
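
For context, it looks roughly like the following (just a sketch; model, criterion, optimizer, device and num_epochs are defined elsewhere and not shown):

for epoch in range(num_epochs):
    for batch in loader:
        # The batch is moved to the GPU only after the loader returns it.
        features = batch['features'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model(features)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()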

Any insights are appreciated; I’ve been trying to debug this for days now.

Are both datasets stored on the same device?
If so, could you remove the training code and just time the data loading, or are the results already from a similar experiment?
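
Something like the following (just a sketch, reusing your get_DAVIS2017_loader and your two path variables) would isolate the loading time:

import time

loader = get_DAVIS2017_loader(data_path, label_path)

start = time.perf_counter()
for batch in loader:
    pass  # no training, just iterate to measure loading + collate time
print(f"Full pass over the dataset: {time.perf_counter() - start:.1f}s")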

Thank you for your reply, Piotr.

Both datasets were indeed on the same device. I haven’t tried removing the training code and timing the data loader alone, but here is what I did:

  • I added logs around every operation (data loading, training step) with timestamps and elapsed time; see the sketch after this list.
  • I also used the profiler from torch.utils.bottleneck, which is a magnificent tool.
  • I made sure I was not keeping the computation graph alive (e.g. by storing only the loss value via loss.item(), not the loss tensor).
  • I searched around this forum extensively.
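
The timing instrumentation was nothing fancy, roughly along these lines (variable names are illustrative; since GPU calls are asynchronous, the per-step numbers are only approximate without torch.cuda.synchronize()):

import time

for epoch in range(num_epochs):
    t0 = time.perf_counter()
    for batch in loader:
        t_load = time.perf_counter() - t0

        features = batch['features'].to(device)
        labels = batch['labels'].to(device)
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()

        # Only the scalar loss.item() is logged, so no computation graph is kept alive.
        t_train = time.perf_counter() - t0 - t_load
        print(f"load: {t_load:.3f}s  train: {t_train:.3f}s  loss: {loss.item():.4f}")
        t0 = time.perf_counter()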

This investigation more or less pointed me to a problem with the storage device. Indeed, it turned out to be an HDD and not an SSD. The consensus approach here is:

  • Increase the num_workers in the data loader.
  • Use an SSD.

Which I did, and it fixed the issue, in the sense that the model now trains faster and the epoch time scales proportionally with the dataset size.
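
For reference, the change on the code side was roughly the following (the exact num_workers value depends on the machine):

def get_DAVIS2017_loader(data_path, label_path, data_augmentation=False):
    dataset = DAVIS2017Dataset(data_path, label_path, data_augmentation)
    # Worker processes load and deserialize the .pt files in the background,
    # so the training loop is not left waiting on disk reads.
    loader = DataLoader(dataset, batch_size=16, num_workers=4, pin_memory=True)
    return loader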

I still don’t understand why this is a fix, and why the time gap was so disproportionate when I was using the HDD. But at least I can continue with my experiments!

Cheers.

As far as I know, HDDs suffer a lot from random reads, so the mirrored dataset might have been stored in a “more scattered way”?
As you can see, I’m not deeply familiar with how data is laid out on the disk, but I would highly recommend using SSDs :wink: