Speed up the training of DL models by pre-transformation

If you have ever trained a model, you may have seen the DataLoader (e.g. I am using the PyTorch DataLoader) become the bottleneck, because every time you fetch a training sample from the dataset, the data transformation is performed on the fly. Here I take the `__getitem__` function of `DatasetFolder` from `torchvision.datasets` as an example:

    def __getitem__(self, index):
        path, target = self.samples[index]
        sample = self.loader(path)  # decodes the image from disk on every access
        if self.transform is not None:
            sample = self.transform(sample)  # transforms run on the fly here
        if self.target_transform is not None:
            target = self.target_transform(target)

        return sample, target

I wonder whether we can pre-process the images (e.g. ImageNet) to tensors in advance and save them to disk. Then we modify the `__getitem__` function to load these tensors directly from disk. How efficient is this approach? Has anyone tried this solution before?
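Something like this sketch is what I have in mind (the `TensorFolderDataset` name and the one-`.pt`-file-per-sample layout are my own assumptions, not an existing API):

```python
# Sketch: transform each image once offline, save (sample, target) pairs
# with torch.save, then make __getitem__ a plain deserialization.
import os
import torch
from torch.utils.data import Dataset

class TensorFolderDataset(Dataset):
    """Loads pre-transformed tensors saved as .pt files (one per sample)."""

    def __init__(self, root):
        self.files = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        # No decoding or transform here -- just load the saved tensor pair.
        sample, target = torch.load(self.files[index])
        return sample, target
```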

I suspect that loading from disk will be a burden and likely become the new bottleneck (instead of the data transform we had before). Another concern is size: for example, one ImageNet image takes 74 MB when saved as a tensor after the standard transformation:

                transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
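For intuition on why the saved tensors get so large: an uncompressed float32 tensor costs 4 bytes per value, so the size is simply H × W × 3 × 4 bytes (the image dimensions below are illustrative, not taken from any specific ImageNet file):

```python
# Size of an uncompressed H x W x C float32 tensor in bytes.
def float32_tensor_bytes(height, width, channels=3):
    return height * width * channels * 4  # 4 bytes per float32 value

# A 224 x 224 crop is modest...
print(float32_tensor_bytes(224, 224) / 1e6)    # ~0.6 MB
# ...but a large original, e.g. around 2500 x 2500 pixels, blows up:
print(float32_tensor_bytes(2500, 2500) / 1e6)  # 75.0 MB
```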

I think you are spot-on here. In my experience, decoding image files is often much faster than loading uncompressed images from disk.

That said, you might look into pre-resizing to a common size (maybe a bit larger than the final one), saving the results as images, and doing just ToTensor in the dataset and anything else on the GPU. That can give great speedups.
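A sketch of the offline pre-resize step, assuming PIL and a shorter-side-to-256 convention (both the helper name and the target size are my choices for illustration):

```python
# Resize the shorter side to `size` (a bit larger than the final crop)
# once, offline, and re-save as a compressed image -- not a raw tensor.
from PIL import Image

def pre_resize(src_path, dst_path, size=256):
    img = Image.open(src_path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)  # scale so the shorter side becomes `size`
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    img.save(dst_path, quality=90)  # stays compressed on disk
```

After this, the dataset only has to decode an already-small image and call ToTensor.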

Best regards



Hi Thomas. It would be great if you could give me an example of doing "anything else on the GPU". Did you mean anything in the transformation? I suppose you are referring to the normalization, right?

So in this case, as you suggested, the standard data transform will only contain ToTensor, and Normalize will be called later, once we move the tensors to the GPU?
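For concreteness, a sketch of what that could look like in the training loop (my own interpretation, using the standard ImageNet statistics; the function name is made up):

```python
# Sketch: keep only ToTensor in the Dataset; normalize per batch on the GPU.
import torch

# Standard ImageNet channel statistics, shaped for (N, C, H, W) broadcasting.
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def normalize_on_device(batch, device):
    # Move the batch first, then do the arithmetic on the target device.
    batch = batch.to(device, non_blocking=True)
    return (batch - mean.to(device)) / std.to(device)
```

In the loop you would call `normalize_on_device(images, "cuda")` right after fetching a batch from the DataLoader.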

So it also does other things, but the grapevine livedemo notebooks from this repo take a model and tune the dataloading:

There we end up loading the dataset into memory, which in general cannot be expected to work, but other considerations might apply to your case as well.
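The in-memory idea can be sketched as a caching wrapper around an existing dataset (the `InMemoryCache` class is my own illustration, not from the notebooks; note that with `num_workers > 0` each worker keeps its own copy of the cache):

```python
# Sketch: cache decoded samples in RAM on first access. Only viable when
# the whole dataset actually fits in memory.
from torch.utils.data import Dataset

class InMemoryCache(Dataset):
    def __init__(self, base_dataset):
        self.base = base_dataset
        self.cache = {}

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        if index not in self.cache:
            # First access pays the full decode/transform cost...
            self.cache[index] = self.base[index]
        # ...subsequent epochs read straight from RAM.
        return self.cache[index]
```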

Best regards

