How to make __getitem__ more efficient in loading images?

Analyzing the training of my model with the PyTorch profiler, I noticed that most of the time is spent on the CPU (unfortunately, the trace function does not show me any results; the TensorBoard page remains blank).

I am fairly sure the overhead comes from data loading. The dataset consists of two folders: one containing annotation .json files and the other .jpg images. Given that the annotations can be loaded entirely into RAM while the images cannot, how can the data loading be made more efficient?

    # set up in __init__:
    self.loader = torchvision.io.read_image

    def __getitem__(self, index: int) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        actual_index = self.indices[index]
        annotation = self.annotations[actual_index]

        # the annotation filename mirrors the image filename, so swap the extension
        img_path = os.path.join(self.images_path, annotation["filename"])
        img = self.loader(img_path.replace(".json", ".jpg"))

        bboxes = annotation["bboxes"]
        mask = annotation["mask"]

        return img, bboxes, mask

I tried varying the batch size, but the GPU utilization remains almost the same (low).
I have also tried using multiple persistent workers, but then the wait time between epochs is quite long.

I wonder if it is possible to pre-fetch images from files during training, i.e. while the GPU is busy. Or, instead of using `__getitem__`, is there a method to which you can pass all the indices of a batch so the images are loaded together? In general, are there more efficient approaches?

Save the images as NumPy arrays if you have enough space. Also, if you want to go crazy, load batches instead of independent images and store them contiguously, so that each batch reads a whole section of the disk.
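
In case it helps, here is a rough sketch of that idea using a single memory-mapped array (the function names, the fixed 1080x1920 shape, and the uint8 layout are just placeholder assumptions, not tested code):

    import numpy as np
    import torch
    import torchvision

    H, W = 1080, 1920  # placeholder resolution

    def preprocess_to_memmap(image_paths, out_file):
        # one-off step: decode every JPEG once and write the raw pixels into a
        # single contiguous memory-mapped array on disk
        store = np.memmap(out_file, dtype=np.uint8, mode="w+",
                          shape=(len(image_paths), 3, H, W))
        for i, path in enumerate(image_paths):
            store[i] = torchvision.io.read_image(path).numpy()
        store.flush()

    def load_batch(out_file, num_images, start, batch_size):
        # at training time: open the file read-only and slice a run of
        # consecutive indices, which reads one contiguous chunk from disk
        store = np.memmap(out_file, dtype=np.uint8, mode="r",
                          shape=(num_images, 3, H, W))
        return torch.from_numpy(np.ascontiguousarray(store[start:start + batch_size]))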

Thank you very much for your reply! I have tried putting all the images into one HDF5 file, while still using `__getitem__` to pick them up one by one. Unfortunately, the performance remained about the same; possibly the disk is the bottleneck (it is an SSD). The images are 1080x1920, so their size might be an additional bottleneck factor.

When you say to load a batch of images at once, how do you think it should be done at the implementation level? I had already tried to load the batch of images in the `__call__` function, but PyTorch Lightning gives me an error saying that `__getitem__` is missing.

Thank you very much again for your time!

You can check whether the disk is actually the limiting factor by watching its utilization while training, e.g. with:

    iostat -xm 1

You can disable automatic batching when creating the PyTorch DataLoader:
https://pytorch.org/docs/stable/data.html#disable-automatic-batching

So technically, you could load batches of images that are stored contiguously. This should help if the disk is the bottleneck. But indeed the images are quite large; you might want to distribute them across several disks?
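
As a rough sketch of how that could look (the `BatchedImageDataset` name is made up, and `images` / `annotations` stand for data you have prepared elsewhere, e.g. the memmap above):

    import torch
    from torch.utils.data import DataLoader, Dataset, RandomSampler, BatchSampler

    class BatchedImageDataset(Dataset):
        """Hypothetical dataset whose __getitem__ takes a whole list of indices."""

        def __init__(self, images, annotations):
            self.images = images            # e.g. the np.memmap from the sketch above
            self.annotations = annotations  # annotations already loaded into RAM

        def __len__(self):
            return len(self.images)

        def __getitem__(self, indices):
            # with automatic batching disabled, the sampler below yields a *list*
            # of indices, so the whole batch is loaded here in one go
            imgs = torch.as_tensor(self.images[indices])
            annots = [self.annotations[i] for i in indices]
            return imgs, annots

    dataset = BatchedImageDataset(images, annotations)  # placeholders
    loader = DataLoader(
        dataset,
        batch_size=None,  # disables automatic batching
        sampler=BatchSampler(RandomSampler(dataset), batch_size=8, drop_last=False),
        num_workers=4,
        persistent_workers=True,
    )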

If at some point you find that the CPU is the bottleneck, you could try NVIDIA DALI.

It supports GPU decoding of JPEG images.
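
A minimal sketch of such a pipeline, loosely following the DALI examples (the paths, batch size, and thread count are placeholders; treat this as untested):

    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn

    @pipeline_def(batch_size=8, num_threads=4, device_id=0)
    def jpeg_pipeline(file_list):
        # the reader hands back the raw encoded JPEGs (plus labels we ignore here)
        jpegs, _ = fn.readers.file(files=file_list)
        # device="mixed" decodes the JPEGs on the GPU (nvJPEG)
        images = fn.decoders.image(jpegs, device="mixed")
        return images

    pipe = jpeg_pipeline(file_list=["img_0001.jpg", "img_0002.jpg"])  # placeholder paths
    pipe.build()
    (images,) = pipe.run()  # a batch of decoded images, already on the GPU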