Low GPU Utilization with Custom DataLoader (OpenCV and NumPy Preprocessing)

Hi,

I'm facing low GPU utilization (only around 15%) together with high GPU memory usage with PyTorch on Windows 10, and I have already tried to optimize the dataloader.
The DataLoader has the following structure: first, a list of the image paths is collected.

self.filenames = [os.path.join(dp, f)
                  for dp, dn, fn in os.walk(os.path.expanduser(self.images_root))
                  for f in fn if is_image(f)]
self.filenames.sort()
self.filenamesGt = [os.path.join(dp, f)
                    for dp, dn, fn in os.walk(os.path.expanduser(self.labels_root))
                    for f in fn if is_image(f)]
self.filenamesGt.sort()

Then the images are opened with PIL:

def __getitem__(self, idx):
    # read input image
    filename = self.filenames[idx]
    filenameGt = self.filenamesGt[idx]
    image_rgb = Image.open(filename)
    image_Gt = Image.open(filenameGt)

Afterwards, torchvision transforms are applied, with a conversion via ToTensor() at the end. The resulting torch tensors are then passed to a function containing some OpenCV and NumPy operations, which produces the final input tensor returned by __getitem__:

input_np = self.create_rgbdm(image_rgb.squeeze(0).numpy().transpose(1,2,0), image_Gt.squeeze(0).numpy())
input_tensor = transforms.ToTensor()(input_np)
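
Put together, the relevant part of __getitem__ looks roughly like this (a condensed sketch; self.transform stands in for the actual torchvision transform chain, and the body of create_rgbdm is omitted):

def __getitem__(self, idx):
    # open input and ground-truth image with PIL
    image_rgb = Image.open(self.filenames[idx])
    image_Gt = Image.open(self.filenamesGt[idx])

    # torchvision transforms, ending with ToTensor()
    image_rgb = self.transform(image_rgb)
    image_Gt = self.transform(image_Gt)

    # OpenCV / NumPy preprocessing on the resulting tensors
    input_np = self.create_rgbdm(image_rgb.squeeze(0).numpy().transpose(1, 2, 0),
                                 image_Gt.squeeze(0).numpy())

    # final conversion back to a tensor for the DataLoader
    return transforms.ToTensor()(input_np)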

The DataLoader arguments in the training script are the following:

train_data_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=16, shuffle=True,
    num_workers=8, pin_memory=True)

The problem is that changing num_workers from 0 to 2, 4, or 8 does not decrease the data loading time significantly or solve the GPU utilization problem. Is the custom preprocessing function self.create_rgbdm() at the end the problem when the DataLoader is running with multiple workers? Should the function be called outside of __getitem__ (see the sketch below)? Or could something else be the reason?
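
By "outside of __getitem__" I mean precomputing the OpenCV/NumPy result once and only loading the cached arrays during training, roughly like this (just a sketch; compute_rgbdm_for_index and the cache location are placeholder names, not real code from my Dataset):

import os
import numpy as np

cache_dir = 'rgbdm_cache'                      # placeholder location
os.makedirs(cache_dir, exist_ok=True)

# one-off pass before training: run the expensive PIL + OpenCV/NumPy part once
for idx in range(len(train_dataset)):
    # hypothetical helper wrapping the current __getitem__ preprocessing
    input_np = train_dataset.compute_rgbdm_for_index(idx)
    np.save(os.path.join(cache_dir, '%06d.npy' % idx), input_np)

# __getitem__ during training would then reduce to:
#     input_np = np.load(os.path.join(self.cache_dir, '%06d.npy' % idx))
#     return transforms.ToTensor()(input_np)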

PyTorch version: 1.5 with CUDA 10.1
GPU: NVIDIA RTX 2080 Ti

You could profile the data loading using the ImageNet code and check if that’s really the bottleneck.
If so, you might take a look at this post, which explains potential bottlenecks in the data loading pipeline.
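
The ImageNet example essentially measures how long the training loop waits for the next batch; a stripped-down version of that check could look like this (just a sketch; adapt the loop body to your training code):

import time

data_time = 0.0
end = time.time()
for i, batch in enumerate(train_data_loader):
    data_time += time.time() - end     # time spent waiting for the DataLoader

    # ... forward / backward / optimizer step ...

    end = time.time()

print('total time waiting for data: {:.2f}s'.format(data_time))

If data_time stays small compared to the overall epoch time, the data loading is not the bottleneck and the slowdown is elsewhere.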

Thanks for the quick reply!
When I profile the train script with torch.utils.bottleneck using a single batch of size 24 and num_workers=0, I get the output below. So besides run_backward, the __getitem__ part seems to consume most of the time. When I profile only the DataLoader (a rough sketch of that standalone timing follows the profile), I can see that almost all of the time spent in the __getitem__ function goes to PIL image open/convert operations…
Is there anything I can do to make them faster, and what could be the reason that increasing num_workers does not help?

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1   24.711   24.711   24.711   24.711 {method 'run_backward' of 'torch._C._EngineBase' objects}
  6297    7.289    0.001    7.289    0.001 {method 'decode' of 'ImagingDecoder' objects}
    87    3.099    0.036    3.099    0.036 {built-in method conv2d}
     1    2.896    2.896    5.299    5.299 \models\DCCA_sparse_networks.py:373(__call__)
    48    2.000    0.042    2.000    0.042 {method 'resize' of 'ImagingCore' objects}
    86    1.983    0.023    1.983    0.023 {method 'to' of 'torch._C._TensorBase' objects}
 36871    1.910    0.000    1.910    0.000 {built-in method matmul}
    48    1.321    0.028    1.321    0.028 {method 'copy' of 'ImagingCore' objects}
    24    0.741    0.031   12.152    0.506 \dataloaders\own_dataset_loader.py:100(__getitem__)
     4    0.450    0.112    0.450    0.112 {built-in method conv_transpose2d}
  1001    0.333    0.000    0.333    0.000 {built-in method io.open_code}
160/158    0.295    0.002    0.298    0.002 {built-in method _imp.create_dynamic}
     2    0.293    0.146    0.293    0.146 {built-in method symeig}
    33    0.235    0.007    0.235    0.007 {method 'uniform_' of 'torch._C._TensorBase' objects}
    37    0.222    0.006    0.222    0.006 {method 'normal_' of 'torch._C._TensorBase' objects}
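
For reference, the DataLoader-only profiling mentioned above was done roughly like this (sketch; just cProfile around a plain loop over the loader, no model involved):

import cProfile
import pstats

def iterate_loader():
    for batch in train_data_loader:   # only exercises the Dataset / __getitem__ code
        pass

cProfile.run('iterate_loader()', 'loader_stats')
pstats.Stats('loader_stats').sort_stats('tottime').print_stats(15)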

It depends on whether the data loading or the data processing is the bottleneck.
In the latter case, you could use PIL-SIMD, which should speed up the processing.
As explained in the linked post, you should also make sure to store the data on a local SSD to avoid data loading bottlenecks.
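
To see which of the two dominates, you could also time the pure PIL decode separately from the full __getitem__ for a few samples, e.g. (just a sketch, using the filename lists from your Dataset):

import time
from PIL import Image

decode_t, total_t = 0.0, 0.0
for idx in range(50):                               # a small sample is enough
    t0 = time.time()
    img = Image.open(train_dataset.filenames[idx])
    gt = Image.open(train_dataset.filenamesGt[idx])
    img.load(); gt.load()                           # force the actual decode
    decode_t += time.time() - t0

    t0 = time.time()
    _ = train_dataset[idx]                          # full __getitem__ incl. create_rgbdm
    total_t += time.time() - t0

print('decode only: {:.2f}s, full __getitem__: {:.2f}s'.format(decode_t, total_t))

If most of the time is in the decode, PIL-SIMD (or faster storage) is the right lever; if the gap between the two numbers is large, the OpenCV/NumPy processing is the part to optimize.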