Can transformations hinder GPU performance?

I am using a VGG16 model with a few linear layers for a classification task. Due to the complexity of the problem, I had to apply the following transform, as shown in the code below.

import numpy as np
import torch
from torchvision import transforms
from torchvision.transforms import InterpolationMode

class PatchNorm:
    def __call__(self, patch):
        # patch: (H, W, C) array; the in-place division below needs a float dtype
        # Keep the first 3 channels and swap the first and last axes
        patch = np.swapaxes(patch[:, :, 0:3], -1, 0)

        # Reverse the channel order, then min-max scale to [0, 1]
        patch = np.stack([patch[2, :, :], patch[1, :, :], patch[0, :, :]])
        patch -= patch.min()
        patch /= patch.max()

        patch *= 255  # [0, 255] range

        patch = patch.astype(np.uint8)

        # Swap the axes back to (H, W, 3)
        patch = np.swapaxes(patch, 0, -1)

        # Convert to a (3, H, W) tensor
        patch = torch.tensor(patch)
        patch = torch.permute(patch, (2, 0, 1))
        patch = transforms.CenterCrop(size=(64, 64))(patch)
        patch = transforms.Resize(size=(224, 224), interpolation=InterpolationMode.BICUBIC)(patch) / 255
        patch = transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))(patch).to(dtype=torch.float32)
        return patch

I noticed that running this model on a Tesla V100 was almost as fast as running it on the CPU (Intel(R) Xeon(R) Silver 4215). I changed num_workers in the DataLoader to 8 and saw a significant speed-up (from 1.5 hours to 14 minutes). But with a higher num_workers the run was a bit laggy.

I was wondering whether transforms like the ones above could be the cause of the slow-down? Thank you.

P.S.: Regarding the transforms, I have a .tiff image with 10 channels, of which I only take 3. I reorder them, min-max normalize, multiply by 255, center crop to use only 64x64 pixels, divide by 255 (unnecessary, I know), upsample to 224x224, normalize with VGG16's mean and std, reorder the channels, and return.

In your PatchNorm class, are you explicitly moving the patch onto the GPU before passing it in? If you're not making such a call, then this is likely all happening on the CPU anyway.
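If the preprocessing were moved to the GPU, one option is to run it once per batch in the training loop instead of once per sample inside the Dataset, since creating CUDA tensors inside DataLoader worker processes is generally discouraged. The following is only a rough sketch of that idea: `preprocess_batch` is a hypothetical helper, it assumes float patches of shape (N, H, W, C) with H and W of at least 64, it skips the uint8 round trip, and it uses `torch.nn.functional.interpolate` instead of torchvision's `Resize`, so the output will not be bit-identical to PatchNorm.

```python
import torch
import torch.nn.functional as F

# Mean and std taken from the Normalize call in the question.
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def preprocess_batch(batch, device):
    """Hypothetical helper: (N, H, W, C>=3) float batch -> (N, 3, 224, 224) on `device`."""
    x = batch.to(device, non_blocking=True)
    # Take channels 2, 1, 0 (first three, reversed) and move them to NCHW layout.
    x = x[..., [2, 1, 0]].permute(0, 3, 1, 2).float()

    # Per-sample min-max scaling to [0, 1]; the clamp guards against constant patches.
    lo = x.amin(dim=(1, 2, 3), keepdim=True)
    hi = x.amax(dim=(1, 2, 3), keepdim=True)
    x = (x - lo) / (hi - lo).clamp_min(1e-12)

    # Center crop to 64x64 (assumes H >= 64 and W >= 64).
    h, w = x.shape[-2:]
    top, left = (h - 64) // 2, (w - 64) // 2
    x = x[..., top:top + 64, left:left + 64]

    # Bicubic upsampling to the VGG16 input size, then channel-wise normalization.
    x = F.interpolate(x, size=(224, 224), mode="bicubic", align_corners=False)
    return (x - MEAN.to(device)) / STD.to(device)
```

In a training loop this would be called on each batch coming out of the DataLoader, with `device` set to `torch.device("cuda")` when a GPU is available, so the Dataset only has to read the raw patch.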

This class is passed to transforms.Compose(). I will try to put it on the GPU and let you know the results. Thanks for the advice, it really slipped my mind :smiley:

Your description sounds valid, since data loading and processing are performed in the main thread when num_workers=0 is used (which also explains the poor performance).
In that case the CPU is blocked by the data processing before it can schedule and launch the kernels. If you profile your workload with e.g. Nsight Systems, you should see a CPU-limited workload, most likely with gaps between kernel launches, as the CPU won't be able to "run ahead". Using multiple workers reduces the load on the main thread, which can then schedule all CUDA kernels in time. That shrinks the gaps between kernel executions and avoids "starving" the GPU, which should also be visible in a profile.
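A quick way to check whether loading is the bottleneck, before reaching for a profiler, is to time the DataLoader on its own with different num_workers values. This is a minimal sketch under assumptions: `SyntheticPatches` is a hypothetical stand-in for the real TIFF dataset, and `make_loader` and `time_loading` are illustrative helper names, not part of PyTorch.

```python
import time

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class SyntheticPatches(Dataset):
    """Hypothetical stand-in for the TIFF dataset: random (H, W, 10) float patches."""

    def __init__(self, n=64, size=80):
        self.n, self.size = n, size

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        rng = np.random.default_rng(idx)
        patch = rng.random((self.size, self.size, 10), dtype=np.float32)
        return torch.from_numpy(patch), idx % 2  # (patch, dummy label)


def make_loader(dataset, num_workers, batch_size=16):
    # Pinned memory only helps when batches are later copied to a CUDA device.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
        persistent_workers=num_workers > 0,
    )


def time_loading(loader):
    """Iterate once over the loader and return (samples seen, seconds elapsed)."""
    start = time.perf_counter()
    seen = 0
    for patches, _labels in loader:
        seen += patches.shape[0]
    return seen, time.perf_counter() - start
```

Comparing `time_loading(make_loader(ds, 0))` against the same call with a higher worker count on the real dataset gives a rough idea of how much of the epoch time is spent in the transforms. On platforms that start workers with spawn, calls using num_workers > 0 need to sit under an `if __name__ == "__main__":` guard.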