Faster normalisation method than ToTensor (torchvision)

I am trying to normalize batches of grayscale images to the range 0-1. At the moment I am using transforms.ToTensor to do so, but it is too slow: profiling shows it takes up about 80% of the running time. I am applying the transform to each image one by one without a DataLoader because I am training a DQN with a dynamic data pool. Are there any faster ways to do the same thing? Right now it is extremely slow, as each image is 1024×1024. Thank you very much for your help in advance. I have provided a snippet below.

        #Random transition batch is taken from experience replay memory
        transitions = self.memory.sample(self.batch_size)
        batch_state = []
        batch_action = []
        batch_reward = []
        batch_state_next_state = []
        batch_done = []
        for t in transitions:
            bs, ba, br, bsns, bd = t
            bs = transform_img_for_model(bs)
            if(self.transforms is not None):
                bs = self.transforms(bs)
            batch_state.append(bs)
            batch_action.append(ba)
            batch_reward.append(br)
            bsns = transform_img_for_model(bsns)
            if(self.transforms is not None):
                bsns = self.transforms(bsns)
            batch_state_next_state.append(bsns)
            batch_done.append(bd)

        batch_state = Variable(torch.stack(batch_state).cuda(async=True), volatile=True)
        batch_action = torch.FloatTensor(batch_action).unsqueeze_(0)
        batch_action = batch_action.view(batch_action.size(1), -1)
        batch_action = Variable(batch_action.cuda(async=True), volatile=True)
        batch_reward = torch.FloatTensor(batch_reward).unsqueeze_(0)
        batch_reward = batch_reward.view(batch_reward.size(1), -1)
        batch_reward = Variable(batch_reward.cuda(async=True), volatile=True)
        batch_next_state = Variable(torch.stack(batch_state_next_state).cuda(async=True), volatile=True)

def transform_img_for_model(image_array):
    # add a channel dimension and repeat the grayscale frame to 3 channels
    image_array_copy = image_array.clone()
    image_array_copy.unsqueeze_(0)
    image_array_copy = image_array_copy.repeat(3, 1, 1)
    return image_array_copy

Is the usage of a DataLoader completely impossible? If not, you could probably hide the transform time behind the GPU execution of your training procedure.

Anyway, if you need to work in a sequential manner, you might want to save your preprocessed tensors to your drive and load them for training. Would that be possible, or do you need other image processing ops besides normalization?
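
For illustration, a minimal sketch of that offline-caching idea (the file paths and function names here are made up, not from the thread):

import torch

def cache_frame(frame_uint8, path):
    # one-time "offline" pass: scale the raw uint8 frame to [0, 1] once and persist it
    torch.save(frame_uint8.float().div_(255.0), path)

def load_cached_frame(path):
    # at training time the normalization is already done, so this is just I/O
    return torch.load(path)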

Thanks for your reply. The problem is the use of experience replay. It is sequential as well as dynamic, so the dataset keeps changing, which makes it impossible to use the Dataset and DataLoader classes. As for saving, I suspect it would only introduce more overhead if I am saving and loading that frequently… Therefore I am really looking for a way to do this transformation explicitly on the GPU, by batch… Thanks again
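
As a sketch of what "normalize explicitly on the GPU, by batch" could look like (assuming the replay memory stores raw H x W uint8 tensors; non_blocking is the current spelling of the older async argument):

import torch

def normalize_batch_on_gpu(frames_uint8):
    # frames_uint8: list of H x W uint8 tensors taken straight from the replay memory
    batch = torch.stack(frames_uint8)                 # B x H x W, still uint8 on the CPU
    batch = batch.cuda(non_blocking=True)             # transfer the raw bytes (4x smaller than float32)
    batch = batch.float().div_(255.0)                 # one division for the whole batch, on the GPU
    return batch.unsqueeze(1).expand(-1, 3, -1, -1)   # B x 3 x H x W view for a pretrained ResNet, no copy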

OK, you are right: if you need to save it that frequently, it will introduce overhead.
I assumed you could save it once “offline” and then just load it in your training script.

What kind of images are you loading, i.e. which format, bit depth etc.?
Maybe we could speed up the loading and transformation by getting rid of some checks so that we trade some generalization for performance.
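
For example (a sketch, assuming the image is already a uint8 numpy array, so all of ToTensor's PIL mode and type checks can be skipped):

import numpy as np
import torch

def to_unit_tensor(img_np):
    # img_np: H x W uint8 numpy array; no mode checks, no PIL round-trip
    return torch.from_numpy(img_np).float().div_(255.0)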

I am loading grayscale images of 1024×1024. I had to turn them into 3 channels by repeating, as I am passing them to a pretrained ResNet. The images are in DICOM format, which I load with pydicom. But the major overhead is still in the transforms call…
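
Roughly, a sketch of skipping ToTensor in that pipeline (assuming pydicom >= 1.0 with dcmread, and that pixel_array comes back as an integer numpy array; DICOM is often 12/16-bit, so the divisor should match the actual bit depth):

import pydicom
import torch

def load_dicom_as_tensor(path):
    arr = pydicom.dcmread(path).pixel_array          # H x W integer array (often uint16)
    t = torch.from_numpy(arr.astype('float32'))
    t.div_(float(arr.max()))                         # scale to [0, 1]; use a fixed constant if the bit depth is known
    return t.unsqueeze(0).repeat(3, 1, 1)            # grayscale -> 3 channels for the pretrained ResNet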

I believe the issue is in lines 88-89 of the functional definition of the ToTensor operation:

# yikes, this transpose takes 80% of the loading time/CPU
img = img.transpose(0, 1).transpose(0, 2).contiguous()

Interesting. Is this avoidable?

The devs could probably come up with a PyTorch equivalent of np.moveaxis.
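
For reference, permute already expresses the same axis move in a single call; whether it is faster depends on the contiguous() copy, which both versions still pay. A sketch (assuming an H x W x C uint8 numpy image, as ToTensor receives it):

import numpy as np
import torch

def hwc_to_chw(img_np):
    # do the axis reordering in numpy, then hand the contiguous result to torch
    chw = np.ascontiguousarray(np.moveaxis(img_np, -1, 0))
    return torch.from_numpy(chw).float().div_(255.0)

# equivalently, staying in torch: the two transposes collapse into one permute
# t = torch.from_numpy(img_np).permute(2, 0, 1).contiguous().float().div_(255.0)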