Can you use torch.backends.cudnn.benchmark = True after resizing images?

himat · March 22, 2019, 7:48pm

The thread at What does torch.backends.cudnn.benchmark do? says that you can set torch.backends.cudnn.benchmark = True if your input sizes for your network don’t vary.

So is this fine to enable if I resize my images to be the same size in the dataloader at every iteration, or is this considered having different input sizes?

Kushaj · March 22, 2019, 8:22pm

When it says the input sizes for your network must be the same, it means that the images you feed to your model (say, a ResNet) should have the same size at every iteration for maximum performance.

When you enable cudnn benchmark, cuDNN times the available convolution algorithms the first time it sees a given input shape and picks the fastest one for your GPU. That is why the first iterations take longer when you train with benchmark=True.

Now if your input size changes at every iteration, it will repeat that work at every iteration, making training slower.
Note: By changing input size, I am not referring to the size of the images that the dataloader reads from disk. I am referring to the images that the dataloader outputs.

Yes, it is fine to resize your images at every iteration with the benchmark=True
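
A minimal sketch of the situation described above, assuming a toy convolution and made-up image sizes (`raw_images` and `resize` are hypothetical stand-ins for a dataset and its resize transform). The raw images differ in size, but the batch that reaches the model always has the same shape, which is the case where benchmark=True is appropriate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Enable the cuDNN autotuner once, before any forward pass.
# On a CPU-only machine the flag is accepted but has no effect.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)

# Hypothetical raw images of different sizes, as they might come off disk.
raw_images = [torch.rand(3, 120, 160), torch.rand(3, 300, 200), torch.rand(3, 64, 64)]


def resize(img, size=(224, 224)):
    # Stand-in for a resize transform applied inside the dataset.
    img = F.interpolate(img.unsqueeze(0), size=size, mode="bilinear", align_corners=False)
    return img.squeeze(0)


# After resizing, every sample has the same shape, so every batch does too.
batch = torch.stack([resize(img) for img in raw_images]).to(device)
output = model(batch)
print(batch.shape, output.shape)
```

What matters is the shape of `batch`, not the shapes in `raw_images`.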

himat · March 25, 2019, 5:43pm

I see, this is the clearest explanation I’ve found!
So this would only be a problem if I had some kind of fully convolutional network, right? Because with a normal CNN, I would need a fixed input size anyway to make sure all the dimensions match throughout the layers of the network?

Kushaj · March 25, 2019, 9:40pm

I can’t think of a CNN example right now, but you can see the difference in NLP tasks, where the sequence length changes.

ptrblck · March 26, 2019, 1:10am

For the sake of completeness: if you are using adaptive pooling layers, you can relax the shape requirements for the input so that CNNs with linear layers will also take variable sized images.
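
A small sketch of this, assuming a made-up model (`TinyNet` and its layer sizes are hypothetical). The adaptive pooling layer always produces a 4x4 feature map, so the linear layer sees a fixed number of features regardless of the input resolution:

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical CNN with a linear head that accepts variable-sized images."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        # Output is always 16 x 4 x 4, whatever the spatial size of the input.
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        self.classifier = nn.Linear(16 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.classifier(torch.flatten(x, 1))


model = TinyNet().eval()
with torch.no_grad():
    out_small = model(torch.rand(2, 3, 32, 32))
    out_large = model(torch.rand(2, 3, 100, 75))
print(out_small.shape, out_large.shape)
```

Note that with inputs like these the shapes do vary between iterations, so cudnn benchmark would have to re-tune for each new shape.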

Kushaj · March 26, 2019, 10:24am

Do you know of any literature on adaptive pooling layers? fastai uses them and I have used them too, but I am not able to find much literature on them.

ptrblck · March 26, 2019, 10:31am

I’m not sure if there is much literature (at least I don’t know any), as the adaptive pooling layers just provide a convenient way to avoid calculating the kernel shapes for each new input size manually.
Since these layers do not store any internal parameters, you could also use the functional API of the pooling operation and compute the kernels yourself for the current input.
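
As a rough illustration of computing the kernel yourself, here is a sketch with a hypothetical helper, `manual_adaptive_avg_pool2d`. It only handles the simple case where the input height and width are exact multiples of the target size, so kernel and stride are both `input_size // output_size`. The general case uses overlapping or uneven windows and is not covered here.

```python
import torch
import torch.nn.functional as F


def manual_adaptive_avg_pool2d(x, output_size):
    """Hypothetical helper: average-pool x to output_size via F.avg_pool2d.

    Only valid when the input height/width are divisible by the target size.
    """
    out_h, out_w = output_size
    in_h, in_w = x.shape[-2:]
    if in_h % out_h or in_w % out_w:
        raise ValueError("input size must be a multiple of the output size")
    kernel = (in_h // out_h, in_w // out_w)
    # Non-overlapping windows: stride equals kernel size.
    return F.avg_pool2d(x, kernel_size=kernel, stride=kernel)


x = torch.rand(2, 16, 32, 48)
manual = manual_adaptive_avg_pool2d(x, (4, 4))
reference = F.adaptive_avg_pool2d(x, (4, 4))
print(manual.shape, torch.allclose(manual, reference, atol=1e-6))
```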

Kushaj · March 26, 2019, 10:33am

I think the source code of the layers would provide sufficient docs.

Sanjayvarma11 · May 13, 2020, 1:12am

Hi ptrblck. It’s me again. I hope you are okay. I just learned that there is a setting, torch.backends.cudnn.benchmark = True, that can increase training speed. But I didn’t find any example of it, even in the PyTorch documentation. Here is my training code. Can you tell me where to set it? Thank you.

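import torch
from apex import amp  # NVIDIA Apex mixed precision, used for amp.scale_loss below
from tqdm import tqdm_notebook

class Train: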
  def __init__(self, model, dataloader, optimizer, stats, scheduler=None, L1lambda = 0,criterion=None,use_amp=True):
    self.model = model
    self.dataloader = dataloader
    self.optimizer = optimizer
    self.scheduler = scheduler
    self.stats = stats
    self.L1lambda = L1lambda
    self.criterion=criterion
    self.loss1=0.0
    self.loss2=0.0
    self.loss=0.0
    self.use_amp=use_amp
  def run(self):
    self.model.train()
    torch.cuda.empty_cache()
    pbar = tqdm_notebook(self.dataloader)
    for data1,data2,target1,target2 in pbar:
      # get samples
      data1,data2 = data1.to(self.model.device), data2.to(self.model.device)
      target1, target2 = target1.to(self.model.device), target2.to(self.model.device)
      self.optimizer.zero_grad()
      output1,output2 = self.model(data1,data2)
      #print(target.shape)
      target1=target1.unsqueeze(1)
      target2=target2.unsqueeze(1)
      self.loss1=self.criterion[0](output1.float(), target1.half())
      self.loss2=self.criterion[1](output2.float(), target2.half())
      #print("loss1 {}".format(self.loss1))
      #print("loss2 {}".format(self.loss2))
      self.loss=(self.loss1+self.loss2)/2.0
      
      # In PyTorch, we need to set the gradients to zero before starting to do backpropragation because PyTorch accumulates the gradients on subsequent backward passes. 
      # Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly.

      # Predict
      
      #Implementing L1 regularization
      #print(self.loss)
      if self.L1lambda > 0:
        reg_loss = 0.
        for param in self.model.parameters():
          reg_loss += torch.sum(param.abs())
        self.loss += self.L1lambda * reg_loss

      #print(self.loss)
      # Backpropagation
      if self.use_amp:
        with amp.scale_loss(self.loss, self.optimizer) as scaled_loss:
          scaled_loss.backward()
      else:
        self.loss.backward()
      # Apply the parameter update
      self.optimizer.step()

      # Update pbar-tqdm
      #pred = y_pred.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
      '''n_digits=1
      output = (output * 10**n_digits).round() / (10**n_digits)
      target = (target * 10**n_digits).round() / (10**n_digits)
      correct = output.eq(target).sum().item()
      print("correct is {}".format(correct))'''
      correct=0
      lr = 0
      if self.scheduler:
        lr = self.scheduler.get_last_lr()[0]
      else:
        # not sure why I previously used self.optimizer.lr_scheduler.get_last_lr()[0]
        lr = self.optimizer.param_groups[0]['lr']
      
      #lr =  if self.scheduler else (self.optimizer.lr_scheduler.get_last_lr()[0] if self.optimizer.lr_scheduler else self.optimizer.param_groups[0]['lr'])
      
      self.stats.add_batch_train_stats(self.loss.item(),correct, len(data1), lr)
      pbar.set_description(self.stats.get_latest_batch_desc())
      if self.scheduler:
        self.scheduler.step()

The code that calls my training module:

import torch.optim as optim
import torch.nn as nn
import time
EPOCHS =1
#max_lr=1
optimizer = optim.SGD(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(
   model, optimizer, opt_level="O2", 
   keep_batchnorm_fp32=True, loss_scale="dynamic"
)
#criterion=nn.L1Loss()
criterion=[nn.BCEWithLogitsLoss(),nn.L1Loss()]
#criterion = nn.BCEWithLogitsLoss()
start=time.time()
model.gotrain(optimizer, train_loader, test_loader, EPOCHS, "/content/gdrive/My Drive",criterion,use_amp=True)
print(time.time()-start)

Thank you sir

ptrblck · May 13, 2020, 1:54am

Use it at the beginning of the script via:

torch.backends.cudnn.benchmark = True

This will make cuDNN benchmark the available algorithms for each new input shape, and will thus slow down that iteration.
However, the following iterations should be faster, if a fast kernel was selected.
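
A rough way to observe this yourself, sketched with a toy model (the layer sizes, shapes, and the `time_forward` helper are assumptions for illustration). On a GPU, the first call for a given shape is expected to be the slow one. Keep in mind that the very first call also pays for CUDA initialization, so not all of that time is due to benchmarking. On a CPU-only machine the flag has no effect and the timings only show ordinary noise.

```python
import time

import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # set once, before the training loop

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1),
).to(device)


def time_forward(shape):
    """Hypothetical helper: time one forward pass for an input of this shape."""
    x = torch.rand(*shape, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure earlier GPU work is finished
    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the forward pass to finish
    return time.perf_counter() - start


# Same shape five times: only the first call should trigger benchmarking.
timings = [time_forward((8, 3, 64, 64)) for _ in range(5)]
# A new shape triggers benchmarking again.
new_shape_time = time_forward((8, 3, 96, 96))
print(["%.4f" % t for t in timings], "%.4f" % new_shape_time)
```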

Thanks for pointing out that the documentation lacks an example. We should definitely fix this.

Sanjayvarma11 · May 13, 2020, 2:18am

So it means it will not always increase speed, right? Okay. Also, I am using cupy instead of numpy for faster processing during data loading in PyTorch. While using cupy, I am getting the following error.
The code I use for data loading:

  def __getitem__(self,index):
      if(torch.is_tensor(index)):
        index=index.tolist()
      input1=cp.array(Image.open(self.fg_bgimage[index]))
      input2=cp.array(Image.open(self.bg_image[index]))
      output1=cp.array(Image.open(self.mask_image[index]))
      output2=cp.array(Image.open(self.depth_image[index]))
      output2=output2.transpose(1,0)
      output1=self.return_binary(output1)
      if(self.transform):
        input1=self.transform[0](input1)
        input2=self.transform[1](input2)
        output1=self.transform[2](output1)
        output2=self.transform[3](output2)
      
      return input1,input2,output1,output2

The calling function is:


import time
start=time.time()
i=0
pbar = tqdm_notebook(train_loader)
for data1,data2,target1,target2 in pbar:
  print(data1.shape)
  break
print(time.time()-start)

The error is:
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: TqdmDeprecationWarning: This function will be removed in tqdm==5.0.0
Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  after removing the cwd from sys.path.
0%
0/8750 [00:00<?, ?it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-b126aa310636> in <module>()
      3 i=0
      4 pbar = tqdm_notebook(train_loader)
----> 5 for data1,data2,target1,target2 in pbar:
      6   print(data1.shape)
      7   break

5 frames
/usr/local/lib/python3.6/dist-packages/tqdm/notebook.py in __iter__(self, *args, **kwargs)
    213     def __iter__(self, *args, **kwargs):
    214         try:
--> 215             for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):
    216                 # return super(tqdm...) will not catch exception
    217                 yield obj

/usr/local/lib/python3.6/dist-packages/tqdm/std.py in __iter__(self)
   1102                 fp_write=getattr(self.fp, 'write', sys.stderr.write))
   1103 
-> 1104         for obj in iterable:
   1105             yield obj
   1106             # Update and possibly print the progressbar.

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    343 
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    854             else:
    855                 del self._task_info[idx]
--> 856                 return self._process_data(data)
    857 
    858     def _try_put_index(self):

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _process_data(self, data)
    879         self._try_put_index()
    880         if isinstance(data, ExceptionWrapper):
--> 881             data.reraise()
    882         return data
    883 

/usr/local/lib/python3.6/dist-packages/torch/_utils.py in reraise(self)
    393             # (https://bugs.python.org/issue2651), so we work around it.
    394             msg = KeyErrorMessage(msg)
--> 395         raise self.exc_type(msg)

cupy/cuda/runtime.pyx in cupy.cuda.runtime.CUDARuntimeError.__init__()

TypeError: an integer is required

Thank you sir

ptrblck · May 13, 2020, 2:22am

I don’t know how cupy interacts with PyTorch, but it seems that cupy is failing to initialize the runtime?

Note that using the GPU in your data loading pipeline will not necessarily speed up the training if the training itself can already utilize the GPU(s) fully. For example, using multiple workers might allow your application to load and process the next batches in the background, while the GPU is busy with the model training.
If you use the GPU to process the data, it will block the training of course, if you are using a single device.
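
A minimal sketch of the multi-worker alternative, assuming a synthetic dataset (`SyntheticImages`, `make_loader`, and all sizes are hypothetical). Preprocessing stays on the CPU with numpy inside `__getitem__`, the worker processes prepare upcoming batches in the background, and only the finished batch is moved to the GPU:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class SyntheticImages(Dataset):
    """Hypothetical dataset that does its preprocessing on the CPU with numpy."""

    def __init__(self, length=64, size=(32, 32)):
        self.length = length
        self.size = size

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        rng = np.random.default_rng(index)
        image = rng.random((3, *self.size), dtype=np.float32)
        # CPU-side preprocessing, e.g. building a binary mask from the image.
        mask = (image.mean(axis=0, keepdims=True) > 0.5).astype(np.float32)
        return torch.from_numpy(image), torch.from_numpy(mask)


def make_loader(num_workers=2, batch_size=8):
    # Workers load and preprocess upcoming batches while the GPU trains.
    return DataLoader(
        SyntheticImages(),
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for images, masks in make_loader():
        # Only the finished batch is moved to the GPU.
        images = images.to(device, non_blocking=True)
        masks = masks.to(device, non_blocking=True)
        print(images.shape, masks.shape)
        break
```

The right `num_workers` value depends on the machine, so it is worth timing a few settings.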