I am trying to extract features for a dataset and save them in a 2D tensor with gradient information. Getting torch.cuda.OutOfMemoryError: CUDA out of memory

Hi,
I am training a CNN model to extract features for a dataset, and I want to save all extracted features in a 2D tensor together with the gradient information. The code block below throws torch.cuda.OutOfMemoryError: CUDA out of memory. The batch size for the dataloader is 4 and the available GPU memory is 49 GB.

a = []

for data, iteration in zip(dataloader, range(0, 10)):
    images = self.preprocess_image(data)
    features = self.backbone(images.tensor)

    for k in features.keys():
        if k == 'p3':
            # flatten the p3 feature map and move it to the CPU
            neg_f = features[k].reshape(features[k].shape[0], -1).to(self._cpu_device)
            print(f"Check device neg_f: {neg_f.is_cuda}")
            a.append(neg_f[0])

a = torch.cat(a, 0)
a = a.reshape(10, 640000)

The neg_f tensor is stored in CPU memory. However, if I stop gradient tracking, I do not get the torch.cuda.OutOfMemoryError: CUDA out of memory error. The code below does not raise the error, but then I lose the gradient information.

a = []

for data, iteration in zip(dataloader, range(0, 10)):
    images = self.preprocess_image(data)
    features = self.backbone(images.tensor)

    with torch.no_grad():
        for k in features.keys():
            print("This is the Backbone Return:", k, "shape", features[k].shape)
            if k == 'p3':
                neg_f = features[k].reshape(features[k].shape[0], -1).to(self._cpu_device)
                a.append(neg_f[0])
                print(f"Check device: {neg_f.is_cuda}")

a = torch.cat(a, 0)
a = a.reshape(10, 640000)

Is there any way I can store the features with gradient information without causing the torch.cuda.OutOfMemoryError: CUDA out of memory error? My final goal is to pass the stored 2D tensor of features to a contrastive loss as negative examples.

Thanks in advance!!

I assume you want to keep all computation graphs in order to store the “gradient information”?
If so, then the large increase in memory would be expected. Detaching the tensor would reduce the memory usage, but won’t allow you to compute the gradients w.r.t. the previously used parameters anymore.
You could reduce the batch size and compute the gradients using your custom loss for a smaller number of samples while accumulating the gradients (if this is possible using your loss).
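
Here is a minimal sketch of that accumulation idea, assuming a toy backbone and a cosine-similarity/logsumexp stand-in for the contrastive loss; the names backbone, dataloader, anchor, num_chunks, and temperature are placeholders, not part of your code:

import torch
import torch.nn.functional as F

# Toy backbone standing in for the real feature extractor (placeholder).
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
)
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.1)
temperature = 0.07
num_chunks = 10  # accumulate over 10 small batches instead of one large one

# Fake data standing in for the real dataloader: 10 chunks of 2 images each.
dataloader = [torch.randn(2, 3, 32, 32) for _ in range(num_chunks)]
anchor = torch.randn(1, 8)  # placeholder anchor embedding

optimizer.zero_grad()
for images in dataloader:
    feats = backbone(images)                    # graph only for this chunk
    neg = feats.reshape(feats.shape[0], -1)     # negatives for this chunk
    # Toy contrastive-style term: push the anchor away from these negatives.
    logits = F.cosine_similarity(anchor, neg) / temperature
    loss = torch.logsumexp(logits, dim=0) / num_chunks  # scale before accumulating
    loss.backward()   # frees this chunk's graph, keeps the accumulated gradients
optimizer.step()      # one update using gradients from all chunks

Since each backward() call frees that chunk's computation graph, only one chunk's activations live on the GPU at a time, while the gradients w.r.t. the backbone parameters still reflect all chunks.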
