PyTorch RuntimeError: CUDA out of memory. Tried to allocate 256.00 GiB

I tried to run a model on Colab and got this error, which seems really weird (256.00 GiB!!). The same error occurs if I change the data size, change the batch size, or clear the GPU memory.

The main model is a self-attention module (the data is images).

Here is the traceback:

Traceback (most recent call last):
  File "./train.py", line 169, in <module>
    miou_current = val(opt, model)
  File "./train.py", line 86, in val
    score = model.test(val=True)           # run inference
  File "/content/gdrive/My Drive/Colab Notebooks/STANet-withpth/models/CDFA_model.py", line 72, in test
    self.forward()
  File "/content/gdrive/My Drive/Colab Notebooks/STANet-withpth/models/CDFA_model.py", line 90, in forward
    self.feat_A, self.feat_B = self.netA(self.feat_A,self.feat_B)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/gdrive/My Drive/Colab Notebooks/STANet-withpth/models/backbone.py", line 46, in forward
    x = self.Self_Att(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/gdrive/My Drive/Colab Notebooks/STANet-withpth/models/BAM.py", line 37, in forward
    energy = torch.bmm(proj_query, proj_key)  # transpose check
RuntimeError: CUDA out of memory. Tried to allocate 256.00 GiB (GPU 0; 14.76 GiB total capacity; 824.42 MiB already allocated; 11.68 GiB free; 1.80 GiB reserved in total by PyTorch)

And here is the code of the BAM module:
import torch
import torch.nn.functional as F
from torch import nn


class BAM(nn.Module):
    """Basic self-attention module."""

    def __init__(self, in_dim, ds=8, activation=nn.ReLU):
        super(BAM, self).__init__()
        self.channel_in = in_dim
        self.key_channel = self.channel_in // 8
        self.activation = activation
        self.ds = ds  # downsampling factor for the average pooling
        self.pool = nn.AvgPool2d(self.ds)
        print('ds: ', ds)
        # 1x1 convolutions that produce the query, key, and value tensors
        self.query_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim // 8, kernel_size=1)
        self.key_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim // 8, kernel_size=1)
        self.value_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim, kernel_size=1)

        self.gamma = nn.Parameter(torch.zeros(1))

        # Softmax rescales the elements of the last dimension so that they
        # lie in the range [0, 1] and sum to 1.
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, input):
        """
            inputs :
                input : input feature maps (B X C X W X H)
            returns :
                out : self-attention value + input feature
                attention : B X N X N (N is Width*Height of the pooled map)
        """
        x = self.pool(input)
        m_batchsize, C, width, height = x.size()  # spatial size after pooling
        proj_query = self.query_conv(x).view(m_batchsize, -1, width * height).permute(0, 2, 1)  # B X N X C//8
        proj_key = self.key_conv(x).view(m_batchsize, -1, width * height)  # B X C//8 X N
        energy = torch.bmm(proj_query, proj_key)  # B X N X N
        energy = (self.key_channel ** -.5) * energy

        attention = self.softmax(energy)  # B X N X N

        proj_value = self.value_conv(x).view(m_batchsize, -1, width * height)  # B X C X N

        out = torch.bmm(proj_value, attention.permute(0, 2, 1))
        out = out.view(m_batchsize, C, width, height)

        out = F.interpolate(out, [width * self.ds, height * self.ds])
        out = out + input

        return out

batch_size = 8, C = 64, N = 128*64
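For reference, the module above can be sanity-checked with a small, made-up input on the CPU (a minimal sketch; the sizes here are illustrative and much smaller than my real ones):

import torch

# Minimal sketch: run the BAM class from above on a small, made-up input
# to confirm the shapes it produces.
bam = BAM(in_dim=64, ds=8)
x = torch.randn(2, 64, 64, 64)   # B x C x H x W (illustrative sizes)
out = bam(x)
print(out.shape)                 # torch.Size([2, 64, 64, 64])
# After pooling the map is 8x8, so N = 64 and the attention (energy) matrix
# is B x N x N = 2 x 64 x 64; its memory grows quadratically in N.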

Has anyone had the same error? Do you know how to deal with it, or can you explain why this is happening?


Could you print the shapes of proj_query and proj_key? It seems the bmm operation using these tensors is trying to create this huge tensor, and I guess these tensors might be (accidentally) broadcast.

tensor([[[ 0.1062, -0.0706,  0.0343,  ...,  0.0896,  0.1241,  0.0811],
         [ 0.1552, -0.0632,  0.0497,  ...,  0.0547,  0.0725,  0.0960],
         [ 0.0970, -0.0783,  0.0647,  ...,  0.0812,  0.0578,  0.1070],
         ...,
         [ 0.1979, -0.0234,  0.0873,  ...,  0.1047,  0.1507,  0.0587],
         [ 0.2031, -0.0630,  0.1304,  ...,  0.0985,  0.1816,  0.0514],
         [ 0.1303, -0.1082,  0.0683,  ...,  0.0532,  0.1669,  0.0524]],

        [[ 0.1829, -0.0706,  0.0533,  ...,  0.0783,  0.1255,  0.1043],
         [ 0.1891, -0.0596,  0.0698,  ...,  0.1247,  0.0781,  0.1504],
         [ 0.1921, -0.0589,  0.0724,  ...,  0.1275,  0.0724,  0.2050],
         ...,
         [ 0.1394, -0.0822,  0.0662,  ...,  0.1140,  0.1395,  0.1234],
         [ 0.1750, -0.0778,  0.0945,  ...,  0.0925,  0.1522,  0.0401],
         [ 0.1537, -0.1094,  0.0306,  ...,  0.0350,  0.1453,  0.0510]],

        [[ 0.1791, -0.0545,  0.0312,  ...,  0.1040,  0.1212,  0.0353],
         [ 0.2351, -0.0766,  0.0298,  ...,  0.0786,  0.0569,  0.1059],
         [ 0.2285, -0.0711,  0.0330,  ...,  0.0844,  0.0642,  0.1682],
         ...,
         [ 0.1871, -0.0736, -0.0312,  ...,  0.1648,  0.1705, -0.0026],
         [ 0.1876, -0.0933, -0.0158,  ...,  0.1435,  0.1671, -0.0308],
         [ 0.1806, -0.1350, -0.0156,  ...,  0.1419,  0.1417, -0.0586]],

        [[ 0.2451, -0.0512,  0.0207,  ...,  0.1242,  0.1442,  0.0446],
         [ 0.2269, -0.0461, -0.0124,  ...,  0.1284,  0.0918,  0.1082],
         [ 0.2298, -0.0476, -0.0281,  ...,  0.1397,  0.0669,  0.1410],
         ...,
         [ 0.0859, -0.0831,  0.0446,  ...,  0.1194,  0.1112,  0.0999],
         [ 0.1291, -0.0871,  0.0768,  ...,  0.1306,  0.1285,  0.0413],
         [ 0.1610, -0.1204,  0.0627,  ...,  0.1046,  0.1033,  0.0654]]],
       device='cuda:0')
tensor([[[-0.0857, -0.1034, -0.1035,  ..., -0.1175, -0.0724, -0.1227],
         [-0.0146, -0.0138, -0.0380,  ..., -0.0315, -0.0465,  0.0175],
         [ 0.0160,  0.0007,  0.0163,  ...,  0.0147,  0.0308,  0.0090],
         ...,
         [-0.0352, -0.0108, -0.0365,  ..., -0.1049, -0.0922, -0.0531],
         [ 0.2708,  0.2582,  0.2379,  ...,  0.2625,  0.2461,  0.2378],
         [ 0.0994,  0.0935,  0.0687,  ...,  0.1915,  0.1793,  0.0974]],

        [[-0.1221, -0.1453, -0.1377,  ..., -0.0979, -0.1344, -0.1147],
         [ 0.0085, -0.0006, -0.0129,  ..., -0.0306, -0.0060,  0.0057],
         [-0.0065,  0.0018,  0.0445,  ...,  0.0048,  0.0242,  0.0162],
         ...,
         [-0.0500, -0.0181, -0.0331,  ..., -0.0701, -0.0888, -0.0527],
         [ 0.3233,  0.3220,  0.2983,  ...,  0.2219,  0.2088,  0.2299],
         [ 0.1187,  0.0565,  0.0609,  ...,  0.1153,  0.1416,  0.0839]],

        [[-0.1152, -0.1286, -0.0901,  ..., -0.1428, -0.1430, -0.1392],
         [-0.0378, -0.0819, -0.0742,  ..., -0.0811, -0.0488, -0.0468],
         [ 0.0232, -0.0236,  0.0366,  ..., -0.0430, -0.0368,  0.0521],
         ...,
         [-0.0138, -0.0094, -0.0153,  ...,  0.0405,  0.0340, -0.0206],
         [ 0.3609,  0.3894,  0.3172,  ...,  0.3283,  0.2934,  0.2837],
         [ 0.1029,  0.1177,  0.1286,  ...,  0.0946,  0.0702,  0.0427]],

        [[-0.1581, -0.1968, -0.1531,  ..., -0.1368, -0.1294, -0.1570],
         [-0.0574, -0.0339, -0.0765,  ..., -0.1366, -0.1136, -0.0715],
         [ 0.0139,  0.0585,  0.0980,  ...,  0.0896,  0.0811,  0.1045],
         ...,
         [-0.0468, -0.0367, -0.0364,  ..., -0.0763, -0.0872, -0.0648],
         [ 0.4041,  0.4157,  0.4007,  ...,  0.3263,  0.2994,  0.2842],
         [ 0.1058,  0.1614,  0.1824,  ...,  0.1432,  0.1486,  0.1457]]],
       device='cuda:0')

Thanks for the update! You’ve printed the truncated values of the tensors, so the shapes are still unknown. You can print the shape of a tensor via print(tensor.shape).

I printed the shapes of the tensors, and here is the result, with a batch size of 8:
torch.Size([8, 8192, 8])
torch.Size([8, 8, 8192])

And it’s now showing this error after I reduced the data size:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 8.99 GiB already allocated; 1.32 GiB free; 9.39 GiB reserved in total by PyTorch)

Each of these tensors will allocate 2 MB of memory (8 * 8192 * 8 * 4 / 1024**2 = 2.0 MB), while the result will use 2.0 GiB (8 * 8192 * 8192 * 4 / 1024**3 = 2.0 GiB), which would fit your last error message. You could run this code snippet to verify it:

import torch

a = torch.randn(8, 8192, 8, device='cuda')
b = torch.randn(8, 8, 8192, device='cuda')
print('{:.3f}MB'.format(torch.cuda.memory_allocated() / 1024**2))  # ~4MB for a and b together
print(torch.cuda.memory_summary())

c = torch.matmul(a, b)
print(c.shape)  # torch.Size([8, 8192, 8192])
print('{:.3f}MB'.format(torch.cuda.memory_allocated() / 1024**2))  # ~2052MB after creating c
print(torch.cuda.memory_summary())

However, this doesn’t fit the initial description of the error, which claims you are trying to allocate 256 GB of memory, so you would have to narrow down the operation further, as it’s not caused by this matrix multiplication with these tensor shapes.
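As an illustration of the broadcasting guess: torch.bmm requires 3-D inputs, but torch.matmul broadcasts the batch dimensions, so a stray singleton dimension can silently blow up the result. A minimal sketch with made-up, deliberately small sizes:

import torch

a = torch.randn(8, 1, 64, 8)   # note the accidental singleton dimension
b = torch.randn(1, 8, 8, 64)
c = torch.matmul(a, b)         # batch dims broadcast to (8, 8)
print(c.shape)                 # torch.Size([8, 8, 64, 64])
# With the real sizes from this thread (8192 instead of 64), the broadcast
# result would need 8 * 8 * 8192 * 8192 * 4 bytes = 16 GiB instead of 2 GiB.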

Thank you so much for your reply.
I just want to post the last error here, which I think is caused by something else:

Traceback (most recent call last):
  File "./train.py", line 141, in <module>
    model.optimize_parameters()   # calculate loss functions, get gradients, update network weights
  File "/content/gdrive/My Drive/Colab Notebooks/STANet-withpth/models/CDFA_model.py", line 117, in optimize_parameters
    self.backward()                   # calculate graidents for G
  File "/content/gdrive/My Drive/Colab Notebooks/STANet-withpth/models/CDFA_model.py", line 107, in backward
    self.loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 9.06 GiB already allocated; 1.32 GiB free; 9.39 GiB reserved in total by PyTorch)

The new error could be caused by the mentioned matrix multiplication, since PyTorch tries to allocate 2 GB, which would also be needed by this operation.
However, since this allocation is expected, you might want to reduce the memory usage, e.g. by lowering the batch size.
As previously described, the initial error claiming to allocate 256 GB is most likely caused by another operation, and my best guess is still that unwanted broadcasting was accidentally applied somewhere.
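A hedged side note: if the validation path does not already disable autograd, running inference under torch.no_grad() avoids storing activations for backward and can also reduce memory considerably. A minimal, self-contained sketch with a stand-in model:

import torch
from torch import nn

# Sketch: no autograd graph (and no saved activations) inside no_grad().
model = nn.Linear(10, 10).cuda()   # stand-in for the real model
x = torch.randn(8, 10, device='cuda')

model.eval()
with torch.no_grad():
    out = model(x)
print(out.requires_grad)           # False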

Hi again! I just want to share the solution to this problem with you. I actually had a problem with the size of the images, which made training impossible: the images in the validation folder were 1024*1024, although I had resized the images in the training folder.
That was the main problem, so I had to resize all the images in the val folder to 256*256, and now it’s working.
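This also fits the quadratic memory behaviour of self-attention: the N x N attention matrix grows with the fourth power of the image side length. A rough, hedged estimate, assuming ds=8 and batch size 8 as above:

# Rough size of the attention (energy) tensor for two image sizes,
# assuming ds=8 and batch size 8 as in this thread.
ds, batch, bytes_per_float = 8, 8, 4
for side in (256, 1024):
    n = (side // ds) ** 2                       # positions after pooling
    energy_bytes = batch * n * n * bytes_per_float
    print(side, '{:.2f} GiB'.format(energy_bytes / 1024**3))
# 256  -> 0.03 GiB
# 1024 -> 8.00 GiB: a 256x increase for a 4x larger image side
# (autograd keeps additional copies on top of this during training).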

The script for resizing the images:

from glob import glob
import cv2
from tqdm import tqdm

!mkdir ./data/val
!mkdir ./data/val/A
!mkdir ./data/val/B
!mkdir ./data/val/label

# Resize folder A; repeat for B and label (for the label masks, pass
# interpolation=cv2.INTER_NEAREST so the resized masks stay binary).
file_list = glob("../ChangeDetection/data/LEVIR_CD/val_orig/A/*.png")
for file_name in tqdm(file_list):
    img_data = cv2.imread(file_name, cv2.IMREAD_UNCHANGED)
    img_resized_data = cv2.resize(img_data, (256, 256))
    dst_name = file_name.replace("_orig", "")
    cv2.imwrite(dst_name, img_resized_data)

Thanks, Srie Raam, for the help!

Hi,
I got the same error after updating my PyTorch to 1.10.2.
The error happens very early, in model.to(device), or a little later, when calling model.forward(). I changed the batch_size from 64 to 8, and now it looks like it works, but I am wondering what else I can do. Is uninstalling the new version and reinstalling the old version of PyTorch a good idea? I am worried that it would cause problems for other code that has been running. I would appreciate it if you could help me.

You could check how the peak memory of your application compares between the older PyTorch release and the current one (1.12). If you want to save additional memory, you could install the latest nightly with the CUDA 11.7 runtime and enable lazy module loading via CUDA_MODULE_LOADING=LAZY (which should also be enabled by default in the latest nightlies).
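To compare the two releases, a minimal sketch for measuring the peak GPU memory (the commented line stands in for one real training or validation iteration):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run one full training/validation iteration here ...
torch.cuda.synchronize()
print('peak allocated: {:.2f} GiB'.format(torch.cuda.max_memory_allocated() / 1024**3))
print('peak reserved:  {:.2f} GiB'.format(torch.cuda.max_memory_reserved() / 1024**3))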


Thanks for your help.