RuntimeError: CUDA out of memory. Tried to allocate 2.18 GiB (GPU 0; 15.92 GiB total capacity; 13.71 GiB already allocated; 1.25 GiB free; 13.74 GiB reserved in total by PyTorch)

File "d:\anaconda3\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "d:\anaconda3\lib\site-packages\fire\core.py", line 468, in _Fire
    target=component.__name__)
  File "d:\anaconda3\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "main.py", line 103, in train
    optimizer.step()
  File "d:\anaconda3\lib\site-packages\torch\autograd\grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "d:\anaconda3\lib\site-packages\torch\optim\adam.py", line 107, in step
    denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])

I set the batch size to 1, but I still get CUDA out of memory… I don’t know what is wrong…
This is my training loop:

    for epoch in range(opt.max_epoch):
        loss_mean = 0.0
        loss_val = 0.0
        loss_meter.reset()

        for ii,(data,label) in tqdm(enumerate(train_dataloader),total=len(train_data)):

            # train model 
            input = Variable(data)
            input = input.float()
            target = Variable(label)
            if opt.use_gpu:
                input = input.cuda()
                target = target.cuda()

            optimizer.zero_grad()
            score = model(input)
            loss = criterion(score,target) / (400 * 190)
            loss.backward()
            optimizer.step() 

The error occurs in optimizer.step() during the first epoch…

I had the same problem, and I solved it by using with torch.no_grad():

For example,

            # train model 
            input = Variable(data)
            input = input.float()
            target = Variable(label)
            
            if opt.use_gpu:
                input = input.cuda()
                target = target.cuda()

            optimizer.zero_grad()
            with torch.no_grad() :            
                score = model(input)
                loss = criterion(score,target) / (400 * 190)

            loss = Variable(loss, requires_grad = True)
            loss.backward()
            optimizer.step() 

Thank you, the CUDA out of memory error is solved.

Wrapping the forward pass in a torch.no_grad() block will not store any intermediate activations, which would be needed to compute the gradients during the backward pass.
You would get an error in loss.backward(), but you are avoiding it by detaching the loss and setting requires_grad=True in:

loss = Variable(loss, requires_grad = True)

However, your training is still broken and the model will not be updated.

torch.no_grad() should only be used during evaluation and testing, when no gradients should be computed and no parameter updates are needed.
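
To make this concrete, here is a minimal, self-contained sketch (a toy nn.Linear model and random data, not the code from this thread) showing that the no_grad() workaround leaves the parameters without gradients, while the ordinary forward pass trains as expected:

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 2)

# Broken pattern from above: forward pass under no_grad, loss re-attached afterwards
optimizer.zero_grad()
with torch.no_grad():
    loss = criterion(model(x), y)
loss = loss.requires_grad_()   # same effect as Variable(loss, requires_grad=True)
loss.backward()                # runs without error...
print(model.weight.grad)       # ...but prints None, so optimizer.step() changes nothing

# Correct pattern: no no_grad() around the training forward pass
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
print(model.weight.grad.abs().sum())  # non-zero gradients
optimizer.step()

In short, remove the no_grad() block (and the re-wrapped loss) from the training loop; it belongs in the validation/testing loop only.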

CC @tianle-BigRice


Yeah, I didn’t use torch.no_grad() in training before; I only used torch.no_grad() during model.eval(), and this error did not occur. But this time I don’t know why the error occurs during training. Even when I set the batch size to 1, the error still occurs, and I don’t know why.

If you are not storing the loss directly in e.g. a list or any other tensor, which is attached to the computation graph, your model might just use too much memory.
Are you seeing the OOM issue in the first iteration(s) or later in training?
In the former case, you could try to trade compute for memory via torch.utils.checkpoint, while the latter case points towards storing tensors without detaching them.
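
As an illustration of the “storing tensors without detaching them” case, here is a sketch reusing the names from your training loop above (model, criterion, optimizer, train_dataloader are assumed from your post):

loss_history = []
for ii, (data, label) in enumerate(train_dataloader):
    input = data.float().cuda()
    target = label.cuda()

    optimizer.zero_grad()
    score = model(input)
    loss = criterion(score, target)
    loss.backward()
    optimizer.step()

    # loss_history.append(loss)        # keeps every iteration's graph alive -> memory grows until OOM
    loss_history.append(loss.item())   # .item() (or .detach()) frees the graph immediately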

Yeah, I see the OOM issue in the first iteration(s). The error occurs at optimizer.step(), and this is how I store the loss (but the issue remains):

score = model(input)
loss = criterion(score,target) 
running_loss = loss.item()
loss.backward()
optimizer.step()  #optimizer = t.optim.Adam(model.parameters(),lr = lr)

I found that this error comes from score = model(input). So should I add torch.utils.checkpoint inside the model?

If I use this code during training:

            with t.no_grad() :
                score = model(input)

will the parameters not be updated?
And when I use this code:

loss = checkpoint(criterion,score,target)

it didn’t work.
Thank you very much for your continued help.

This is my error:


  File "D:\python\RDANET\main.py", line 197, in <module>
    fire.Fire(train)

  File "d:\anaconda3\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)

  File "d:\anaconda3\lib\site-packages\fire\core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(

  File "d:\anaconda3\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\python\RDANET\main.py", line 112, in train
    optimizer.step()

  File "d:\anaconda3\lib\site-packages\torch\autograd\grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)

  File "d:\anaconda3\lib\site-packages\torch\optim\adam.py", line 91, in step
    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)

RuntimeError: CUDA out of memory. Tried to allocate 1.35 GiB (GPU 0; 15.92 GiB total capacity; 12.89 GiB already allocated; 1.22 GiB free; 13.67 GiB reserved in total by PyTorch)

Is this the “storing tensors without detaching them” issue you mentioned?
Thank you very much.
And this is my model’s forward:

        # image_size = [400, 190], block_nums = 8, num_classes = 1000
        x = self.maxpool(F.leaky_relu(self.conv1_1(x)))
        x = self.maxpool(F.leaky_relu(self.conv2_1(x)))
        x = self.maxpool(F.leaky_relu(self.conv3_2(F.leaky_relu(self.conv3_1(x)))))
        x = self.maxpool(F.leaky_relu(self.conv4_2(F.leaky_relu(self.conv4_1(x)))))
        x = self.maxpool(F.leaky_relu(self.conv5_2(F.leaky_relu(self.conv5_1(x)))))
        x = self.dropout(x)
        x = x.view(x.size(0), 512 * 12 * 5)
        x = F.leaky_relu(self.fc1(x))
        x = F.leaky_relu(self.fc2(x))
        x = x.reshape(x.size(0),-1, int(self.image_size[0]/2), int(self.image_size[1]/2))
        # print('x:',x.size())
        
        x = self.res1(x)
        out = self.res2(x)
        out = self.res3(out)
        out = out + x
        out = self.res4(out)
        # print('res4:',out.size())
        #shuffle
        # print(out.shape)
        out = self.shuffle(out)
        # print(out.shape)
        # print('shuffle:',out.size())
        out = self.res5(out)
        # print(out.shape)
        out = np.squeeze(out)
        # print(out.shape)
        return out

Yes, the parameters will not get any gradients and thus the optimizer will not update them.

You should wrap blocks of the model in checkpoints. You can find an older tutorial here.
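
In case it helps, here is a rough, self-contained sketch of that pattern with torch.utils.checkpoint (the blocks are placeholders, not the exact layers of the model in this thread):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # placeholder blocks; substitute the conv/res blocks of your model here
        self.block1 = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.LeakyReLU(), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.LeakyReLU(), nn.MaxPool2d(2))
        self.head = nn.Conv2d(128, 1, 1)

    def forward(self, x):
        # each checkpointed block stores no intermediate activations in the forward pass
        # and recomputes them during backward (compute traded for memory)
        x = checkpoint(self.block1, x)
        x = checkpoint(self.block2, x)
        return self.head(x)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Net().to(device)
# the input of the first checkpointed block must require gradients,
# otherwise that block's parameters will not receive any
x = torch.randn(1, 1, 160, 384, device=device, requires_grad=True)
model(x).mean().backward()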

I wrapped blocks of the model in checkpoints according to the tutorial, but I get a flood of errors.
My code:

def conv_lrelu(in_ch, out_ch, ker_sz, pad):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, ker_sz, padding=pad, bias=False),
                         nn.LeakyReLU())
def seg1(self, x):
      x = self.layer1(x)
      x = self.maxpool(x)
      return x
def fc(self, x):
      x = self.dropout(x)
      x = x.view(x.size(0), 512 * 12 * 5)
      x = F.leaky_relu(self.fc1(x))
      x = F.leaky_relu(self.fc2(x))
      x = x.view(x.size(0), -1, int(self.image_size[0]/2), int(self.image_size[1]/2))
      return x  # without this return, checkpoint(self.fc, x) passes None on to the next block
def EDSR(self,x):
      x = self.res1(x)
      out = self.res2(x)
      out = self.res3(out)
      out = out + x
      out = self.res4(out)
      out = self.shuffle(out)
      out = self.res5(out)
      out = np.squeeze(out)
      return out
x = checkpoint(self.seg1, x)
x = checkpoint(self.seg2, x)
x = checkpoint(self.seg3, x)
x = checkpoint(self.seg4, x)
x = checkpoint(self.seg5, x)
x = checkpoint(self.fc, x)
out = checkpoint(self.EDSR, x)

In addition, no matter how I adjust the batch size, there is always CUDA out of memory. But I have 32 GB of memory. And the program ran normally on this computer before; there was no CUDA out of memory problem. I really don’t know what went wrong, or why it worked well before and now it can’t run.

The error message claims your device has 16GB, so you might be using the wrong device?
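
(If the 32 GB refers to system RAM or to both GPUs combined, note that each allocation still has to fit into a single GPU’s 16 GB.) A quick sketch to check which devices PyTorch sees and to target the second GPU explicitly:

import torch

print(torch.cuda.device_count())            # expected: 2
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, round(props.total_memory / 1024**3, 1), 'GiB')
print(torch.cuda.current_device())          # default GPU used by .cuda()

device = torch.device('cuda:1')             # pick the second GPU explicitly
# model = model.to(device); input = input.to(device); target = target.to(device)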

I have two GPUs; GPU 0 has 16 GB and GPU 1 has 16 GB.
What confuses me is that the program used to run normally, but now there is always a CUDA out of memory problem.

I just tested with another dataset, and the program runs normally. This is the model code for that other dataset:

# input = 128*128, image_size = [160, 160], block_nums = 10, num_classes = 1000
x = self.maxpool(F.leaky_relu(self.conv1_1(x)))
x = self.maxpool(F.leaky_relu(self.conv2_1(x)))
x = self.maxpool(F.leaky_relu(self.conv3_2(F.leaky_relu(self.conv3_1(x)))))
x = self.maxpool(F.leaky_relu(self.conv4_2(F.leaky_relu(self.conv4_1(x)))))
x = self.maxpool(F.leaky_relu(self.conv5_2(F.leaky_relu(self.conv5_1(x)))))
x = self.dropout(x)
x = x.view(x.size(0), 512 * 4 * 4)
x = F.leaky_relu(self.fc1(x))#self.fc1 = nn.Linear(512 * 4 * 4, 6400)
x = F.leaky_relu(self.fc2(x))#self.fc2 = nn.Linear(6400, 6400)
x = x.reshape(x.size(0),-1, int(self.image_size[0]/2),int(self.image_size[1]/2))
x = self.res1(x)
out = self.res2(x)
out = self.res3(out)
out = out + x
out = self.res4(out)
out = self.shuffle(out)
out = self.res5(out)
out = np.squeeze(out)

This is the code for the new dataset:

# input = 160*384, image_size = [400, 190], block_nums = 8, num_classes = 1000
x = self.maxpool(F.leaky_relu(self.conv1_1(x)))
x = self.maxpool(F.leaky_relu(self.conv2_1(x)))
x = self.maxpool(F.leaky_relu(self.conv3_2(F.leaky_relu(self.conv3_1(x)))))
x = self.maxpool(F.leaky_relu(self.conv4_2(F.leaky_relu(self.conv4_1(x)))))
x = self.maxpool(F.leaky_relu(self.conv5_2(F.leaky_relu(self.conv5_1(x)))))
x = self.dropout(x)
x = x.view(x.size(0), 512 * 12 * 5)
x = F.leaky_relu(self.fc1(x))#self.fc1 = nn.Linear(512 * 12 * 5, 19000)
x = F.leaky_relu(self.fc2(x))#self.fc2 = nn.Linear(19000, 19000)
x = x.reshape(x.size(0),-1, int(self.image_size[0]/2), int(self.image_size[1]/2))       
x = self.res1(x)
out = self.res2(x)
out = self.res3(out)
out = out + x
out = self.res4(out)
out = self.shuffle(out)
out = self.res5(out)
out = np.squeeze(out)

I don’t understand why the new dataset can’t run properly. Is it because the input and output are too large? But I also reduced the batch size and it still can’t run.

Yes, most likely. If neither a batch size of 1 nor checkpointing fits into memory, you could try model sharding, i.e. executing separate parts of the model on different GPUs.
If this also needs too much memory, you would have to change the model architecture and make sure its memory requirement fits the available device memory.
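
For reference on the size: fc2 = nn.Linear(19000, 19000) alone has 19000 × 19000 ≈ 3.6 × 10^8 weights, i.e. about 1.35 GiB in FP32, and Adam keeps two extra state tensors of that size per parameter, which matches the 1.35 GiB allocation failing inside adam.py in the traceback above. A rough sketch of the model-sharding idea (assumes two visible GPUs; the blocks are placeholders, not the exact layers from this thread):

import torch
import torch.nn as nn

class ShardedNet(nn.Module):
    def __init__(self):
        super().__init__()
        # convolutional part lives on the first GPU
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.LeakyReLU(),
            nn.MaxPool2d(2), nn.AdaptiveAvgPool2d((12, 5))
        ).to('cuda:0')
        # large fully connected part lives on the second GPU
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 12 * 5, 1024), nn.LeakyReLU(), nn.Linear(1024, 10)
        ).to('cuda:1')

    def forward(self, x):
        x = self.features(x.to('cuda:0'))
        x = self.classifier(x.to('cuda:1'))   # move activations between devices by hand
        return x

model = ShardedNet()
out = model(torch.randn(1, 1, 160, 384))
out.mean().backward()   # autograd follows the graph across both devices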