CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 15.90 GiB total capacity; 14.80 GiB already allocated; 47.88 MiB free; 15.16 GiB reserved in total by PyTorch)

    def sub_forward(self, x):
        # ResNet backbone: stem followed by the four residual stages.
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        print("ok")
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        print("ok1")
        return x

    def forward(self, x, labels=None, return_cam=False):
        batch_size = x.shape[0]
        x1 = self.sub_forward(x)
        x1 = self.conv6(x1)
        x1 = self.relu(x1)
        x1 = self.conv7(x1)
        x1 = self.relu(x1)
        x2 = self.sub_forward(x)
        x2 = self.conv8(x2)
        x2 = self.relu(x2)
        x2 = self.conv9(x2)
        x2 = self.relu(x2)
        x = x1 + x2
        if return_cam:
            normalized_feature_map = normalize_tensor(x.detach().clone())
            cams = normalized_feature_map[range(batch_size), labels]
            return cams

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        print("ok2")
        return {'logits': x}

I am training a ResNet model. I cannot change the architecture of the model, nor can I reduce the batch size. Is there anything else I can do to avoid this error?

Output:

ok
ok1
ok
ok1
ok2
ok

It can be seen that the first forward pass completes (ok, ok1, ok, ok1, ok2), but the second fails early inside sub_forward: only the final "ok" is printed before the OOM.
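
To confirm where the memory grows between passes, one option is to print allocator stats at each marker. A minimal sketch, assuming a PyTorch version with torch.cuda.memory_reserved (the helper name log_gpu_memory is mine):

    import torch

    def log_gpu_memory(tag):
        # memory_allocated: bytes currently occupied by live tensors.
        # memory_reserved: bytes held by the caching allocator (what the
        # OOM message reports as "reserved in total by PyTorch").
        alloc = torch.cuda.memory_allocated() / 2**20
        reserved = torch.cuda.memory_reserved() / 2**20
        print(f"{tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

If "allocated" keeps climbing from one iteration to the next, something (e.g. a loss tensor kept with its graph attached) is holding on to the previous pass.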

@albanD any suggestions?

Hi,

Not much, that’s why I did not answer anything :confused:
You can check this related post for potential solutions: How can you train your model on large batches when your GPUs can only hold couple of batches
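
One of the techniques discussed there is gradient accumulation: run several small forward/backward passes and call optimizer.step() only once, so the effective batch size stays large while per-step memory stays small. A minimal sketch, assuming hypothetical model, optimizer, loss_fn, and loader objects (the model returns {'logits': ...} as in the code above):

    accumulation_steps = 4  # effective batch = per-step batch * 4

    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader):
        inputs, labels = inputs.cuda(), labels.cuda()
        out = model(inputs)
        # Scale so the accumulated gradient matches one big batch.
        loss = loss_fn(out['logits'], labels) / accumulation_steps
        loss.backward()  # gradients accumulate in param.grad
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()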


I am working on a Google Colab GPU with 13 GB of RAM. Would moving to a GPU with larger RAM solve this problem?

Yes, having more memory will remove this issue. How much more you need, though, I don’t know :confused:
EDIT: As mentioned below, yes, more GPU memory will help, not more RAM. I misread the message above.

@albanD it is not RAM that you need, it is GPU memory… try smaller batch sizes: 32, 16, 8, 4, 2, 1…

But with a smaller batch size you are really not going to get good generalization…
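
If the batch size cannot shrink, activation checkpointing is another option that touches neither the architecture nor the batch size: the residual stages are recomputed during backward instead of having their activations stored. A sketch of the same sub_forward using torch.utils.checkpoint (the use_reentrant argument exists on recent PyTorch versions; omit it on older ones):

    from torch.utils.checkpoint import checkpoint

    def sub_forward(self, x):
        # The stem runs normally; it is cheap compared to the stages.
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        # Do not store activations for the residual stages; recompute
        # them during backward. Trades extra compute for less memory.
        x = checkpoint(self.layer1, x, use_reentrant=False)
        x = checkpoint(self.layer2, x, use_reentrant=False)
        x = checkpoint(self.layer3, x, use_reentrant=False)
        x = checkpoint(self.layer4, x, use_reentrant=False)
        return x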
