PyTorch CUDA out of memory?

I found the error I'm getting reported in a GitHub issue, but it doesn't seem to have been solved. This is the traceback:

Traceback (most recent call last):
  File "/path/to/run.py", line 380, in <module>
    loss = train()
  File "/path/to/run.py", line 64, in train
    out = model(data).view(-1)
  File "/n/scratch3/users/v/vym1/nn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/path/to/WLGNN.py", line 114, in forward
    xs += [torch.tanh(conv(xs[-1], edge_index, edge_weight))]
  File "/n/scratch3/users/v/vym1/nn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/n/scratch3/users/v/vym1/nn/lib/python3.7/site-packages/torch_geometric/nn/conv/gcn_conv.py", line 169, in forward
    size=None)
  File "/n/scratch3/users/v/vym1/nn/lib/python3.7/site-packages/torch_geometric/nn/conv/message_passing.py", line 236, in propagate
    out = self.message(**msg_kwargs)
  File "/n/scratch3/users/v/vym1/nn/lib/python3.7/site-packages/torch_geometric/nn/conv/gcn_conv.py", line 177, in message
    return edge_weight.view(-1, 1) * x_j
RuntimeError: CUDA out of memory. Tried to allocate 3.37 GiB (GPU 0; 11.17 GiB total capacity; 5.35 GiB already allocated; 1.46 GiB free; 9.29 GiB reserved in total by PyTorch)

I have 100 GB of memory allocated, and it isn't clear to me why PyTorch can't allocate 3.37 GiB when only a small fraction of that total is in use.

Do you have 100GB on a single GPU? (where can I buy one?)

I’m working on a cluster and allocated 100 GB of memory. Is this not available for the GPU?

If the GPU is full, there's no way around it other than getting a bigger one.
However, if you expect this workload to fit normally, check that your tensors have the correct dimensions. Broadcasting can silently produce an enormous intermediate matrix if you are not careful.
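
As a rough illustration (the shapes below are made up, not taken from your model), this is the kind of silent broadcast that blows up memory:

```python
import torch

# Hypothetical sizes, chosen only to illustrate the pitfall.
n = 10_000

a = torch.rand(n)        # shape [n]
b = torch.rand(n, 1)     # shape [n, 1]

# Intended: element-wise product of two length-n vectors.
# Actual: [n] broadcasts against [n, 1] into an [n, n] matrix,
# i.e. 10_000 x 10_000 floats ~ 400 MB; with n in the hundreds of
# thousands this is exactly the kind of allocation that triggers
# "CUDA out of memory".
big = a * b
print(big.shape)         # torch.Size([10000, 10000])

# Fix: make the shapes match explicitly before multiplying.
small = a * b.view(-1)
print(small.shape)       # torch.Size([10000])
```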

You can have 100 GB allocated, but it may be spread across some number of GPUs. You have to make your batch size fit on each individual GPU.
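
A minimal sketch of what that means in practice, assuming a single-node nn.DataParallel setup (the model and sizes are placeholders): the batch you pass in is split along dimension 0, so each GPU only has to hold its own slice plus a replica of the model.

```python
import torch
import torch.nn as nn

# Placeholder model; your real network goes here.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

if torch.cuda.device_count() > 1:
    # DataParallel splits the input batch along dim 0 across the visible GPUs,
    # so each GPU only needs memory for batch_size / num_gpus samples
    # (plus a full copy of the model's parameters).
    model = nn.DataParallel(model)

model = model.cuda()

batch = torch.randn(256, 1024, device="cuda")  # the *total* batch; each GPU sees a slice
out = model(batch)
```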

Your current GPU has a capacity of 11.17 GiB, as given in the error message.
Are you allocating 100 GB of system RAM, or are you combining the memory of all GPUs?
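
If you are not sure what the GPU itself actually has (as opposed to the system RAM your cluster job reserved), you can query it from PyTorch directly; the numbers in the error message come from these counters:

```python
import torch

device = torch.device("cuda:0")
props = torch.cuda.get_device_properties(device)

# Total physical memory on this GPU ("total capacity" in the error message).
print(f"GPU: {props.name}, total: {props.total_memory / 1024**3:.2f} GiB")
# Memory currently occupied by tensors ("already allocated").
print(f"allocated: {torch.cuda.memory_allocated(device) / 1024**3:.2f} GiB")
# Memory held by PyTorch's caching allocator ("reserved in total by PyTorch").
print(f"reserved:  {torch.cuda.memory_reserved(device) / 1024**3:.2f} GiB")
```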

As explained by others, you would have to either reduce the batch size (or the model size) or trade compute for memory via torch.utils.checkpoint.
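
A minimal sketch of the checkpointing route, assuming a simple stack of sub-modules (the Block/Net classes and sizes below are placeholders, not your WLGNN code):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Placeholder standing in for one expensive layer."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.tanh(self.lin(x))

class Net(nn.Module):
    def __init__(self, dim, depth):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # checkpoint() does not keep the block's intermediate activations;
            # it recomputes them during backward, trading compute for memory.
            x = checkpoint(block, x)
        return x

model = Net(dim=128, depth=8).cuda()
x = torch.randn(32, 128, device="cuda", requires_grad=True)
loss = model(x).sum()
loss.backward()
```

The saving comes from not storing activations for every layer during the forward pass, at the cost of one extra forward computation per checkpointed block during backward.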