Add_ doesn't use all the GPUs and throws CUDA out of memory

Hi, I’m using nn.DataParallel(model). The training works well, but after some epochs I do some computation that seems to happen on a single GPU, and a CUDA out of memory error is thrown. This is the part of the code that gives the error:

  File "/home/xxx/xx/xxx/trainers.py", line 115, in compute_distance
    dists.add_(torch.tril(100000 * torch.ones(len(features), len(features)).cuda()))
RuntimeError: CUDA out of memory. Tried to allocate 3.96 GiB (GPU 0; 11.17 GiB total capacity; 9.46 GiB already allocated; 1.39 GiB free; 17.78 MiB cached)

Is there a way to make the add_ op work? I don’t want to reduce the batch size, as this would increase the training time.

Thank you

I assume this method is called in your forward method?
If so, use a device from a known parameter (e.g. .to(my_param.device)) instead of the default device using .cuda().
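Something along these lines (a rough sketch; my_param stands in for any parameter of your model, and dists/features are the tensors from your traceback):

  # Before: allocates the mask on the default device (cuda:0) for every replica
  dists.add_(torch.tril(100000 * torch.ones(len(features), len(features)).cuda()))

  # After: allocate the mask on the same device as a known parameter,
  # so each replica works on its own GPU
  mask = torch.tril(100000 * torch.ones(len(features), len(features),
                                        device=my_param.device))
  dists.add_(mask)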

Thank you,
Unfortunately, this is not inside the forward method. I compute the pairwise distance of the extracted features. Since I have a reference to the model, can I use .to(model.device) or .to(model.my_param.device)?

Thanks.

If this is not inside the forward method, nn.DataParallel won’t be able to replicate it on each device.
Since you are running it outside, you could try not to use the default device, as its memory usage will usually be higher than that of the other devices.
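E.g. something like this (just a sketch, assuming a second GPU is available; torch.cdist stands in for your actual pairwise distance computation):

  import torch

  device = torch.device('cuda:1')  # any device other than the default cuda:0
  features = features.to(device)
  dists = torch.cdist(features, features)  # placeholder for your distance computation
  dists.add_(torch.tril(100000 * torch.ones(len(features), len(features), device=device)))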

Thank you, I managed to make it work up to a point. Now, the data on cuda:0 needs to be added to the data on cuda:1. If I move the data on cuda:0 to cuda:1 before the addition, it throws CUDA out of memory.
Let’s say I have

a0 = torch.randn(4, 4, device='cuda:0')  # on cuda:0
a1 = torch.randn(4, 4, device='cuda:1')  # on cuda:1

To compute a0 += a1, I first need to move a1 to cuda:0 or a0 to cuda:1, which throws CUDA out of memory. However, the result of a0 += a1, if it could be computed, would fit on cuda:0 or cuda:1.
Is there a way to solve this problem? I’m using our university’s GPUs, Tesla K80s.
Thank you.

I don’t think there is a peer-to-peer add function, so one device would have to store both tensors before the addition.
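I.e. something like (a sketch):

  a0 += a1.to(a0.device)  # copies a1 to cuda:0 first, so cuda:0 has to hold both tensors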

I first need to move a1 to cuda:0 or a0 to cuda:1, which throws CUDA out of memory. However, the result of a0 += a1, if it could be computed, would fit on cuda:0 or cuda:1.

If a0 and a1 don’t both fit on a single GPU, you could try putting part of each tensor on each GPU, adding each chunk, and then concatenating the results with torch.cat. I.e. split each tensor in half, put the first halves on ‘cuda:0’ and add them, do the same with the second halves on ‘cuda:1’, and then concatenate the two partial results.
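Roughly like this (a sketch, assuming two GPUs and that each half fits next to the data already on the device):

  import torch

  # split each tensor into two chunks along dim 0
  a0_first, a0_second = a0.chunk(2, dim=0)
  a1_first, a1_second = a1.chunk(2, dim=0)

  # add the first halves on cuda:0 and the second halves on cuda:1
  first = a0_first.to('cuda:0') + a1_first.to('cuda:0')
  second = a0_second.to('cuda:1') + a1_second.to('cuda:1')

  # move the partial results to one device and concatenate
  result = torch.cat([first, second.to('cuda:0')], dim=0)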

Thank you. It’s working this way.