Hi, I’m using nn.DataParallel(model) and training works well. However, after some epochs I run a computation that seems to happen on a single GPU, and a CUDA out of memory error is thrown. This is the part of the code that raises the error:
File "/home/xxx/xx/xxx/trainers.py", line 115, in compute_distance
dists.add_(torch.tril(100000 * torch.ones(len(features), len(features)).cuda()))
RuntimeError: CUDA out of memory. Tried to allocate 3.96 GiB (GPU 0; 11.17 GiB total capacity; 9.46 GiB already allocated; 1.39 GiB free; 17.78 MiB cached)
Is there a way to make the add_ op work? I don’t want to reduce the batch size, as that would increase the training time.
I assume this method is called in your forward method?
If so, use the device of a known parameter (e.g. .to(my_param.device)) instead of the default device via .cuda().
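A minimal sketch of that idea, assuming a toy model and made-up feature shapes (the `model`, `features`, and sizes here are placeholders, not your actual code):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real model; any parameter works
# as a source of the correct device.
model = nn.Linear(8, 8)
my_param = next(model.parameters())

# Allocate on the same device as the parameter instead of calling
# .cuda(), which always targets the default GPU (cuda:0).
features = torch.randn(16, 8, device=my_param.device)
n = len(features)
dists = torch.zeros(n, n, device=my_param.device)
dists.add_(torch.tril(100000 * torch.ones(n, n, device=my_param.device)))
```

This way the temporary tensors follow the model rather than always landing on cuda:0.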
Thank you,
unfortunately, this is not inside the forward method. I compute the pairwise distance of the extracted feature. Since i have a reference to the model, Can I use .to(model.device)? or .to(model.my_param.device)?
If this is not inside the forward method, nn.DataParallel won’t be able to replicate it on each device.
Since you are running it outside, you could try not to use the default device, as its memory usage will usually be higher than that of the other devices.
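A sketch of running the computation on a non-default device. This assumes at least two GPUs are visible (it falls back to CPU otherwise so the snippet stays runnable), and uses torch.cdist as a stand-in for the actual pairwise-distance code:

```python
import torch

# Pick a device other than cuda:0, which already holds the
# nn.DataParallel master copy and its activations.
device = torch.device('cuda:1' if torch.cuda.device_count() > 1 else 'cpu')

features = torch.randn(16, 8, device=device)
dists = torch.cdist(features, features)  # pairwise distances, placeholder
n = len(features)
dists.add_(torch.tril(100000 * torch.ones(n, n, device=device)))
```
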
Thank you, I managed to make it work up to a point. Now the data on cuda:0 needs to be added to the data on cuda:1. If I move the data from cuda:0 to cuda:1 before the addition, it throws CUDA out of memory.
Say I have
a0 = torch.randn(4, 4, device='cuda:0')
a1 = torch.randn(4, 4, device='cuda:1')
To compute a0 += a1, I first need to move a1 to cuda:0 or a0 to cuda:1, which throws CUDA out of memory. However, the result of a0 += a1, if it could be computed, would fit on either cuda:0 or cuda:1.
Is there a way to solve this problem? I’m using our university’s GPU, a Tesla K80.
Thank you.
I first need to move a1 to cuda:0 or a0 to cuda:1, which throws CUDA out of memory. However, the result of a0 += a1, if it could be computed, would fit on either cuda:0 or cuda:1.
If a0 and a1 don’t both fit on a single GPU, you could try putting part of each tensor on each GPU, adding the chunks there, and then concatenating the results with torch.cat. I.e., split each tensor in half, move the first half of each tensor to ‘cuda:0’ and add them, do the same with the second halves on ‘cuda:1’, then concatenate the two partial results.
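The chunked addition above could look something like this. It is a sketch, not the original code: it falls back to a single CPU device when two GPUs are not available, so it runs anywhere, and the 4x4 shapes are just for illustration:

```python
import torch

# Use two GPUs when available; otherwise fall back to CPU so the
# sketch remains runnable.
if torch.cuda.device_count() > 1:
    dev0, dev1 = torch.device('cuda:0'), torch.device('cuda:1')
else:
    dev0 = dev1 = torch.device('cpu')

a0 = torch.randn(4, 4, device=dev0)
a1 = torch.randn(4, 4, device=dev1)

half = a0.size(0) // 2
# Add the first halves on dev0 and the second halves on dev1, so
# neither device ever holds both full tensors at once.
top = a0[:half] + a1[:half].to(dev0)
bottom = a0[half:].to(dev1) + a1[half:]

# Gather the partial results on one device and concatenate.
result = torch.cat([top, bottom.to(dev0)], dim=0)
```

Only half of the "foreign" tensor is resident on each device at any time, which is what makes the peak memory lower than a full a1.to(dev0) copy.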