I have a server equipped with 4 Titan X GPUs. The problem is that when I run my code, it reports “cuda runtime error (2) : out of memory”. The output of gpustat tells me that only gpu0’s memory is being used.
The corresponding TensorFlow code can use the memory of all 4 GPUs, which is 48 GB in my case.
Maybe the extra .cuda() call on the DataParallel wrapper is causing the problem. Take a look at the CUDA semantics and DataParallel docs.
Have you tried it this way?
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = net()
if torch.cuda.device_count() > 1:
    # device_ids defaults to all visible GPUs
    model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3])
model.to(device)
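The forward pass itself does not change; DataParallel scatters the input along dim 0 (the batch dimension by default) under the hood. A minimal usage sketch, assuming inputs is a batch-first tensor:

output = model(inputs.to(device))  # inputs: [batch, ...]; the batch is split across the GPUs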
It reports the same error even when I try the code snippet you provided. Thanks anyway.
I suspect the reason is the for loop in my code. My input format is [batch, time, height, width, channel], and my code loops over the time axis.
Regarding the DataParallel issue: as I suspected, my code has a for loop over the temporal/time axis, which I had placed first, i.e. [time, batch, channel, height, width]. DataParallel splits the input along the first axis, so with 20 time steps and 4 GPUs each replica receives a chunk of length 5. When the loop tries to index the 6th time step, it reports an index out of range error.
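In case anyone else hits this, here is a minimal sketch of the fix: keep the batch dimension first so DataParallel scatters along it, and run the time loop inside forward(). The TemporalNet model, layer sizes, and tensor shapes below are made up for illustration, not my actual code.

import torch
import torch.nn as nn

class TemporalNet(nn.Module):
    """Toy model that loops over the time axis inside forward()."""
    def __init__(self, channels=3, hidden=16):
        super().__init__()
        self.conv = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)

    def forward(self, x):
        # x: [batch, time, channel, height, width] -- with batch first,
        # DataParallel splits the batch, and every replica keeps all time steps.
        outputs = []
        for t in range(x.size(1)):  # iterate over the full time length
            outputs.append(self.conv(x[:, t]))
        return torch.stack(outputs, dim=1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TemporalNet()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # scatters along dim=0 (batch) by default
model.to(device)

x = torch.randn(8, 20, 3, 32, 32).to(device)  # [batch=8, time=20, C, H, W]
out = model(x)  # on 4 GPUs, each replica sees [2, 20, 3, 32, 32]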