CUDA out of memory even with DataParallel

Akshay_Goindani · May 11, 2020, 7:29pm

I was using 1 GPU with a batch size of 64 and got a CUDA out of memory error, so I reduced the batch size to 16 to fix it. But when I use 4 GPUs with a batch size of 64 and DataParallel, I get the same error:

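My code:
import os
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# GPU indices taken from CUDA_VISIBLE_DEVICES, e.g. "0,1,2,3".
# Note: visible devices are renumbered from 0 inside the process, so this
# only matches PyTorch's indices when the variable lists 0,1,...,N-1.
device_ids = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
device_ids = [int(i) for i in device_ids]

encoder = EncoderRNN(input_size, hidden_size, SRC).to(device)
decoder = DecoderRNN(hidden_size, output_size, TRG).to(device)
model = nn.DataParallel(Model(encoder, decoder), device_ids=device_ids).to(device)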

With DataParallel we can use multiple GPUs, so I expected to be able to increase the batch size.

ptrblck · May 12, 2020, 6:06am

nn.DataParallel might use more memory on the default device, as described in this blog post. We generally recommend using DistributedDataParallel to avoid these issues and to get the best performance.
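
For readers who have not used DistributedDataParallel, here is a minimal sketch of what the switch could look like. It is not the code from this thread: `build_model` is a hypothetical stand-in for `Model(encoder, decoder)`, the tensor shapes and learning rate are made up, and the script assumes it is launched with one process per GPU (for example by `torchrun`, which sets `RANK`, `LOCAL_RANK` and `WORLD_SIZE`). The environment defaults are only there so the sketch also runs as a single process.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def build_model() -> nn.Module:
    # Hypothetical stand-in for Model(encoder, decoder) from the question.
    return nn.Linear(32, 8)


def main() -> None:
    # A launcher such as torchrun sets these; the defaults below are an
    # assumption that lets the sketch run as one standalone process.
    rank = int(os.environ.get("RANK", "0"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    use_cuda = torch.cuda.is_available()
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",
        rank=rank,
        world_size=world_size,
    )
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    # Each process owns one replica on its own device.
    model = build_model().to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # Each process works on its own per-GPU batch (16 samples here).
    inputs = torch.randn(16, 32, device=device)
    targets = torch.randn(16, 8, device=device)

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(ddp_model(inputs), targets)
    loss.backward()  # gradients are averaged across processes during backward
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as `train_ddp.py`, this could be started with something like `torchrun --nproc_per_node=4 train_ddp.py`. Because every process holds its own replica and its own slice of the data, no single GPU has to hold the inputs or gathered outputs for the whole batch.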

Akshay_Goindani · May 12, 2020, 6:21am

Should the device be cuda:0 or cuda?
Also, with multiple GPUs and DataParallel, we should be able to use a larger batch size than on a single GPU. In my case I still cannot increase the batch size.
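
The arithmetic behind that expectation can be sketched as follows. This is only an illustration, assuming DataParallel splits the input along dimension 0 the way `torch.chunk` does and that the batch really is dimension 0 of the inputs; the helper name `replica_batch_sizes` is made up for this example.

```python
import math


def replica_batch_sizes(batch_size: int, num_gpus: int) -> list:
    """Chunk sizes when a batch is split along dim 0 with torch.chunk-style rules."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(replica_batch_sizes(64, 4))  # [16, 16, 16, 16]
```

So with 4 GPUs and a total batch of 64, each replica sees 16 samples, the same size that fit on a single GPU. Any extra memory used on the default device would then be enough to tip it over the limit.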

ptrblck · May 12, 2020, 6:23am

cuda:0 and cuda should refer to the same device.
Could you check that all devices are empty before running the script?
If that’s the case, I would recommend trying DDP.
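
One way to check that the devices are empty before starting the training script is sketched below. It assumes a PyTorch version that provides `torch.cuda.mem_get_info`; the helper name `gpu_memory_report` is made up for this example. Running `nvidia-smi` in a terminal gives the same kind of information.

```python
import torch


def gpu_memory_report() -> list:
    """Return (device index, free GiB, total GiB) for every visible GPU."""
    report = []
    for idx in range(torch.cuda.device_count()):
        # mem_get_info reports device-wide numbers, so memory held by
        # other processes shows up as "not free" here as well.
        free_bytes, total_bytes = torch.cuda.mem_get_info(idx)
        report.append((idx, free_bytes / 2**30, total_bytes / 2**30))
    return report


for idx, free_gib, total_gib in gpu_memory_report():
    print(f"cuda:{idx}: {free_gib:.2f} GiB free of {total_gib:.2f} GiB")
```

If a device shows much less free memory than its total before the script has allocated anything, another process is probably still holding memory on it.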

Olivier-CR · October 27, 2021, 11:43pm

Could the PyTorch team fix it then? It’s important to have a single-process multi-device solution: as multi-GPU nodes get bigger and bigger, multi-node training becomes less and less useful. We need something simpler than DDP, please.

ptrblck · October 28, 2021, 12:53am

Double post from here.