CUDA out of memory even with DataParallel

I was using 1 GPU with a batch size of 64 and got a CUDA out of memory error, so I reduced the batch size to 16 to solve it. But when I use 4 GPUs and a batch size of 64 with DataParallel, I still get the same error:

My code:

import os
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# e.g. CUDA_VISIBLE_DEVICES="0,1,2,3" -> [0, 1, 2, 3]
device_ids = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
device_ids = [int(i) for i in device_ids]

encoder = EncoderRNN(input_size, hidden_size, SRC).to(device)
decoder = DecoderRNN(hidden_size, output_size, TRG).to(device)
model = nn.DataParallel(Model(encoder, decoder), device_ids=device_ids).to(device)

With DataParallel we should be able to use multiple GPUs and hence increase the batch size.
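For reference, here is a toy sketch of what I expect nn.DataParallel to do (the linear layer is only a stand-in for my seq2seq model): the global batch of 64 should be scattered across the 4 visible GPUs, so each replica only sees 16 samples.

import torch
import torch.nn as nn

# toy module only to illustrate the batch split, not my actual model
toy = nn.DataParallel(nn.Linear(10, 10), device_ids=[0, 1, 2, 3]).to('cuda')
x = torch.randn(64, 10, device='cuda')  # global batch of 64
out = toy(x)                            # each GPU processes a chunk of 16 samples
print(out.shape)                        # torch.Size([64, 10]), gathered back on cuda:0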

nn.DataParallel might use more memory on the default device, as described in this blog post. We generally recommend using DistributedDataParallel to avoid these issues and to get the best performance.
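In case it helps, here is a minimal single-node DDP sketch (the nn.Linear is a placeholder for Model(encoder, decoder) from your code, and the mp.spawn/env setup is just one common launch pattern, not the only way to start DDP):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # one process per GPU; each process only touches its own device
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 10).to(rank)  # substitute your Model(encoder, decoder) here
    model = DDP(model, device_ids=[rank])

    # each process works on its own per-GPU batch (e.g. 16), so 4 GPUs give an
    # effective global batch of 64 without the memory imbalance on cuda:0
    x = torch.randn(16, 10, device=rank)
    out = model(x)

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)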

Should the device be cuda:0 or cuda?
Also, after using multiple GPUs with DataParallel, we should be able to increase the batch size compared to a single GPU. In my case I am still not able to increase the batch size.

cuda:0 and cuda should refer to the same device.
Could you check that all devices are empty before running the script?
If that’s the case, I would recommend trying DDP.
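For example, you could quickly check whether other processes already occupy memory on the visible devices before starting the run (torch.cuda.mem_get_info is available in recent PyTorch releases; nvidia-smi gives the same information):

import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # values in bytes
    print(f"cuda:{i}: {free / 1e9:.2f} GB free of {total / 1e9:.2f} GB")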

Could the PyTorch team fix it then? It’s important to have a single-process multi-device solution: as multi-GPU nodes get bigger and bigger, multi-node training becomes less and less necessary. We need something simpler than DDP, please :slight_smile:

Double post from here.