A question concerning batch size and multiple GPUs in PyTorch

If I set the batch size to 256 and use all of the GPUs on my system (let's say I have 8), will each GPU get a batch of 256, or will it get 256//8?
If my memory serves me correctly, in Caffe all GPUs would get the same batch size, i.e. 256, and the effective batch size would be 8*256, 8 being the number of GPUs and 256 being the batch size.
Is the answer any different when using .cuda() (with no parameters) versus torch.nn.DataParallel(model, device_ids=args.gpus) (specifying which GPU ids to use)?

Thanks a lot in advance


nn.DataParallel splits the data along the batch dimension so that each specified GPU will get a chunk of the batch. If you just call .cuda() (or the equivalent .to() call), you will push the tensor or parameters onto the specified single device.
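To make the numbers from the question concrete, here is a small pure-Python sketch of the chunk sizes. It assumes the split along the batch dimension behaves like torch.chunk (chunks of ceil(batch_size / num_gpus), with a smaller last chunk when the division is uneven); the helper name chunk_sizes is made up for illustration and is not part of PyTorch.

```python
import math


def chunk_sizes(batch_size, num_gpus):
    """Chunk sizes when a batch is split the way torch.chunk splits dim 0.

    Hypothetical helper for illustration only, not a PyTorch function.
    """
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        size = min(chunk, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


print(chunk_sizes(256, 8))  # [32, 32, 32, 32, 32, 32, 32, 32]
print(chunk_sizes(10, 4))   # [3, 3, 3, 1]
```

So with a batch size of 256 and 8 GPUs, each GPU would see 256//8 = 32 samples per forward pass, not 256.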


@ptrblck, if nn.DataParallel is used and the batch size is 256 on an experimental single-GPU machine, should the batch size be changed to n*256 when moving to a machine with n GPUs (same type of GPU)?
What about if DistributedDataParallel is used?

I would generally recommend to use DistributedDataParallel, even for a single machine with multiple GPUs.

Yes, to get a potential performance increase, you should try to scale the batch size with the number of GPUs.
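A rough sketch of what scaling means for the two wrappers, assuming the usual setups: nn.DataParallel runs in one process with one DataLoader whose batch is split across the GPUs, while DistributedDataParallel runs one process per GPU, each with its own DataLoader. The function name loader_batch_size is hypothetical.

```python
def loader_batch_size(per_gpu_batch, num_gpus, wrapper):
    """DataLoader batch_size needed so every GPU sees `per_gpu_batch` samples.

    Hypothetical helper; assumes one process per GPU for DistributedDataParallel.
    """
    if wrapper == "DataParallel":
        # One process, one DataLoader: the loaded batch is split across GPUs.
        return per_gpu_batch * num_gpus
    if wrapper == "DistributedDataParallel":
        # One process and one DataLoader per GPU: each loads its own batch.
        return per_gpu_batch
    raise ValueError(f"unknown wrapper: {wrapper}")


for name in ("DataParallel", "DistributedDataParallel"):
    print(name, loader_batch_size(256, 4, name))
```

In both cases the effective batch per optimizer step is per_gpu_batch * num_gpus; only the number passed to the DataLoader differs.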


Hello ptrblck, are the chunks equal on each specified GPU? If not, is there any solution to make the chunks for different GPUs more flexible?

No, each GPU should get a part of the batch without repetition.

I’m not sure I understand this question properly. Could you explain a bit what you mean by “more flexible”?

Sorry for the confusion. I mean, could we give a larger chunk to one GPU and a smaller one to another, because the available memory is not the same on all GPUs at a given time? Thanks for your reply!

That should generally be possible, e.g. by using parallel_apply as the base method and changing the splits internally, or by creating a manual data parallel approach.

However, I’m really unsure whether you’ll see any significant speedup and how large the workload would be.
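As a starting point for the manual approach, the uneven split sizes could be derived from the free memory on each GPU. The sketch below is plain Python; the function name and the memory figures are invented for illustration, and splitting in proportion to free memory is only one possible heuristic. The resulting list could then be passed as the sections argument of torch.split to cut the batch along dim 0.

```python
def proportional_split_sizes(batch_size, free_memory_gib):
    """Split `batch_size` into per-GPU chunk sizes proportional to free memory.

    Hypothetical helper: the heuristic is an assumption, not a PyTorch feature.
    """
    total = sum(free_memory_gib)
    # Round each share down first, then hand out the leftover samples.
    sizes = [int(batch_size * mem / total) for mem in free_memory_gib]
    remainder = batch_size - sum(sizes)
    # Give leftover samples to the GPUs with the most free memory.
    order = sorted(range(len(sizes)), key=lambda i: free_memory_gib[i], reverse=True)
    for i in order[:remainder]:
        sizes[i] += 1
    return sizes


# Made-up example: GPU 0 has 12 GiB free, GPU 1 has 4 GiB free.
print(proportional_split_sizes(256, [12.0, 4.0]))  # [192, 64]
```

Each chunk would still have to be moved to its device and run through a replica of the model, which is the part that parallel_apply handles.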


I am having a similar issue here. If I use 1 GPU with batch_size 128, the job works fine. When I use 2 GPUs with batch_size 256 I am getting this error:
‘RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 1; 15.75 GiB total capacity…’

I am using DistributedDataParallel as per below:

model = nn.parallel.DistributedDataParallel(model, device_ids=list(range(n_gpu))[::-1], find_unused_parameters=True)

and I am running the job with : python3 -m torch.distributed.launch main.py

Thanks in advance

I guess the default device (GPU0) might run out of memory, as your workflow seems to be close to nn.DataParallel as described here.

Could you try to use the recommended use case of one device/replica per DDP process as described here?


Can you provide a short example of how to do that? This does not help with how to use DistributedDataParallel.
Using DataParallel alone is easy, but then it has the problem that others mentioned, i.e. if one GPU can handle a batch size of 256, using DataParallel with 2 GPUs does not let us use 2*256.

This and this tutorial should give you a good start.
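For orientation, here is a minimal sketch of the one-device-per-process pattern, not a drop-in replacement for the training script above. The model, the random dataset, the port, and the function name run are placeholders chosen for illustration. Each process wraps the model with a single device id and loads its own batch, so per_gpu_batch_size is per process and 2 GPUs at 128 each give an effective batch of 256. The use_cuda=False path (gloo backend on CPU) is only there so the sketch can be tried without GPUs.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def run(rank, world_size, per_gpu_batch_size=128, use_cuda=True):
    """Train for one epoch in a single DDP process; returns the step count."""
    # Placeholder rendezvous settings for a single machine.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    backend = "nccl" if use_cuda else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    if use_cuda:
        # This process only ever touches GPU `rank`.
        torch.cuda.set_device(rank)
        device = torch.device("cuda", rank)
        device_ids = [rank]
    else:
        device = torch.device("cpu")
        device_ids = None

    model = nn.Linear(10, 2).to(device)  # placeholder model
    ddp_model = DDP(model, device_ids=device_ids)

    # Placeholder data; the sampler gives each process a distinct shard.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    sampler.set_epoch(0)  # reshuffle per epoch when training for several
    steps = 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(ddp_model(inputs), targets)
        loss.backward()  # gradients are averaged across processes here
        optimizer.step()
        steps += 1

    dist.destroy_process_group()
    return steps
```

One way to start it on a machine with n GPUs is torch.multiprocessing.spawn(run, args=(n,), nprocs=n), which calls run(rank, n) once per process. When using a launcher instead, it has to be told to start one process per GPU, and the rank would then come from the environment rather than from a function argument.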