If I set batch-size to 256 and use all of the GPUs on my system (let's say I have 8), will each GPU get a batch of 256, or will it get 256 // 8?
If my memory serves me correctly, in Caffe all GPUs would get the same batch-size, i.e. 256, and the effective batch-size would be 8 * 256, with 8 being the number of GPUs and 256 the per-GPU batch-size.
Is the outcome/answer any different when using .cuda() (with no parameters) versus torch.nn.DataParallel(model, device_ids=args.gpus) (specifying which GPU ids to use)?
nn.DataParallel splits the data along the batch dimension so that each specified GPU will get a chunk of the batch. If you just call .cuda() (or the equivalent .to() call), you will push the tensor or parameters onto the specified single device.
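As an illustration of that split, the chunking nn.DataParallel performs along dim 0 can be reproduced on CPU with torch.chunk (sizes only; the actual scatter also moves each chunk to its device):

```python
import torch

batch = torch.randn(256, 10)           # a batch of 256 samples
chunks = torch.chunk(batch, 8, dim=0)  # the same split DataParallel's scatter does along dim 0
print([c.size(0) for c in chunks])     # 8 chunks of 32 samples each
```

So with batch_size=256 and 8 GPUs, each replica sees 32 samples per forward pass.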
@ptrblck, if nn.DataParallel is used and the batch size is 256 on a single-GPU machine, when moving to a machine with n GPUs (of the same GPU type), should the batch size be changed to n * 256?
How about if DistributedDataParallel is used?
Sorry for the confusion. I mean: could we give more chunks to one GPU and fewer to another, since the available memory is not the same for all GPUs at a given time? Thanks for your reply!
That should generally be possible, e.g. by using parallel_apply as the base method and changing the splits internally, or by creating a manual data-parallel approach.
However, I'm really unsure if you'll see any significant speedup, and how large the workload would be.
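A rough sketch of that idea, built on the real torch.nn.parallel primitives (replicate, parallel_apply, gather). The helper names proportional_chunk_sizes and uneven_data_parallel are made up for this example, and the weights you would derive from per-GPU free memory are an assumption; this is untested on multi-GPU hardware:

```python
import torch
from torch.nn.parallel import replicate, parallel_apply, gather

def proportional_chunk_sizes(batch_size, weights):
    # Split batch_size into chunks proportional to weights
    # (e.g. free memory per GPU); remainder goes to the first chunk.
    total = sum(weights)
    sizes = [batch_size * w // total for w in weights]
    sizes[0] += batch_size - sum(sizes)
    return sizes

def uneven_data_parallel(module, inputs, chunk_sizes, devices):
    # Manual data parallelism with uneven splits along the batch dim.
    splits = torch.split(inputs, chunk_sizes, dim=0)
    splits = [s.to(d, non_blocking=True) for s, d in zip(splits, devices)]
    replicas = replicate(module, devices)
    outputs = parallel_apply(replicas, [(s,) for s in splits], devices=devices)
    return gather(outputs, devices[0], dim=0)
```

For example, proportional_chunk_sizes(256, [3, 5]) would send 96 samples to the first GPU and 160 to the second.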
Hi,
I am having a similar issue here. If I use 1 GPU with batch_size 128, the job works fine. When I use 2 GPUs with batch_size 256 I am getting this error:
"RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 1; 15.75 GiB total capacity…"
I am using DistributedDataParallel as per below:
"torch.distributed.init_process_group(backend='nccl')
model = nn.parallel.DistributedDataParallel(model, device_ids=list(range(n_gpu))[::-1], find_unused_parameters=True)"
and I am running the job with: python3 -m torch.distributed.launch main.py
Can you provide a short example of how to do that? This does not help with how to use DistributedDataParallel.
Using DataParallel alone is easy, but then it has the problem others mentioned, i.e. if one GPU can run a batch of 256, using DataParallel on 2 GPUs does not let us use 2 * 256.
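For reference, a minimal per-process DDP sketch, assuming the launcher (torch.distributed.launch or torchrun) sets LOCAL_RANK for each process; the model and dataset here are placeholders, and each process should pass only its own device in device_ids:

```python
import os
import torch
import torch.nn as nn

def main():
    # One process per GPU; the launcher sets LOCAL_RANK for each process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(backend="nccl")

    model = nn.Linear(10, 10).cuda(local_rank)  # placeholder model
    # Each process drives exactly one device.
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    dataset = torch.utils.data.TensorDataset(torch.randn(1024, 10))
    # DistributedSampler gives each process a disjoint shard of the data,
    # so batch_size here is the per-GPU batch size (e.g. 128 per GPU,
    # not 256 split across GPUs).
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    loader = torch.utils.data.DataLoader(dataset, batch_size=128, sampler=sampler)

    for (x,) in loader:
        out = model(x.cuda(local_rank))
```

Run with e.g. `torchrun --nproc_per_node=2 main.py`. Note that with DDP the batch_size you pass to the DataLoader is per process, so the effective batch size is n_gpus * batch_size.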