Specifying GPU to use with DataParallel

I have a DataParallel model that has been sent to the GPUs via .to('cuda'). I also have some processes calling this model in parallel at various points. It seems like because these are forward passes of batch size 1, they are automatically allocated to CUDA:0, which results in disproportionately high GPU utilization on that device.

How do I specify which GPU is used in a forward pass? I don’t want to have to do any sending of parameters / state dicts. Thanks.

I do not think there is any option/parameter to tell DataParallel which GPU to use for a particular inference/forward call. If you are only doing inference, wouldn’t it be easier to maintain a copy of the model on each GPU manually (model.to(device)) instead of using DataParallel?

Yes, I’m only doing inference. I thought that if we had model_new=model.to('cuda:1'), then after an update to model, the parameters wouldn’t be synced. Is that not right? Thanks.

If you are only doing inference in .eval() mode (not .train() mode), there is no need for parameter sync, is there?

I thought that if you made a model model_new = model.to('cuda:1'), and then updated the parameters of model with model_optimizer.step(), then the parameters of model_new would be out of sync / differ from model?
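One detail worth checking here: for an nn.Module, .to() moves the parameters in place and returns the same module object, so model_new = model.to('cuda:1') does not create a second model at all. A separate per-device copy needs something like copy.deepcopy, and that copy does go stale after an update to the original. Below is a minimal sketch of this, assuming a toy nn.Linear as a stand-in for the real model and falling back to the CPU when a second GPU is not present; the in-place add_ only simulates an optimizer step.

```python
import copy

import torch
import torch.nn as nn

# Toy stand-in for the real model (assumption for illustration).
model = nn.Linear(4, 2)

# Use cuda:1 when at least two GPUs are visible, otherwise stay on the CPU.
device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

# nn.Module.to() moves parameters in place and returns the same object.
same = model.to(device)
is_same_object = same is model

# An independent copy has its own parameter tensors.
replica = copy.deepcopy(model)

# Simulate an optimizer step on the source model.
with torch.no_grad():
    model.weight.add_(1.0)

# The replica still holds the old weights.
replica_is_stale = not torch.equal(replica.weight, model.weight)

# Copy the updated weights over explicitly to bring it back in sync.
replica.load_state_dict(model.state_dict())
replica_in_sync = torch.equal(replica.weight, model.weight)

print(is_same_object, replica_is_stale, replica_in_sync)
```

So the concern about stale parameters applies to real copies (deepcopy, or a freshly constructed model), not to the return value of model.to(...).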

According to the tutorial, DataParallel splits “data” automatically across available GPUs. I’m pretty sure it only works on batches, so you need batches of more than 1 sample; otherwise it would (a) make no sense to split the data and (b) be very inefficient due to synchronization…

Do you see any usage on GPUs other than the first one? If you have batch sizes of 1, nothing would be split across GPUs and only CUDA:0 would be used.

Indeed, a batch of size 1 with DataParallel goes to the first specified device (or defaults to cuda:0).
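To see why a single sample never reaches the other devices, it helps to look at how a batch gets divided along dimension 0. The sketch below uses torch.chunk on a CPU tensor as a rough analogy for the scatter step; it is an illustration only, not DataParallel's actual code path, and chunk_sizes is a hypothetical helper name.

```python
import torch


def chunk_sizes(batch_size, num_devices):
    """Return the per-chunk batch sizes when splitting along dim 0."""
    batch = torch.zeros(batch_size, 4)  # 4 features, arbitrary for the demo
    return [c.shape[0] for c in torch.chunk(batch, num_devices, dim=0)]


print(chunk_sizes(1, 4))  # [1]: one chunk, so only one device gets work
print(chunk_sizes(8, 4))  # [2, 2, 2, 2]: every device gets two samples
```

With only one chunk there is only one device to run on, which matches the observation that all the load ends up on the first GPU.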

If you want to do inference with batch size 1, there is no need to use nn.DataParallel. It is useful only if you have a much larger batch that you want to have split automatically and run on multiple GPUs. If you want to manually balance batches of size 1, you’re going to have to copy the model yourself and round-robin over the copies.

You’re right that the weights are not automatically updated if the source model is updated, because they are different tensors at that point. You’ll have to refresh the per-device models every time after running an optimizer on the single source model.

In fact, this is exactly how nn.DataParallel works under the covers. On every call to forward it replicates a single module over N devices, scatters the input, runs forward on every one of them, and gathers the outputs. This repeats for every iteration.
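The manual approach described above could be sketched roughly as follows. RoundRobinReplicas and available_devices are hypothetical names made up for this example, the toy nn.Linear stands in for the real model, and the code falls back to a single CPU replica when no GPU is visible. If the source model is wrapped in nn.DataParallel, the underlying network is available as model.module.

```python
import copy
import itertools

import torch
import torch.nn as nn


def available_devices():
    """All visible GPUs, or just the CPU when CUDA is unavailable."""
    if torch.cuda.is_available():
        return [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
    return [torch.device("cpu")]


class RoundRobinReplicas:
    """Keeps one copy of `source` per device and cycles through the copies."""

    def __init__(self, source, devices):
        self.source = source
        self.devices = list(devices)
        # deepcopy first so the source model itself is not moved.
        self.replicas = [copy.deepcopy(source).to(d).eval() for d in self.devices]
        self._next = itertools.cycle(range(len(self.replicas)))

    def sync(self):
        """Copy the source weights into every replica (call after optimizer.step())."""
        state = self.source.state_dict()
        for replica in self.replicas:
            replica.load_state_dict(state)

    @torch.no_grad()
    def __call__(self, x):
        i = next(self._next)
        # The output stays on the device of the replica that produced it.
        return self.replicas[i](x.to(self.devices[i]))


source = nn.Linear(4, 2)  # toy stand-in for the real model
pool = RoundRobinReplicas(source, available_devices())
out = pool(torch.randn(1, 4))  # each call goes to the next device in turn
```

This is a single-process sketch. Separate processes would each hold their own cycle counter, so cross-process balancing would need a shared counter or a fixed device assignment per process.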