I ran into a problem with multi-GPU training. Specifically, I'd like to keep the supernet in CPU memory and optimize only one of its branches on the GPU in each iteration (the whole supernet is too big: loading all of its parameters would cost a lot of GPU memory and reduce GPU throughput). However, nn.DataParallel cannot pass the parameters along when I combine model = nn.DataParallel(model, device_ids=engine.devices) with model.to('cpu'), because parameters cannot be broadcast while the model lives on the CPU. Is there a better way to implement this?
Thanks a lot in advance.
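For concreteness, here is a minimal sketch of the setup described above, without nn.DataParallel: the supernet stays on the CPU, and only the sampled branch is moved to the GPU for one step. ToySupernet, train_one_branch, the layer sizes, and the random branch sampling are all hypothetical placeholders rather than the actual model, and the code falls back to the CPU when no GPU is present.

```python
import random

import torch
import torch.nn as nn


class ToySupernet(nn.Module):
    """Hypothetical stand-in for the supernet: a list of independent branches."""

    def __init__(self, num_branches=4, dim=16):
        super().__init__()
        self.branches = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
                for _ in range(num_branches)
            ]
        )


def train_one_branch(supernet, idx, x, y, device, lr=0.1):
    """Move one branch to `device`, take a single SGD step, move it back."""
    branch = supernet.branches[idx]
    branch.to(device)  # only this branch occupies device memory
    # Plain SGD keeps no per-parameter state, so it can be rebuilt each step.
    opt = torch.optim.SGD(branch.parameters(), lr=lr)
    loss = nn.functional.mse_loss(branch(x.to(device)), y.to(device))
    opt.zero_grad()
    loss.backward()
    opt.step()
    branch.to("cpu")  # return the updated weights to CPU memory
    return loss.item()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
supernet = ToySupernet()  # the whole supernet is created on the CPU
x, y = torch.randn(8, 16), torch.randn(8, 1)  # dummy batch

for step in range(3):
    idx = random.randrange(len(supernet.branches))  # sample one branch
    loss = train_one_branch(supernet, idx, x, y, device)
```

With a stateful optimizer (momentum, Adam), the optimizer state would also have to follow the branch between devices, which this sketch does not handle.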
I'm not sure I follow. DataParallel is not compatible with CPU training. You can either manually assign each submodule to a GPU or use DataParallel.
Combining DataParallel with the CPU is not worth it, because the CPU will bottleneck the whole process. It will be faster to put as much as possible on the GPUs and as little as possible on the CPU, and to run those parts sequentially.
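A minimal sketch of manually assigning submodules to devices could look like the following. TwoDeviceNet and its layer sizes are made up for illustration, and the device selection falls back to a single GPU or the CPU when fewer than two GPUs are available.

```python
import torch
import torch.nn as nn


class TwoDeviceNet(nn.Module):
    """Hypothetical model whose two halves live on (possibly) different devices."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.part1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to(dev0)
        self.part2 = nn.Linear(32, 1).to(dev1)

    def forward(self, x):
        # The halves run one after the other; activations are copied between devices.
        h = self.part1(x.to(self.dev0))
        return self.part2(h.to(self.dev1))


n_gpu = torch.cuda.device_count()
dev0 = torch.device("cuda:0") if n_gpu >= 1 else torch.device("cpu")
dev1 = torch.device("cuda:1") if n_gpu >= 2 else dev0

model = TwoDeviceNet(dev0, dev1)
out = model(torch.randn(8, 16))  # the output lives on dev1
```

The loss has to be computed on the device that holds the output (dev1 here), so the targets need to be moved there as well.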
Thanks for your reply! What I actually want is to update a model with a very large number of parameters, and I found that nn.DataParallel() spends a lot of time copying the model across GPUs. So I wondered whether the parameters could be updated on the CPU using multiprocessing, with the GPUs computing the gradients in parallel, but that turned out to be very time-consuming as well. Switching to torch.distributed solved it: training is now up to 4 times faster than with nn.DataParallel(), because torch.distributed does not copy all the parameters across GPUs and re-replicate the model in every iteration.
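For anyone landing here later, a minimal sketch of the torch.distributed approach using DistributedDataParallel might look like this. It is not the exact training code from this thread: train_worker, the file-based rendezvous, the tiny nn.Linear model, and the random data are all assumptions for illustration. Each process keeps its own model replica for the whole run, and only gradients are synchronized during backward().

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train_worker(rank, world_size, init_file, steps=3):
    """Hypothetical per-process entry point: one process per GPU (or CPU fallback)."""
    use_cuda = torch.cuda.is_available() and torch.cuda.device_count() >= world_size
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",
        init_method=f"file://{init_file}",  # shared file used for rendezvous
        rank=rank,
        world_size=world_size,
    )
    device = torch.device(f"cuda:{rank}") if use_cuda else torch.device("cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    model = nn.Linear(16, 1).to(device)  # placeholder for the real model
    # Parameters are broadcast from rank 0 once here, not in every iteration.
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    torch.manual_seed(rank)  # each rank draws its own dummy data
    losses = []
    for _ in range(steps):
        x = torch.randn(8, 16, device=device)
        y = torch.randn(8, 1, device=device)
        loss = nn.functional.mse_loss(ddp_model(x), y)
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced across processes here
        opt.step()
        losses.append(loss.item())

    dist.destroy_process_group()
    return losses
```

One way to launch it is torch.multiprocessing.spawn(train_worker, args=(world_size, init_file), nprocs=world_size), where init_file is a path that does not exist yet and is visible to all processes; spawn passes the process index as the first argument, which serves as the rank.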