PCI-e communication latency, transfer overheads, and weight sync on CPU mean that small models don’t benefit from multi-GPU training.
Can you try large image models like ResNet training from examples repo and check if they saturate both GPUs?
@bottanski @magic282 I am observing the same. The model is actually not sped up at all in my scenario.
Hi!
I am also observing that using DataParallel() provides no speedup with multiple GPUs. In our case we are using a deepspeech implementation in PyTorch.
For the 5-layer bi-directional GRU (38 million parameters, most of which are in the GRU layers) it takes 18 minutes and 6 seconds on 2 GPUs and 18 minutes and 34 seconds on 1 GPU (70 epochs on the smaller AN4 dataset). For 1 GPU we used a batch size of 20, whereas for 2 GPUs we used a batch size of 40.
Are there any suggestions on how to get faster times with multiple GPUs?
I can provide more data or run more experiments if that is helpful!
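For reference, here is roughly how we wrap the model; the layer sizes below are placeholders for illustration, not the actual deepspeech.pytorch model. With no GPUs available, nn.DataParallel just runs the wrapped module directly, so the sketch runs anywhere.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the bi-directional GRU stack (sizes are
# placeholders, much smaller than the real 38M-parameter model).
model = nn.GRU(input_size=161, hidden_size=64, num_layers=2,
               bidirectional=True, batch_first=True)
model = nn.DataParallel(model)  # splits the batch across available GPUs
if torch.cuda.is_available():
    model = model.cuda()

# With 2 GPUs, a global batch of 40 gives each replica a batch of 20,
# i.e. the same per-GPU work as the single-GPU batch-20 run.
x = torch.randn(40, 50, 161)
if torch.cuda.is_available():
    x = x.cuda()
out, _ = model(x)
print(out.shape)  # torch.Size([40, 50, 128]) since 2 directions * hidden 64
```

Note that because we doubled the batch size along with the GPU count, per-GPU compute is unchanged; any wall-clock difference comes from the scatter/replicate/gather overhead.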
To @smth @ngimel @apaszke @mattmacy
I wonder if you could help me clear up my confusion about the PyTorch multi-GPU acceleration issue:
Here is what @smth mentioned about how PyTorch training works in multi-GPU mode:
in forward:
…
gather output mini-batch from GPU1, GPU2 onto GPU1
in backward:
scatter grad_output and input
…
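The quoted forward steps correspond roughly to the following sketch. This is a hypothetical simplification of nn.DataParallel.forward (the real implementation also scatters kwargs and handles device placement), using the public helpers from torch.nn.parallel:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import gather, parallel_apply, replicate, scatter

def data_parallel_forward(module, input, device_ids):
    """Simplified sketch of nn.DataParallel.forward (assumption, not the
    exact source)."""
    if len(device_ids) <= 1:
        # no parallelism available: just run the module directly
        return module(input)
    inputs = scatter(input, device_ids)        # split mini-batch across GPUs
    replicas = replicate(module, device_ids)   # copy the model onto each GPU
    outputs = parallel_apply(replicas, inputs) # per-GPU forward passes
    return gather(outputs, device_ids[0])      # collect outputs onto GPU 0

# On a CPU-only machine, an empty device_ids list takes the fallback path:
model = nn.Linear(3, 2)
out = data_parallel_forward(model, torch.randn(5, 3), [])
print(out.shape)  # torch.Size([5, 2])
```

The backward half (scatter grad_output, parallel_apply backward, reduce gradients) is driven by autograd through the scatter/gather nodes rather than written out explicitly.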
However, my personal thought is that:
in a back-propagation algorithm running on N GPUs, after you finish N forward calculations on the N GPUs, you do not have to gather the forward outputs onto GPU 0, calculate the loss on GPU 0, and then scatter the grad_output and input back to the N GPUs before the subsequent backward calculation. I got confused trying to understand why you do not do what I describe in the following, which might speed up PyTorch multi-GPU training:
in forward:
scatter mini-batch to GPU1, GPU2
replicate model on GPU2 (it is already on GPU1)
model_gpu1(input_gpu1), model_gpu2(input_gpu2) (this step is parallel_apply)
in backward:
parallel_apply model’s backward pass
reduce GPU2 replica’s gradients onto GPU1 model
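The idea above (compute the loss per replica and run backward locally, then reduce gradients, instead of gathering outputs onto one device) can be illustrated on CPU with an explicit second copy of the model. This is only a toy sketch of the proposal, not a DataParallel API; on real GPUs the two forward/backward passes would run via parallel_apply:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)                 # primary copy ("GPU 0")
replica = copy.deepcopy(model)          # stands in for the GPU 1 replica
criterion = nn.MSELoss()

x, y = torch.randn(8, 4), torch.randn(8, 2)
chunks_x, chunks_y = x.chunk(2), y.chunk(2)   # scatter the mini-batch

# Per-replica forward AND backward: no gather of outputs, the loss is
# computed where the outputs already live.
criterion(model(chunks_x[0]), chunks_y[0]).backward()
criterion(replica(chunks_x[1]), chunks_y[1]).backward()

# Reduce: accumulate the replica's gradients onto the primary copy.
with torch.no_grad():
    for p, pr in zip(model.parameters(), replica.parameters()):
        p.grad += pr.grad
```

After the reduce, model holds the combined gradient of both half-batches, without the loss ever being computed on a single gathered output tensor.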
I have read the code here, and I understand it is working as @smth mentioned.
It seems to me that the forward function here has a fixed structure of ‘scatter-calculate-gather’ (or ‘map-then-reduce’). It cannot do ‘scatter-forward calculate-backward calculate-gather’. So the current PyTorch 0.4.0 does not allow the procedure I mentioned above, which I think might further accelerate PyTorch. Am I wrong about that?
P.S.: I investigated this because I noticed that PyTorch multi-GPU acceleration does not work very well for me. E.g. with 2 GPUs, I only get a 5%-30% speedup when trying different model topologies.
Any thoughts?
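When measuring speedups like the 5%-30% figure above, it is worth timing carefully: CUDA kernels launch asynchronously, so without synchronization the clock mostly measures launch overhead. A rough harness (model and sizes are placeholders; it falls back to CPU when no GPU is present):

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder model, just for the timing pattern.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

x = torch.randn(64, 256, device=device)
y = torch.randint(0, 10, (64,), device=device)
criterion = nn.CrossEntropyLoss()

for _ in range(3):  # warm-up (CUDA context init, cuDNN autotuning)
    criterion(model(x), y).backward()

if device == "cuda":
    torch.cuda.synchronize()  # drain pending kernels before starting the clock
start = time.time()
for _ in range(10):
    criterion(model(x), y).backward()
if device == "cuda":
    torch.cuda.synchronize()  # wait for the last backward to finish
elapsed = time.time() - start
print(f"{elapsed / 10:.4f} s per iteration")
```

Comparing this per-iteration time for 1 GPU at batch B against 2 GPUs at batch 2B isolates the scatter/replicate/gather overhead from the compute itself.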
Here is a solution for data parallelism: http://hangzh.com/PyTorch-Encoding/_modules/encoding/parallel.html#ModelDataParallel . I read it and tried a small net using its code, but I get garbage on the backward step.
Hello.
I am curious about the implementation of gradient accumulation, but I can’t find the code that does this job. Can you give a hint?
Thanks!