When training a segmentation task on 4 GPUs, the memory usage of the first GPU is much higher than that of the others. Any thoughts? Thanks!
It may be buffers used by the optimizer, such as the momentum buffers, whose size equals that of the model parameters.
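A quick way to check (a minimal sketch with a toy model; the layer sizes are placeholders): SGD allocates its momentum buffers lazily on the first step, and each buffer matches the shape of the parameter it tracks, so you can sum them directly:

```python
import torch

# Hypothetical toy model just to illustrate; any module behaves the same way.
model = torch.nn.Linear(1024, 1024).cuda(0)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# SGD creates its momentum buffers lazily, on the first step().
out = model(torch.randn(8, 1024, device="cuda:0"))
out.sum().backward()
optimizer.step()

# Each momentum buffer has the same shape as the parameter it tracks,
# so the optimizer state roughly doubles the parameter memory on that device.
state_bytes = sum(
    buf.numel() * buf.element_size()
    for state in optimizer.state.values()
    for buf in state.values()
    if torch.is_tensor(buf)
)
print(f"optimizer state: {state_bytes / 1024**2:.2f} MiB")
```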
Maybe that’s the case. Thanks!
I found this:
It seems it always gathers the outputs to the first GPU. Is this only a temporary solution?
Do you want to retain the outputs on their respective GPUs, without them being gathered back onto a particular GPU? If so, you want to do the parallel_apply yourself and avoid the gather. These are primitives under nn.parallel; DataParallel is effectively the composition scatter + parallel_apply + gather.
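For example, here is a rough sketch of doing this by hand (the device ids and toy model are placeholders; note that the model also has to be copied onto each device with replicate, which DataParallel does internally):

```python
import torch
from torch.nn.parallel import replicate, scatter, parallel_apply

device_ids = [0, 1, 2, 3]
model = torch.nn.Linear(128, 10).cuda(device_ids[0])  # model lives on GPU 0
batch = torch.randn(64, 128)

inputs = scatter(batch, device_ids)          # split the batch across GPUs
replicas = replicate(model, device_ids)      # copy the model onto each GPU
outputs = parallel_apply(replicas, inputs)   # run the forward pass per GPU
# No gather(outputs, target_device) here: each outputs[i] stays on cuda:i.
```

Since each element of outputs stays on its own device, you can compute the loss per GPU instead of concentrating everything on cuda:0.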
I figured it out, exactly as you said: https://github.com/pytorch/pytorch/issues/1893
What is the advantage or benefit of not gathering the outputs to a single GPU? Or does it have any disadvantages?