When training a segmentation task on 4 GPUs, the memory usage of the first GPU is much higher than that of the others. Any thoughts? Thanks!
It may be buffers used by the optimizer, such as the momentum buffers, whose size equals that of the model parameters.
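A quick way to check (a minimal sketch with a toy model; the layer sizes are placeholders): SGD allocates its momentum buffers lazily on the first step, and each buffer matches the shape of the parameter it tracks, so you can sum them directly:

```python
import torch

# Hypothetical toy model just to illustrate; any module behaves the same way.
model = torch.nn.Linear(1024, 1024).cuda(0)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# SGD creates its momentum buffers lazily, on the first step().
out = model(torch.randn(8, 1024, device="cuda:0"))
out.sum().backward()
optimizer.step()

# Each momentum buffer has the same shape as the parameter it tracks,
# so the optimizer state roughly doubles the parameter memory on that device.
state_bytes = sum(
    buf.numel() * buf.element_size()
    for state in optimizer.state.values()
    for buf in state.values()
    if torch.is_tensor(buf)
)
print(f"optimizer state: {state_bytes / 1024**2:.2f} MiB")
```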
Maybe that’s the case. Thanks!
I found this:
It seems it always gathers the outputs to the first GPU. Is this only a temporary solution?
Do you want to retain the outputs on their respective GPUs, without them being gathered back onto a particular GPU? If so, you want to do the parallel_apply yourself and avoid the gather. These are primitives under nn.parallel; DataParallel is effectively the composition scatter + parallel_apply + gather.
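For example, here is a rough sketch of doing this by hand (the device ids and toy model are placeholders; note that the model also has to be copied onto each device with replicate, which DataParallel does internally):

```python
import torch
from torch.nn.parallel import replicate, scatter, parallel_apply

device_ids = [0, 1, 2, 3]
model = torch.nn.Linear(128, 10).cuda(device_ids[0])  # model lives on GPU 0
batch = torch.randn(64, 128)

inputs = scatter(batch, device_ids)          # split the batch across GPUs
replicas = replicate(model, device_ids)      # copy the model onto each GPU
outputs = parallel_apply(replicas, inputs)   # run the forward pass per GPU
# No gather(outputs, target_device) here: each outputs[i] stays on cuda:i.
```

Since each element of outputs stays on its own device, you can compute the loss per GPU instead of concentrating everything on cuda:0.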
I figured it out, exactly as you said: https://github.com/pytorch/pytorch/issues/1893
What is the advantage or benefit of not gathering the outputs to a single GPU? Or does it have any disadvantages?