Multi-GPU training, memory usage imbalance

When training a segmentation task on 4 GPUs, the memory usage of the first GPU is much larger than that of the others. Any thoughts? Thanks!
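For reference, the imbalance can be seen with something like this (a minimal sketch, assuming CUDA is available and the GPUs are all visible to the process):

```python
import torch

# Print the memory currently allocated by tensors on each visible GPU.
for i in range(torch.cuda.device_count()):
    mib = torch.cuda.memory_allocated(i) / 1024 ** 2
    print(f"cuda:{i}: {mib:.1f} MiB allocated")
```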

1 Like

Maybe it is some buffers used by the optimizer, such as the momentum buffers, whose sizes equal those of the model parameters.
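As a quick sanity check of that hypothesis, here is a sketch (the nn.Linear model is just a placeholder) showing that SGD with momentum keeps one buffer per parameter, the same size as the parameter and on the same device, which under DataParallel is typically cuda:0:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda(0)  # placeholder model on the first GPU
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

out = model(torch.randn(8, 1024, device="cuda:0"))
out.sum().backward()
opt.step()  # momentum buffers are created on the first step

for p, state in opt.state.items():
    buf = state["momentum_buffer"]
    # each buffer matches its parameter's shape and lives on cuda:0
    print(tuple(p.shape), tuple(buf.shape), buf.device)
```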

3 Likes

Maybe that’s the case. Thanks!

I found this:

It seems it always gathers the outputs to the first GPU. Is there a temporary solution for this?

Do you want to retain the outputs on their respective GPUs without gathering them back onto a particular GPU?
If so, you want to do the scatter and parallel_apply yourself and avoid the gather. These are primitives under nn.parallel. DataParallel is effectively the composition scatter + parallel_apply + gather.
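To make that concrete, a minimal sketch of those primitives, assuming 4 visible GPUs and a toy nn.Linear model (both are just placeholders for your setup):

```python
import torch
import torch.nn as nn

device_ids = [0, 1, 2, 3]  # assumes 4 visible GPUs
model = nn.Linear(128, 10).cuda(device_ids[0])
inputs = torch.randn(64, 128, device="cuda:0")

# The three primitives DataParallel composes:
scattered = nn.parallel.scatter(inputs, device_ids)        # split the batch across GPUs
replicas = nn.parallel.replicate(model, device_ids)        # copy the module onto each GPU
outputs = nn.parallel.parallel_apply(replicas, scattered)  # run each replica on its chunk

# `outputs` is a list with one tensor per GPU, each still on its own device.
# DataParallel would now call nn.parallel.gather(outputs, device_ids[0]),
# which is the step that piles everything onto the first GPU; skipping it
# (e.g. computing the loss per device) keeps memory usage more balanced.
```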

4 Likes

Thanks smth!
I figured it out; it is exactly as you said: https://github.com/pytorch/pytorch/issues/1893

3 Likes

What is the advantage or benefit of not gathering the outputs to a single GPU? Or does it have some disadvantages?

1 Like