Multi-GPU training, memory usage imbalance

HANG_ZHANG · June 20, 2017, 12:19am

When training on 4 GPUs for a segmentation task, the memory usage of the first GPU is much larger than that of the others. Any thoughts? Thanks!

Cysu · June 20, 2017, 7:18am

Maybe it is some buffers used by the optimizer, such as the momentum buffers, whose size equals that of the model parameters.
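A quick way to check the claim about momentum buffers is to compare element counts after one optimizer step. This is only a minimal sketch that runs on CPU: the toy model and input shape are made up for illustration, and it assumes SGD with a non-zero momentum, which stores one `momentum_buffer` per parameter in `optimizer.state`.

```python
import torch
import torch.nn as nn

# Hypothetical toy model, used only for illustration.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 2, kernel_size=1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# The optimizer state is allocated lazily, so take one training step first.
x = torch.randn(4, 3, 16, 16)
loss = model(x).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

num_param_elems = sum(p.numel() for p in model.parameters())
num_momentum_elems = sum(
    optimizer.state[p]["momentum_buffer"].numel() for p in model.parameters()
)
# Each buffer is created on the same device as its parameter.
same_device = all(
    optimizer.state[p]["momentum_buffer"].device == p.device
    for p in model.parameters()
)
print(num_param_elems, num_momentum_elems, same_device)
```

Since the buffers live on the same device as the parameters they belong to, they end up wherever the original (non-replicated) model is stored.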

HANG_ZHANG · June 20, 2017, 6:15pm

Maybe that’s the case. Thanks!

HANG_ZHANG · June 23, 2017, 4:43am

I found this:

github.com/pytorch/pytorch/blob/master/torch/nn/parallel/data_parallel.py#L47

    # TODO: update notes/cuda.rst when this class handles 8+ GPUs well

    def __init__(self, module, device_ids=None, output_device=None, dim=0):
        super(DataParallel, self).__init__()

        if not torch.cuda.is_available():
            self.module = module
            self.device_ids = []
            return

        if device_ids is None:
            device_ids = list(range(torch.cuda.device_count()))
        if output_device is None:
            output_device = device_ids[0]
        self.dim = dim
        self.module = module
        self.device_ids = device_ids
        self.output_device = output_device

It seems that DataParallel always gathers the outputs onto the first GPU by default. Is this a temporary solution?
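To get a feeling for how much the gathered output alone can cost in a segmentation setting, here is a rough back-of-the-envelope sketch. All sizes (batch of 16, 21 classes, 512x512 output, float32, 4 GPUs) are hypothetical, and it only counts the output tensor itself, not activations or anything the loss allocates afterwards.

```python
def tensor_mib(shape, bytes_per_elem=4):
    """Size of a dense tensor in MiB (float32 by default)."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem / 2**20


# Hypothetical numbers for a segmentation model.
batch, classes, height, width = 16, 21, 512, 512
num_gpus = 4

# Each GPU produces the output for its own chunk of the batch.
per_gpu_output = tensor_mib((batch // num_gpus, classes, height, width))
# The gathered output for the whole batch lives only on output_device.
gathered_output = tensor_mib((batch, classes, height, width))

print(f"per-GPU chunk output: {per_gpu_output:.1f} MiB")
print(f"gathered output on output_device: {gathered_output:.1f} MiB")
```

Under these assumptions the gathered tensor is `num_gpus` times the size of a per-GPU chunk, and it exists only on `output_device`, which would fit the imbalance described above.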

smth · June 27, 2017, 9:16pm

Do you want to retain the outputs on their respective GPUs without gathering them back onto a particular GPU?
If so, you can call scatter and parallel_apply yourself and skip the gather. These are primitives under nn.parallel. DataParallel is effectively a composition of scatter + parallel_apply + gather.
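A minimal sketch of that idea, not the actual DataParallel implementation: the helper name `forward_without_gather`, the toy model, and the squared-mean loss are all made up for illustration. It also calls `replicate` from nn.parallel, because the module has to be copied to each GPU before `parallel_apply` can run it. With fewer than two GPUs it falls back to a plain forward pass so the snippet still runs.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import scatter, replicate, parallel_apply


def forward_without_gather(module, batch, device_ids):
    """Run `module` on one chunk of `batch` per GPU and return the
    per-device outputs as a list instead of gathering them."""
    if len(device_ids) < 2:
        # Zero or one device: nothing to parallelize, do a plain forward.
        device = torch.device("cuda", device_ids[0]) if device_ids else torch.device("cpu")
        return [module(batch.to(device))]
    # Split the batch along dim 0 and copy each chunk to its own GPU.
    chunks = scatter((batch,), device_ids)
    # Copy the module (which must live on device_ids[0]) to each GPU in use.
    replicas = replicate(module, device_ids[:len(chunks)])
    # Run the replicas in parallel; outputs[i] stays on the i-th device.
    return parallel_apply(replicas, chunks)


device_ids = list(range(torch.cuda.device_count()))
model = nn.Conv2d(3, 2, kernel_size=1)  # hypothetical toy model
if device_ids:
    model = model.cuda(device_ids[0])
batch = torch.randn(8, 3, 16, 16)

outputs = forward_without_gather(model, batch, device_ids)

# Compute a (placeholder) loss on the GPU that holds each output, then
# move only the scalar losses to one device before averaging.
losses = [out.pow(2).mean() for out in outputs]
total = sum(loss.to(losses[0].device) for loss in losses) / len(losses)
total.backward()
```

The point of the design is that only scalars cross devices at the end, so no single GPU has to hold the full-batch output.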

HANG_ZHANG · June 28, 2017, 5:46am

Thanks smth!
I figured it out, exactly as you said: https://github.com/pytorch/pytorch/issues/1893

ycszen · July 22, 2017, 11:39am

What is the advantage of not gathering the outputs onto a single GPU? Does it have any disadvantages?