How to balance the usage across multiple GPUs?

I’m using multiple GPUs for training, but I’ve found that my GPU usage is very uneven: GPU 0 is fully loaded, while the last one is nearly idle. I noticed this from their RAM usage in the picture. Could anyone explain this phenomenon and tell me how to fix it? Thanks!

GPU RAM Usage
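
For reference, a quick way to check per-GPU memory from PyTorch itself (a minimal sketch; it reports what the current process has allocated, so the numbers will not match nvidia-smi exactly):

import torch

# Report how much memory this process has allocated on each visible GPU.
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    allocated_mb = torch.cuda.memory_allocated(i) / 1024 ** 2
    print('GPU %d (%s): %.1f MB allocated' % (i, name, allocated_mb))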

Hi, check this blog post

I tried to use your modified DataParallel, but it failed.
It gives me the error: assert len(targets) == len(inputs)
I do not know where to start debugging.
My network returns two items: a list with tensors inside, and a tensor.
I also tried splitting the list before the network returns, but that failed too, as did putting the final tensor inside the list.
Can you help me? I followed the directions in your blog.

Try this one

It’s roughly the same

So, what should I do? Where and how should I modify the code?

Do you mean I should use the original nn.DataParallel instead of the method mentioned in the previous answer?
I also noticed that the final modification is

return torch.unsqueeze(loss,0)

Am I right?

I mean that that post discusses the imbalance, and that I provide working code to compute the loss inside DataParallel, as in the blog @Thomas_Wolf suggested.

I'm just saying that, in case the code provided there does not work for you, you can try mine.
I haven't checked the blog's code.

Yeah, I finally got it, thank you! By the way, what is the purpose of returning the loss? Why is it unsqueezed?

And how do I set which parameters are trained, and what about saving?
Should I use FullModel.parameters(), model.parameters(), or the parallel wrapper's parameters()?
Also, what exactly should I save?

The trick is that if you compute the loss outside DataParallel, the outputs are gathered on a single GPU. Only then is the loss calculated and backpropagated.

If you compute the loss inside DataParallel, instead of returning a huge amount of data (which would all be gathered on a single GPU), you just return a few floats.

The loss is unsqueezed because DataParallel works with batches, so it fails when it has to gather a zero-dimensional tensor. Zero-dimensional tensors are supported in recent versions of PyTorch, but they were not part of the early design.
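
As a rough illustration (a sketch of the idea, not DataParallel's actual internals): each replica returns a [1]-shaped loss, the gather step concatenates them along dim 0 on the output device, and you reduce the result to a scalar before calling backward:

import torch

# Hypothetical per-replica losses, one [1]-shaped tensor per GPU,
# as produced by the unsqueeze inside the wrapper.
per_replica_losses = [torch.tensor([0.7]), torch.tensor([0.9])]

# The gather step concatenates them into a [num_replicas]-shaped tensor.
gathered = torch.cat(per_replica_losses, dim=0)

# Reduce to a scalar (mean or sum) before backpropagating.
total_loss = gathered.mean()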

You can set up your model, optimizer, parameters, etcetera and then call the function. As you can see, it takes the model as input, so it's agnostic to the architecture.
When saving, I would save the base model.

If you check the state dict you will see a couple of extra prefixes, like datap.module.fullmodel.model,
so you can save datap.module.fullmodel.model.state_dict()

Look into that in more detail.
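
For instance, with the wrapper from the code further down (FullModel / DataParallel_withLoss / toy), the raw weights sit under .module.model of the DataParallel object. A minimal saving/loading sketch, with a placeholder file name:

import torch

# Assumes the FullModel / DataParallel_withLoss / toy definitions from the code below.
parallel_model = DataParallel_withLoss(toy(), torch.nn.L1Loss())

# Save only the base model's weights, skipping the DataParallel/FullModel wrappers.
torch.save(parallel_model.module.model.state_dict(), 'base_model.pth')

# Load them back into a plain, non-parallel instance of the model.
base_model = toy()
base_model.load_state_dict(torch.load('base_model.pth'))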

TypeError: _gather(): incompatible function arguments. The following argument types are supported:
    1. (tensors: List[at::Tensor], dim: int, destination_index: Optional[int]) -> at::Tensor

Invoked with: (tensor([8.9125], device='cuda:0', grad_fn=<UnsqueezeBackward0>), tensor([7.4709], device='cuda:1', grad_fn=<UnsqueezeBackward0>), tensor([6.6035], device='cuda:2', grad_fn=<UnsqueezeBackward0>), tensor([8.3252], device='cuda:3', grad_fn=<UnsqueezeBackward0>)), 0, [3]

It doesn't work so well…
My return is

return torch.unsqueeze(loss1+loss2,0)

because my loss is computed with two separate loss functions.
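
For what it's worth, here is a minimal sketch of a wrapper forward that combines two criteria (the loss1_fn / loss2_fn names are placeholders). Also, the traceback above shows _gather expecting destination_index: Optional[int] but receiving [3], which suggests output_device may have been passed as a list (e.g. [3]) instead of a plain int (e.g. 3):

import torch
import torch.nn as nn

class FullModelTwoLosses(nn.Module):
    # Hypothetical variant of FullModel that combines two criteria.
    def __init__(self, model, loss1_fn, loss2_fn):
        super(FullModelTwoLosses, self).__init__()
        self.model = model
        self.loss1_fn = loss1_fn
        self.loss2_fn = loss2_fn

    def forward(self, targets, *inputs):
        outputs = self.model(*inputs)
        loss1 = self.loss1_fn(outputs, targets)
        loss2 = self.loss2_fn(outputs, targets)
        # Combine first, then unsqueeze so each replica returns a [1]-shaped tensor.
        return torch.unsqueeze(loss1 + loss2, 0), outputs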

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import torch
import torch.nn as nn


class FullModel(nn.Module):
    """Wraps a model together with its criterion so the loss is computed inside DataParallel."""

    def __init__(self, model, loss):
        super(FullModel, self).__init__()
        self.model = model
        self.loss = loss

    def forward(self, targets, *inputs):
        outputs = self.model(*inputs)
        loss = self.loss(outputs, targets)
        # Unsqueeze so each replica returns a [1]-shaped tensor that gather() can concatenate.
        return torch.unsqueeze(loss, 0), outputs


def DataParallel_withLoss(model, loss, **kwargs):
    """Wraps model and loss in FullModel, then in DataParallel, and moves it to the GPU."""
    model = FullModel(model, loss)
    device_ids = kwargs.get('device_ids', None)
    output_device = kwargs.get('output_device', None)
    if 'cuda' in kwargs:
        cudaID = kwargs['cuda']
        model = torch.nn.DataParallel(model, device_ids=device_ids,
                                      output_device=output_device).cuda(cudaID)
    else:
        model = torch.nn.DataParallel(model, device_ids=device_ids,
                                      output_device=output_device).cuda()
    return model


class toy(nn.Module):
    """Minimal model: a learnable bias followed by a sigmoid."""

    def __init__(self):
        super(toy, self).__init__()
        self.bias = torch.nn.Parameter(torch.tensor([0.0]), requires_grad=True)
        self.sig = nn.Sigmoid()

    def forward(self, x):
        return self.sig(x + self.bias)


model = toy()
optimizer = torch.optim.SGD(model.parameters(), lr=1)
loss = torch.nn.L1Loss()
model = DataParallel_withLoss(model, loss)
gt = torch.tensor([0.0])
input = torch.tensor([0.0])
loss, _ = model(gt, input)  # the loss comes back already computed on the replicas
optimizer.zero_grad()
loss.backward()
optimizer.step()

This runs
(toy example)