How to balance usage across multiple GPUs?

I'm using multiple GPUs for training, but my GPU usage is inefficient: GPU 0 is fully loaded while the last one is nearly idle. I noticed this from their memory usage in the picture. Could anyone explain this behaviour and tell me how to fix it? Thanks!


Hi, check this blog post

I tried to use your modified DataParallel but it failed.
It gave me the error: assert len(targets) == len(inputs)
I do not know where to start debugging.
My network returns two items: a list of tensors and a tensor.
I also tried splitting the list before the network returns, as well as including the final tensor inside the list.
Can you help me? I followed the directions on your blog.

Try this one

It’s roughly the same

So, what should I do? Where and how should I modify the code?

Do you mean using the original nn.DataParallel instead of the method mentioned in the previous answer?
I also noticed that the final modification is

return torch.unsqueeze(loss,0)

Am I right?

I mean that the imbalance is discussed in that post, and I provide working code there to compute the loss inside DataParallel, as in the blog @Thomas_Wolf suggested.

I'm just saying that, in case the code provided there does not work for you, you can try mine.
I haven't checked the blog's code.

Yeah, I finally got it, thank you! By the way, what is the purpose of returning the loss? Why is it unsqueezed?

And how do I set the parameters to be trained, and how does saving work?
Should I use Fullmodel.parameters, model.parameters or parallel.parameters?
Also, what should I save?

The trick is that if you compute the loss outside DataParallel, the outputs are gathered on a single GPU. The loss is then calculated and backpropagated from there.

If you compute the loss inside DataParallel, then instead of returning a huge amount of data (which would be gathered on a single GPU), you return just a few floats.

The loss is unsqueezed because DataParallel works with batches, so it fails when a zero-dimensional tensor is returned. Zero-dimensional tensors are supported in recent versions of PyTorch, but they were not part of the early design.
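To make the unsqueeze step concrete, here is a minimal CPU-only sketch. It assumes the gather step behaves like a concatenation along dim 0 (simulated with torch.cat), and the four loss values are made up for illustration.

```python
import torch

# Hypothetical per-replica losses: a criterion normally returns a 0-dim tensor.
replica_losses = [torch.tensor(8.9), torch.tensor(7.5),
                  torch.tensor(6.6), torch.tensor(8.3)]
assert replica_losses[0].dim() == 0

# Concatenating along dim 0 needs at least one dimension,
# so each loss is unsqueezed to shape [1] first.
unsqueezed = [torch.unsqueeze(l, 0) for l in replica_losses]
gathered = torch.cat(unsqueezed, dim=0)   # shape [4], one entry per replica

# Reduce to a single scalar before calling backward().
total_loss = gathered.mean()
```

With the wrapper from this thread, the returned loss therefore has one entry per GPU, so something like loss.mean() or loss.sum() is needed before loss.backward().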

You can set up your model, optimizer parameters, etc. as usual and then call the function. As you can see, it takes a model as input, so it is agnostic to the model.
When saving, I would save the base model.

If you check the state dict, a couple more levels of nesting will appear, like datap.module.fullmodel.model
so you can save datap.module.fullmodel.model.state_dict()
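As a rough sketch of the saving advice: with a FullModel wrapper like the one posted in this thread, the base model appears to sit at datap.module.model, but check the keys of your own state dict. The nn.Linear network and the in-memory buffer are stand-ins chosen for illustration.

```python
import io
import torch
import torch.nn as nn

class FullModel(nn.Module):
    """Wrapper from this thread: runs the model and the loss in one forward."""
    def __init__(self, model, loss):
        super().__init__()
        self.model = model
        self.loss = loss

    def forward(self, targets, *inputs):
        outputs = self.model(*inputs)
        return torch.unsqueeze(self.loss(outputs, targets), 0), outputs

base = nn.Linear(3, 1)                                   # stand-in for the real network
optimizer = torch.optim.SGD(base.parameters(), lr=0.1)   # the wrapper shares these tensors
datap = nn.DataParallel(FullModel(base, nn.L1Loss()))

wrapped_keys = list(datap.state_dict().keys())            # prefixed with "module.model."
base_keys = list(datap.module.model.state_dict().keys())  # plain "weight", "bias"

# Save only the base model so it can be reloaded without any wrapper.
buffer = io.BytesIO()                                     # in-memory file; use a path in practice
torch.save(datap.module.model.state_dict(), buffer)
buffer.seek(0)
restored = nn.Linear(3, 1)
restored.load_state_dict(torch.load(buffer))
```

Passing base.parameters() to the optimizer works because the wrappers hold the same parameter tensors as the base model.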

Check that more closely

TypeError: _gather(): incompatible function arguments. The following argument types are supported:
    1. (tensors: List[at::Tensor], dim: int, destination_index: Optional[int]) -> at::Tensor

Invoked with: (tensor([8.9125], device='cuda:0', grad_fn=<UnsqueezeBackward0>), tensor([7.4709], device='cuda:1', grad_fn=<UnsqueezeBackward0>), tensor([6.6035], device='cuda:2', grad_fn=<UnsqueezeBackward0>), tensor([8.3252], device='cuda:3', grad_fn=<UnsqueezeBackward0>)), 0, [3]

It doesn't work so well…
My return statement is

return torch.unsqueeze(loss1+loss2,0)

because my loss is computed by two loss functions
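Reading the traceback, the last argument passed to _gather is [3], a list, where the signature expects Optional[int]. That suggests output_device was given as a list rather than a single index. A sketch assuming that is the cause (the helper name as_single_device is made up for illustration):

```python
def as_single_device(output_device):
    """Hypothetical helper: turn [3] or (3,) into 3; pass ints and None through."""
    if isinstance(output_device, (list, tuple)):
        if len(output_device) != 1:
            raise ValueError("output_device must name exactly one device")
        return output_device[0]
    return output_device

device_ids = [0, 1, 2, 3]               # a list of GPUs is fine here
output_device = as_single_device([3])   # the gather target is a single device
# model = DataParallel_withLoss(model, loss,
#                               device_ids=device_ids, output_device=output_device)
```

If output_device was already an int in your call, the cause is something else and this does not apply.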

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import torch
import torch.nn as nn

class FullModel(nn.Module):
  def __init__(self, model, loss):
    super(FullModel, self).__init__()
    self.model = model
    self.loss = loss

  def forward(self, targets, *inputs):
    outputs = self.model(*inputs)
    loss = self.loss(outputs, targets)
    return torch.unsqueeze(loss,0),outputs

def DataParallel_withLoss(model,loss,**kwargs):
    model=FullModel(model, loss)
    # Optional settings; None lets DataParallel fall back to its defaults.
    device_ids=kwargs.get('device_ids')
    output_device=kwargs.get('output_device')
    if 'cuda' in kwargs.keys():
        # Move the wrapped model to the requested CUDA device.
        cudaID=kwargs['cuda']
        model=torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device).cuda(cudaID)
    else:
        model=torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device).cuda()
    return model

class toy(nn.Module):
    def __init__(self):
        super(toy, self).__init__()
        self.bias = torch.nn.Parameter(torch.tensor([0.0]),requires_grad=True)
        self.sig= nn.Sigmoid()
    def forward(self,x):
        return self.sig(x+self.bias)
model = toy()
optimizer = torch.optim.SGD(model.parameters(),lr=1)
loss = torch.nn.L1Loss()
model = DataParallel_withLoss(model,loss)
gt = torch.tensor([0.0])
input = torch.tensor([0.0])
loss,_ = model(gt,input)

This runs
(toy example)