RuntimeError when using DataParallel

import torch.nn as nn

class Add(nn.Module):
    def __init__(self, scope=None):
        super().__init__()

    def forward(self, bottoms):
        # bottoms is a list of tensors; sum them elementwise.
        assert len(bottoms) > 1, "The length of bottoms must be larger than 1."
        result = bottoms[0]
        for i in range(1, len(bottoms)):
            result = result + bottoms[i]
        return [result]
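
For reference, the layer is called with a list of tensors; this snippet is just an illustration:

import torch

add = Add()
a = torch.randn(2, 3)
b = torch.randn(2, 3)
out = add([a, b])[0]  # elementwise sum; forward returns a one-element list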

My model uses the Add layer, which is a subclass of nn.Module. Training works fine on a single GPU, but when I train with DataParallel I get this RuntimeError: "Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'".
What should I do to make it train normally?

Make sure your data is on the same device as the model. When you wrap the model, use torch.nn.DataParallel(model, device_ids=[0, 1, 2, ...]). The first device (device 0 in this example) is where the model lives before it is replicated.
Make sure the input is on that same device; the first GPU acts as the staging device from which the batch is scattered to the other devices.
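
Here is a minimal sketch of the fix, assuming two GPUs; MyModel is a hypothetical wrapper around your Add layer, used only to make the example self-contained:

import torch
import torch.nn as nn

class MyModel(nn.Module):  # hypothetical model wrapping the Add layer above
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)
        self.add = Add()

    def forward(self, x):
        # Add the linear projection back onto the input (a residual-style sum).
        return self.add([self.fc(x), x])[0]

device = torch.device("cuda:0")

model = MyModel().to(device)                       # model lives on device 0
model = nn.DataParallel(model, device_ids=[0, 1])  # replicated onto 0 and 1

x = torch.randn(4, 8, device=device)               # input on device 0 as well
out = model(x)                                     # scattered from device 0

Moving the model to device 0 before wrapping, and moving every input batch to that same device, is what resolves the "same device" mismatch in the error message.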