Using DataParallel for custom classes

Hi,
I have a class for my model that uses some other model in it. For example:

import torch
import torch.nn as nn

class Mynet1(nn.Module):
    def __init__(self):
        super(Mynet1, self).__init__()
        self.fc1 = nn.Linear(4160, 500)

class Mynet2(nn.Module):
    def __init__(self, mynet):
        super(Mynet2, self).__init__()
        self.fc1 = nn.Linear(4160, 500)
        self.layer = mynet

mynet1 = Mynet1()
model = torch.nn.DataParallel(Mynet2(mynet1))
model.cuda()
So if I want to run Mynet2 on multiple devices, I will do torch.nn.DataParallel(Mynet2). Do I also need to put Mynet1 on multiple devices, i.e. do torch.nn.DataParallel(Mynet1) and then pass it to Mynet2?
Or will it automatically be on multiple devices since I put Mynet2 on multiple devices?

One more question related to DataParallel. Let's say I have a custom conv2d class that uses torch operations like torch.mul, torch.sub, etc. When I was running this code on a single device, I used to call ".to(device)" on all the intermediate tensors in my custom conv2d class. Now, when I run a model that uses this custom conv2d class on multiple devices via DataParallel, I keep getting this error: "RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_batch_norm)".

I have moved my model that uses this custom conv2d class to multiple devices using torch.nn.DataParallel, so I don't understand why I keep getting this error. Is it because the intermediate tensors inside my custom conv2d class are on a single device?
If so, how can I get these intermediate tensors onto all the devices?
Should I define them as model parameters using torch.nn.Parameter?

Thanks

nn.DataParallel will automatically create the model replica with all internal submodules, parameters, and buffers on the specified devices, so you can wrap the main model only into nn.DataParallel.
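
For the example above, this boils down to something like the following (a minimal sketch based on the snippet in the question):

mynet1 = Mynet1()
model = torch.nn.DataParallel(Mynet2(mynet1))
model.cuda()  # moves the wrapped Mynet2 (and the Mynet1 inside it) to the GPU
# DataParallel then replicates everything per forward pass;
# no separate DataParallel wrapper around Mynet1 is needed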

It depends on how you've created these tensors.
All registered nn.Parameters and buffers (registered via self.register_buffer('name', torch.randn(1))) will be pushed to the corresponding device automatically.
However, if you have created a plain tensor and pushed it to a device via to('cuda'), you would either have to use the device attribute of another parameter or register the tensor as an nn.Parameter (if it needs gradients) or as a buffer.
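
As a small illustration of both approaches (the module and tensor names here are made up, not from your code):

class MyConv(nn.Module):
    def __init__(self):
        super(MyConv, self).__init__()
        self.weight = nn.Parameter(torch.randn(8, 8))
        # option 1: register the constant as a buffer, so to()/cuda()
        # (and thus DataParallel) will move it along with the module
        self.register_buffer('scale', torch.tensor(0.5))

    def forward(self, x):
        # option 2: create the tensor on the device of an existing parameter
        bias = torch.zeros(8, device=self.weight.device)
        return x @ self.weight * self.scale + bias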

Thanks for the reply @ptrblck. So I won't even have to move the mynet1 instance to the device before passing it as an argument to Mynet2?

Regarding the answer to my second question: I actually don't need the gradients for these intermediate tensors. Could you please explain a bit more what you mean by "However, if you have created a plain tensor and pushed it to a device via to('cuda'), you would either have to use the device attribute of another parameter or register the tensor as an nn.Parameter (if it needs gradients) or as a buffer"?

One solution to this problem, I feel, would be to define all my intermediate tensors as nn.Parameters with requires_grad=False. Alternatively, I could register them via self.register_buffer. Would both solutions behave the same?

That is correct. Your mynet1 would act as a “standard” module, e.g. similar to nn.Linear or nn.Conv2d, so you can just pass it into the parent model (Mynet2) and call to() or cuda() on the parent.

All nn.Modules, nn.Parameters, and buffers will be moved to the device properly when to() or cuda()/cpu() is called on the parent module, while plain tensors will not. This code snippet demonstrates the behavior:

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.module = nn.Linear(1, 1)
        self.param = nn.Parameter(torch.randn(1, 1))
        self.register_buffer('buf', torch.randn(1, 1))
        self.ten = torch.randn(1, 1)

    def forward(self, x):
        return self.module(x) + self.param + self.buf + self.ten
        
model = MyModel()
print(model.module.weight.device)
print(model.param.device)
print(model.buf.device)
print(model.ten.device)
out = model(torch.randn(1, 1)) # works
> cpu
cpu
cpu
cpu

model.to('cuda')
print(model.module.weight.device)
print(model.param.device)
print(model.buf.device)
print(model.ten.device)
out = model(torch.randn(1, 1).to('cuda')) # fails, as self.ten is still on the CPU
> cuda:0
cuda:0
cuda:0
cpu
RuntimeError: expected device cuda:0 but got device cpu

If you don’t want to train this tensor, use a buffer. Using a parameter with requires_grad=False would also work, but a buffer would be the cleaner approach.
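
To make the difference concrete (a small sketch, not from the original post): a frozen nn.Parameter still shows up in model.parameters() and would therefore be handed to an optimizer, while a buffer does not, although both end up in the state_dict and are moved by to()/cuda():

class Demo(nn.Module):
    def __init__(self):
        super(Demo, self).__init__()
        self.frozen = nn.Parameter(torch.ones(3), requires_grad=False)
        self.register_buffer('const', torch.ones(3))

demo = Demo()
print([name for name, _ in demo.named_parameters()])  # ['frozen']
print([name for name, _ in demo.named_buffers()])     # ['const']
print(list(demo.state_dict().keys()))                 # ['frozen', 'const']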


@ptrblck
Actually, for my conv2d function I am using custom autograd Functions, like below:

class Conv2d_function(Function):
    @staticmethod
    def forward(ctx, x1, x2):
        y1 = torch.tensor([1, 2, 3]).to(device)
        y2 = torch.tensor([4, 5, 6]).to(device)
        ...

class Conv2d(nn.Module):
    def __init__(self, in_channels, ...):
        super(Conv2d, self).__init__()

    def forward(self, input):
        return Conv2d_function.apply(input)

Creating the model using the custom conv2d class:

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.layer1 = Conv2d(...)
        self.layer2 = Conv2d(...)

model1 = Model()

So when I use DataParallel for this model1, I get this error: "RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_batch_norm)".

Is this error occurring because I am creating tensors using .to(device) in the forward function of the Conv2d_function class?

Actually, the tensors y1 and y2 depend on the input to the forward function of the Conv2d class, so I can't define them in the __init__ of the Conv2d class via register_buffer or as a Parameter. I can only define them in the forward function of the Conv2d_function class. But with DataParallel I get the above error, which I feel is because I am moving these tensors to the GPU in my forward pass.

Yes, and since you are not specifying the device id, the default device will be used in each model replica.

You could use the device attribute of the incoming tensors instead:

y1 = torch.tensor([1,2,3]).to(x1.device)
y2 = torch.tensor([4,5,6]).to(x1.device)
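
For completeness, applied to the Conv2d_function sketch from above this could look like (same elided body as in your snippet):

class Conv2d_function(Function):
    @staticmethod
    def forward(ctx, x1, x2):
        # create the helper tensors on the same device as the incoming chunk,
        # so every DataParallel replica works on its own GPU
        y1 = torch.tensor([1, 2, 3], device=x1.device)
        y2 = torch.tensor([4, 5, 6], device=x1.device)
        ...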

Ok. A quick question in line with this. When we pass the input to the model, we run a for loop over the dataloader as shown below:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
for batch_idx, (data, target) in enumerate(testloader):
    data_var = data.to(device)
    target_var = target.to(device)
    output = model(data_var)

So does DataParallel move this input to all the devices when we use data parallel?
Because while passing the input (data_var) to my model, I only move it to the default device.
Does PyTorch create copies of the input images on all the GPUs as well when we use DataParallel?

Because I want my tensors (y1, y2, …) to be on all the devices, and now I will be initializing these tensors using x1.device.

nn.DataParallel will split the input tensor in dim0 and will send each chunk to the corresponding model replica on a specific device.
I.e. your model’s forward will get an input of the shape [batch_size//nb_gpus, *].
If you need to create y1 and y2 inside the forward method, you should also consider the current batch size of the input.
Also, it is correct to push the input tensor in the DataLoader loop to the default device, since nn.DataParallel will take care of the scattering and gathering of the tensors and parameters.
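
A quick way to see this scattering and gathering in action (a sketch assuming at least two visible GPUs; the module name is made up):

class ShapeCheck(nn.Module):
    def forward(self, x):
        # each replica only sees its chunk of the batch
        print(x.shape, x.device)
        return x * 2

model = nn.DataParallel(ShapeCheck()).cuda()
out = model(torch.randn(8, 3).to('cuda'))
# with 2 GPUs this prints e.g.:
# torch.Size([4, 3]) cuda:0
# torch.Size([4, 3]) cuda:1
print(out.shape)  # torch.Size([8, 3]), gathered back on the default device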