My input batch is not split across the GPUs

Hi, I have a machine with 4 GPUs. My program was running smoothly; I set it up as I always do:

    model = model.cuda()
    model = nn.DataParallel(model) 

Following that, I did:

        images = images.cuda()
        target = target.cuda()
        output = model(images)

My input batch size is 64, so 16 samples on each GPU; everything works as planned.

Things go wrong when I split my model to have two forward methods:

        midoutput = model.module.forward1(images)
        output = model.module.forward2(midoutput)

Although I follow the exact same procedure as above, inside forward1 I get a batch of 64 and not 16 as expected.

When I check with GPUtil I get:
[GPUtil screenshot: only one GPU shows any utilization]

So although I used the same procedure and I have 4 GPUs, only 1 GPU is doing any work now.

Any idea?

Thanks a lot!

Maybe splitting the original forward() method into two different ones breaks inheritance and/or autograd in some way?

Why do you need to split it in the first place? I think you could retrieve the midoutput result in the original forward by keeping an intermediate variable and returning it along with the final output… Something like this:

    def forward(self, x):
        x = self.layer1(x)
        midoutput = x
        x = self.layer2(x)
        return midoutput, x

and then:

    midoutput, output = model(images)

I want to loop on one of them, in a way that is not possible with a single forward function.

I see! Maybe you could nest two different modules, keeping only one forward but looping on the submodule’s forward inside it?

Thanks, but I need to have two different forward functions. It's important for me to understand why doing that "breaks" the DataParallel setup; I think it's also important for the community.

Sure, I understand.

Digging a bit into the docs shows why it acts like this: check the source code for DataParallel here.

What happens when you call model(images) is that you invoke the DataParallel wrapper's forward method, which handles all the parallelism: it scatters the input across the GPUs, replicates your module, and runs each replica with calls like self.module(*inputs[0], **kwargs[0]), which in turn invoke the forward method of the wrapped module itself. Only forward is ever dispatched this way; when you call model.module.forward1 directly, you bypass the wrapper's forward entirely, so nothing gets split across multiple GPUs.
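Paraphrasing that source (with checks and details stripped out), DataParallel.forward does roughly this:

    # rough paraphrase of nn.DataParallel.forward (checks and details omitted)
    def forward(self, *inputs, **kwargs):
        # split the input batch (and any tensor kwargs) across the devices
        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
        if len(self.device_ids) == 1:
            return self.module(*inputs[0], **kwargs[0])
        # copy the wrapped module onto every device
        replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
        # call each replica -- i.e. its forward() -- on its own chunk
        outputs = self.parallel_apply(replicas, inputs, kwargs)
        # collect the per-device outputs back on the output device
        return self.gather(outputs, self.output_device)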

Hence my recommendation of keeping the original forward method or using two nested modules.

One solution I can imagine is inheriting from the DataParallel class and overriding its forward method with your own looping behaviour, but doing that without breaking anything will probably be a bit complex…
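If you really want to go that route, here is a rough and completely untested sketch (the class name and the bound-method trick are my own invention, not an official API) that reuses DataParallel's scatter/replicate/gather machinery and calls an arbitrary method of each replica:

    from torch import nn

    class MultiForwardDataParallel(nn.DataParallel):
        # Untested sketch: run an arbitrary method of the wrapped module
        # with DataParallel-style scatter / replicate / parallel_apply / gather.
        def _run(self, name, *inputs, **kwargs):
            if not self.device_ids:
                return getattr(self.module, name)(*inputs, **kwargs)
            inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
            if len(self.device_ids) == 1:
                return getattr(self.module, name)(*inputs[0], **kwargs[0])
            replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
            # parallel_apply only needs callables, so the replicas' bound
            # methods can stand in for the replicas themselves
            methods = [getattr(replica, name) for replica in replicas]
            outputs = self.parallel_apply(methods, inputs, kwargs)
            return self.gather(outputs, self.output_device)

        def forward1(self, *inputs, **kwargs):
            return self._run('forward1', *inputs, **kwargs)

        def forward2(self, *inputs, **kwargs):
            return self._run('forward2', *inputs, **kwargs)

    # hypothetical usage:
    # model = MultiForwardDataParallel(model)
    # midoutput = model.forward1(images)
    # output = model.forward2(midoutput)

No guarantees this covers everything DataParallel normally checks, though.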


Thanks for your investigation; I looked at the source code and I think you are right.
I will try your solution of two nested modules. Can you give a bit more detail about it?
I can also do something like:

    def forward(self, x, flag):
        if flag:
            ...  # do forward1 code
        else:
            ...  # do forward2 code

And then:

        images = images.cuda()
        target = target.cuda()
        midoutput = model(images, flag=True)
        output = model(midoutput, flag=False)

Do you see any drawback to this approach?

I think the second solution should work! I can’t see a reason why it wouldn’t, at least. Did you try it?
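Just to make that concrete, here is roughly what I'd expect the flag version to look like end to end (untested, and the layer shapes are made up):

    import torch
    from torch import nn

    class TwoStageNet(nn.Module):
        # hypothetical two-stage model: the flag selects which stage runs
        def __init__(self):
            super(TwoStageNet, self).__init__()
            self.layer1 = nn.Linear(32, 16)
            self.layer2 = nn.Linear(16, 8)

        def forward(self, x, flag=True):
            print('batch on this replica:', x.size(0))  # expect 16 with 4 GPUs
            if flag:
                return self.layer1(x)   # the "forward1" part
            return self.layer2(x)       # the "forward2" part

    model = nn.DataParallel(TwoStageNet().cuda())
    images = torch.randn(64, 32).cuda()

    midoutput = model(images, flag=True)    # images are scattered, flag is copied to each replica
    output = model(midoutput, flag=False)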

For nested modules, here’s a simplified example:

    from torch import nn

    class SubModule(nn.Module):
        def __init__(self):
            super(SubModule, self).__init__()
            self.layer1 = ...

        def forward(self, x):
            return self.layer1(x)

    class Module(nn.Module):
        def __init__(self, submodule):
            super(Module, self).__init__()
            self.submodule = submodule
            self.layer2 = ...

        def forward(self, x):
            x = self.submodule(x)
            x = self.layer2(x)
            return x
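And if you then want the loop, it would live inside that single forward, so DataParallel still splits the batch across the GPUs. Filling in dummy layers and an arbitrary loop count just to make it runnable:

    import torch
    from torch import nn

    class SubModule(nn.Module):
        def __init__(self):
            super(SubModule, self).__init__()
            self.layer1 = nn.Linear(16, 16)   # dummy layer

        def forward(self, x):
            return self.layer1(x)

    class Module(nn.Module):
        def __init__(self, submodule, n_iters=3):
            super(Module, self).__init__()
            self.submodule = submodule
            self.n_iters = n_iters
            self.layer2 = nn.Linear(16, 8)    # dummy layer

        def forward(self, x):
            # loop on the submodule inside the one forward that
            # DataParallel actually calls
            for _ in range(self.n_iters):
                x = self.submodule(x)
            return self.layer2(x)

    model = nn.DataParallel(Module(SubModule()).cuda())
    images = torch.randn(64, 16).cuda()
    output = model(images)   # each replica loops on its own 16-sample chunk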