Data Parallel for submodules not working properly

Hi, I have 4 separate nn.Modules which I instantiate inside my final model class (itself an nn.Module), so I have something like:

class M(nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.A = A()
        self.B = B()
        self.C = C()
        self.D = D()

For training I do the following:

model = M().to(device)
model = nn.DataParallel(model)
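(For context, here is a runnable version of the setup above, with toy nn.Linear layers standing in for A/B/C/D — those stand-ins and the tensor shapes are assumptions, not the real modules. It falls back to CPU when no GPU is visible.)

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super(M, self).__init__()
        # Toy stand-ins for the real A/B/C/D submodules
        self.A = nn.Linear(8, 8)
        self.B = nn.Linear(8, 8)
        self.C = nn.Linear(8, 8)
        self.D = nn.Linear(8, 4)

    def forward(self, x):
        return self.D(self.C(self.B(self.A(x))))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = M().to(device)
if torch.cuda.device_count() > 1:
    # Replicates the whole model on each visible GPU and
    # splits the input batch along dim 0
    model = nn.DataParallel(model)

out = model(torch.randn(16, 8, device=device))
print(out.shape)  # torch.Size([16, 4])
```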

However, I do not see the other 3 GPUs being utilized at all on a 4-GPU machine.
I ran export CUDA_VISIBLE_DEVICES=0,1,2,3 before starting training.

Can someone give me some insight into what I am doing wrong?

One thought: is your batch size at least as large as the number of GPUs? If it's smaller, DataParallel cannot split the batch across all of the available GPUs, and the extras sit idle.
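You can see the size math on CPU: DataParallel scatters the input along dim 0 in roughly the way torch.chunk does (the real scatter handles more cases, but the chunk sizes come out the same). With a batch of 6 and 4 GPUs, only 3 chunks are produced, so one GPU gets nothing:

```python
import torch

batch = torch.randn(6, 8)  # batch of 6 samples
num_gpus = 4

# Chunk size is ceil(6 / 4) = 2, so only 3 chunks are produced
chunks = torch.chunk(batch, num_gpus, dim=0)
print(len(chunks), [c.shape[0] for c in chunks])  # 3 [2, 2, 2]
```

With a batch size of 1, there is a single chunk and only GPU 0 ever does any work.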

Another thought: note that having 4 submodules and 4 GPUs won't put one submodule on each GPU, if that's what you intended. DataParallel replicates the entire model on every GPU and splits the input batch; one-module-per-device is model parallelism, which you would have to wire up yourself (unless perhaps only one submodule is selected and used in a given forward pass).
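If one-submodule-per-GPU is really what you want, a minimal model-parallel sketch looks like this (again using toy nn.Linear stand-ins for A/B/C/D, which are assumptions, and falling back to CPU when no GPUs are visible):

```python
import torch
import torch.nn as nn

class PipelinedM(nn.Module):
    """Hypothetical model-parallel variant: each submodule pinned to its own device."""
    def __init__(self, devices):
        super().__init__()
        self.devices = devices
        # Toy stand-ins for A/B/C/D, each placed on its assigned device
        self.A = nn.Linear(8, 8).to(devices[0])
        self.B = nn.Linear(8, 8).to(devices[1])
        self.C = nn.Linear(8, 8).to(devices[2])
        self.D = nn.Linear(8, 4).to(devices[3])

    def forward(self, x):
        # Move the activations to each submodule's device before applying it
        x = self.A(x.to(self.devices[0]))
        x = self.B(x.to(self.devices[1]))
        x = self.C(x.to(self.devices[2]))
        return self.D(x.to(self.devices[3]))

n = torch.cuda.device_count()
devices = ([torch.device(f"cuda:{i % n}") for i in range(4)] if n
           else [torch.device("cpu")] * 4)
model = PipelinedM(devices)
out = model(torch.randn(16, 8))
print(out.shape)  # torch.Size([16, 4])
```

Note the trade-off: because the submodules run sequentially, only one GPU is busy at a time unless you also pipeline micro-batches through them.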

What does nvidia-smi show while training is running? And what is your device variable set to?