Let’s say I have the following forward function in a module I have defined:
class Classifier(nn.Module):
    def __init__(self, *args):
        super(Classifier, self).__init__()
        # defining some layers

    def forward(self, feature_batch, tensor_A, *args):
        # some operations between feature_batch and tensor_A
        return results
This module is part of my global model, and tensor_A is a learnable parameter passed in from another module. The shapes involved are:
# feature_batch.shape : (32, 128)
# tensor_A.shape : (500, 128)  # is supposed to be like this
# feature_batch_expanded.shape : (32, 500, 128)
# tensor_A_expanded.shape : (32, 500, 128)  # is supposed to be like this
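For context, the operation between the two tensors is roughly along these lines (a simplified sketch that only reproduces the shapes above; the real operations in my forward are different):

import torch

feature_batch = torch.randn(32, 128)
tensor_A = torch.randn(500, 128)

# broadcast both tensors to (32, 500, 128), as in the shape comments above
feature_batch_expanded = feature_batch.unsqueeze(1).expand(-1, tensor_A.size(0), -1)
tensor_A_expanded = tensor_A.unsqueeze(0).expand(feature_batch.size(0), -1, -1)

print(feature_batch_expanded.shape)  # torch.Size([32, 500, 128])
print(tensor_A_expanded.shape)       # torch.Size([32, 500, 128])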
I have 2 GPUs, and I wrap the model with DataParallel before starting the training:
global_model = nn.DataParallel(global_model).to(self.device)
The problem I am having is a shape mismatch during these operations, because tensor_A is not what I expect:
# tensor_A.shape : (250, 128)
When I print the shapes inside forward, this is what I get:
# feature_batch.shape : (32, 128)
# tensor_A.shape : (250, 128)
# feature_batch.shape : (32, 128)
# tensor_A.shape : (250, 128)
So the forward pass is called twice before the operations start.
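For reference, here is a minimal standalone script showing how I understand the argument scattering (ToyClassifier is just a hypothetical stand-in for my real Classifier; it needs at least 2 GPUs to show the split):

import torch
import torch.nn as nn

class ToyClassifier(nn.Module):  # hypothetical stand-in for my real Classifier
    def forward(self, feature_batch, tensor_A, *args):
        # as far as I understand, DataParallel chunks every positional tensor
        # argument along dim 0, so both arguments should arrive here already split
        print("feature_batch:", feature_batch.shape, "tensor_A:", tensor_A.shape)
        return feature_batch

if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(ToyClassifier()).to("cuda")
    feature_batch = torch.randn(32, 128, device="cuda")
    tensor_A = torch.randn(500, 128, device="cuda", requires_grad=True)
    model(feature_batch, tensor_A)
    # I would expect this to print (16, 128) / (250, 128) once per GPU,
    # whereas in my real model I see (32, 128) / (250, 128) twice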
My questions are:

1 - Why is feature_batch not being split across the two GPUs? When I print its shape I don't get (16, 128) twice, which means the whole feature_batch is passed to only one GPU.
2 - How can I stop PyTorch from treating 500 as the batch size, so it stops splitting tensor_A? Is there an effective solution to this that does not cause gradient problems? (See the sketch below for the kind of workaround I mean.)
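To make question 2 concrete, this is the kind of workaround I mean by "stopping it from splitting tensor_A": giving tensor_A an explicit batch dimension before it enters the DataParallel wrapper, so the split happens on the batch axis instead of on the 500 axis. It reuses global_model, feature_batch and tensor_A from above, and I don't know whether it is correct or gradient-safe, which is exactly what I'm asking:

# sketch: add a batch dimension to tensor_A before the DataParallel call,
# so it gets chunked on the batch axis (32) instead of on the 500 axis
tensor_A_batched = tensor_A.unsqueeze(0).expand(feature_batch.size(0), -1, -1)  # (32, 500, 128)
results = global_model(feature_batch, tensor_A_batched)
# inside forward, tensor_A would then arrive as (16, 500, 128) on each GPU,
# but I am not sure the gradients still flow back to tensor_A correctly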