Bug in Data Parallel?

Hi,

I have a network with two components (X, Y), which takes two inputs: A and B.

I wrap the network in nn.DataParallel. X returns a full batch size output (how?), while Y returns a 1/4th batch size output (this is expected, since I have 4 GPUs).

Should I not nest subclasses of nn.Module while using the DataParallel wrapper? Since DataParallel does support multiple inputs (https://github.com/pytorch/pytorch/pull/794), this behavior is unexpected.
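
To illustrate, the nested structure looks roughly like this (a placeholder sketch; X, Y, A and B stand in for the real modules and inputs):

    import torch
    import torch.nn as nn

    class X(nn.Module):
        def __init__(self):
            super(X, self).__init__()
            self.linear = nn.Linear(300, 300)

        def forward(self, a):
            return self.linear(a)

    class Y(nn.Module):
        def __init__(self):
            super(Y, self).__init__()
            self.linear = nn.Linear(300, 300)

        def forward(self, b):
            return self.linear(b)

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.x = X()
            self.y = Y()

        def forward(self, a, b):
            # with 4 GPUs each replica should only see a 1/4 chunk of a and b here
            return self.x(a), self.y(b)

    model = nn.DataParallel(Net()).cuda()
    out_x, out_y = model(torch.randn(64, 300).cuda(), torch.randn(64, 300).cuda())
    # I would expect both outputs to be gathered back to the full batch size (64) in dim0,
    # but in my real model one of them comes back with only 1/4 of the batch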

I even tried separating the X and Y modules and wrapping them in DataParallel separately, but I still get the same error.

Any hints on what might be wrong?

Could you post a minimal code snippet showing how you are using submodules inside your model and how you are applying nn.DataParallel to it?

I am getting the same error in both setups:

    import torch.nn as nn

    class EncoderCNN(nn.Module):
        def __init__(self, embed_size, pre_trained_emb_size=2048):
            """Load the pretrained ResNet features"""
            super(EncoderCNN, self).__init__()
            self.linear = nn.Linear(pre_trained_emb_size, embed_size)
            self.relu = nn.ReLU()  # nn.ReLU takes no size argument
            self.bn = nn.BatchNorm1d(embed_size, momentum=0.01)

        def forward(self, images):
            """Extract feature vectors from the input"""
            features = self.bn(self.relu(self.linear(images)))
            return features


    class TransformerEncoder(nn.Module):
        def __init__(self, vocab_size, embed_size, d_model=300, nhead=6, dim_feedforward=512, dropout=0.1):
            super(TransformerEncoder, self).__init__()
            self.input_size = vocab_size
            self.embed_size = embed_size
            self.embedding = nn.Embedding(self.input_size, self.embed_size)
            # Encoder CNN
            self.encoder_cnn = EncoderCNN(embed_size=self.embed_size)
            self.transformer_encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward)

        def forward(self, input, input_lengths, images):
            input = input.reshape((input.shape[1], input.shape[0]))
            embedded = self.embedding(input)
            output = self.transformer_encoder_layer(embedded)
            image_encodings = self.encoder_cnn(images)
            return output, image_encodings

Even if I take the EncoderCNN module out and wrap it in DataParallel separately, I get the same error. The final returned tensors, output and image_encodings, have different batch sizes :frowning:

I’m not sure why you are reshaping the input in TransformerEncoder's forward method.
Could you explain this workflow, as I think it might be related to this issue?

Thanks, :slight_smile:

  1. I am sending a max_len x batch_size x embedding_size input to the transformer layer. This is the same API as for an LSTM (without batch_first=True). Since the DataParallel module expects the module’s input to have the batch in the first dimension, I am passing the batch as the first dimension and then reshaping it before passing it to the transformer (rough sketch below).

  2. Same for the images: a batch_size x embedding_size input is passed to the module.

Do you see anything wrong here?
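
In code, the call looks roughly like this (a sketch using the TransformerEncoder posted above; vocab_size and the tensor sizes are just example values):

    import torch
    import torch.nn as nn

    model = nn.DataParallel(TransformerEncoder(vocab_size=10000, embed_size=300)).cuda()

    tokens = torch.randint(0, 10000, (64, 30)).cuda()  # batch_size x max_len, batch first for DataParallel
    lengths = torch.full((64,), 30, dtype=torch.long).cuda()
    images = torch.randn(64, 2048).cuda()              # batch_size x pre_trained_emb_size

    # DataParallel scatters along dim0, so each of the 4 replicas gets a 16-sample chunk;
    # inside forward the tokens are then brought into max_len x batch_size for the transformer
    output, image_encodings = model(tokens, lengths, images)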

Thanks for the information.
In that case you might want to use .permute, as .reshape might interleave the data:

x = torch.tensor([[0., 0.],
                  [1., 1.],
                  [2., 2.],
                  [3., 3.]])

print(x.reshape(x.size(1), x.size(0)))
> tensor([[0., 0., 1., 1.],
          [2., 2., 3., 3.]])
print(x.permute(1, 0))
> tensor([[0., 1., 2., 3.],
          [0., 1., 2., 3.]])

I don’t see any obvious errors. Could you add print statements inside the forward method, which will print the shape as well as the device of the input, output and each intermediate tensor?

Thanks,

The print statements and their output on the console are:

    def forward(self, input, input_lengths, images):
        ptvsd.break_into_debugger()
        print("Input", input.shape, input.get_device())
        input = input.permute((1, 0))  #input.reshape((input.shape[1], input.shape[0]))
        print("Permuted Input", input.shape, input.get_device())
        embedded = self.embedding(input)
        print("Embedded", embedded.shape, embedded.get_device())
        #packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        output = self.transformer_encoder_layer(embedded)
        print("Output", output.shape, output.get_device())
        #output, _ = torch.nn.utils.rnn.pad_packed_sequence(output)
        image_encodings = self.encoder_cnn(images)
        print("Image Encodings", image_encodings.shape, image_encodings.get_device())
        return output, image_encodings

And the output is:

Input torch.Size([64, 30]) 0
Input torch.Size([64, 30]) 1
Permuted Input torch.Size([30, 64]) 0
Input torch.Size([64, 30]) 2
Permuted Input torch.Size([30, 64]) 1
Permuted Input torch.Size([30, 64]) 2
Input torch.Size([64, 30]) 3
Embedded torch.Size([30, 64, 300]) 0
Embedded torch.Size([30, 64, 300]) 2
Embedded torch.Size([30, 64, 300]) 1
Output torch.Size([30, 64, 300]) 1
Output torch.Size([30, 64, 300]) 0
Output torch.Size([30, 64, 300]) 2
Permuted Input torch.Size([30, 64]) 3
Image Encodings torch.Size([64, 300]) 1
Image Encodings torch.Size([64, 300]) 2
Image Encodings torch.Size([64, 300]) 0
Embedded torch.Size([30, 64, 300]) 3
Output torch.Size([30, 64, 300]) 3
Image Encodings torch.Size([64, 300]) 3

And the returned tensors have the shape:

shape:torch.Size([120, 64, 300])
device:device(type='cuda', index=0)

and

shape:torch.Size([256, 300])
device:device(type='cuda', index=0)

It looks like you would need to permute the output of the TransformerEncoderLayer again, as its output has the shape [T, N, E] (batch dimension in dim1). nn.DataParallel gathers the chunks along dim0, which is the time dimension here (30 * 4 = 120), while the image encodings are gathered correctly in the batch dimension (64 * 4 = 256).
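
Something like this in your forward should let nn.DataParallel gather both outputs along the batch dimension (untested sketch of your forward method):

    def forward(self, input, input_lengths, images):
        input = input.permute(1, 0)                          # [N, T] -> [T, N]
        embedded = self.embedding(input)                     # [T, N, E]
        output = self.transformer_encoder_layer(embedded)    # [T, N, E]
        output = output.permute(1, 0, 2)                     # back to [N, T, E], so the batch dim is dim0 again
        image_encodings = self.encoder_cnn(images)           # [N, E]
        return output, image_encodings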

Thanks, the data parallel issue seems to be solved and the code is working. However, it is very slow (slower than a single GPU) and I am getting the error described here: How to flatten parameters?

Any hints on how to fix this? I assume flatten_parameters() must be called on each forward call, but even when I do that, I still get the linked warning and the code is still very slow.
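
For reference, this is the pattern I tried (a rough sketch; the Decoder module here is just a hypothetical example with an nn.LSTM, only meant to show where I call flatten_parameters()):

    import torch.nn as nn

    class Decoder(nn.Module):  # hypothetical module, just to show the pattern
        def __init__(self, embed_size, hidden_size):
            super(Decoder, self).__init__()
            self.lstm = nn.LSTM(embed_size, hidden_size)

        def forward(self, x):
            # call flatten_parameters() on every forward pass, as the linked warning suggests,
            # so that each DataParallel replica compacts its weights again
            self.lstm.flatten_parameters()
            output, hidden = self.lstm(x)
            return output, hidden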