DistributedDataParallel does not work with a custom function in the model

I'm trying to use this model

but I'm getting this error:

```
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/SageMaker/tacotron2/model/model.py", line 480, in forward
    output_lengths)
  File "/home/ec2-user/SageMaker/tacotron2/model/model.py", line 452, in parse_output
    outputs[0].data.masked_fill_(mask, 0.0)
RuntimeError: The expanded size of the tensor (1079) must match the existing size (836) at non-singleton dimension 2. Target sizes: [4, 80, 1079]. Tensor sizes: [4, 80, 836]
```

How can I solve it?

Could you provide your DDP code to reproduce the issue? Also, does the model work properly without DDP?

https://colab.research.google.com/drive/104LtQ1zIioIOMQEPgVve77m5Rd4Gm0wU
Yes, it works fine; the same code also works fine on a single-GPU Colab instance.
It was tested on an 8×V100 instance from Amazon.