Different output lengths on GPUs with nn.DataParallel for paraphrasing

Hi, I am using the nn.DataParallel module to run my Pegasus model on multiple GPUs.

The problem I am facing is that when DataParallel splits a batch across multiple GPUs, the paraphrases produced by PegasusForConditionalGeneration on each GPU do not all have the same length, since the output length depends on the input.
I cannot force the model to produce fixed-length output (by truncating longer outputs and padding shorter ones), as I don't want to truncate longer paraphrases or unnecessarily pad them out to some large length.
GPU:0 produces output of length 48 while GPU:1 produces length 26. Is there a way to solve this problem?

/torch/nn/parallel/comm.py", line 235, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Input tensor at index 1 has invalid shape [15, 48], but expected [15, 26]
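
One way around the mismatch is to wrap generate() in a small module that right-pads every replica's output to a common width before DataParallel gathers it; the padding is only internal and is dropped again at decode time, so no paraphrase gets truncated. A minimal sketch, assuming the tuner007/pegasus_paraphrase checkpoint, max_length=60 and the beam settings shown (all placeholders, not the original setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

class PegasusParaphraser(nn.Module):
    """Pads generate() output to a fixed width so every DataParallel
    replica returns the same shape and gather can stack them."""

    def __init__(self, name="tuner007/pegasus_paraphrase", max_length=60):
        super().__init__()
        self.model = PegasusForConditionalGeneration.from_pretrained(name)
        self.max_length = max_length
        self.pad_id = self.model.config.pad_token_id

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            out = self.model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_length=self.max_length,
                num_beams=5,
            )
        # Right-pad to max_length so every GPU returns an identical width.
        return F.pad(out, (0, self.max_length - out.size(1)), value=self.pad_id)


tokenizer = PegasusTokenizer.from_pretrained("tuner007/pegasus_paraphrase")
model = nn.DataParallel(PegasusParaphraser().cuda())

batch = tokenizer(
    ["The weather is lovely today.", "He could not find his keys anywhere."] * 8,
    return_tensors="pt", padding=True, truncation=True,
).to("cuda")

ids = model(batch["input_ids"], batch["attention_mask"])
# skip_special_tokens strips the extra pad tokens, so nothing is truncated.
print(tokenizer.batch_decode(ids, skip_special_tokens=True))
```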

Hi, this seems like it might be a bug in DataParallel, but a repro with an example model and training code would be needed to confirm. If you're able to put that together, could you please post an issue at Issues · pytorch/pytorch · GitHub?
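
For the repro request: the gather failure can be reproduced without Pegasus or any training code, using a dummy module whose output width depends on which replica runs it (illustrative only, not the original model; it just mimics length-dependent generation on two GPUs):

```python
import torch
import torch.nn as nn

class VarLenGenerator(nn.Module):
    """Stand-in for generate(): output width depends on which GPU runs it."""
    def forward(self, x):
        width = 26 if x.device.index == 0 else 48   # mirrors the 26 vs 48 mismatch
        return torch.ones(x.size(0), width, dtype=torch.long, device=x.device)

# Needs at least two visible GPUs to trigger the error.
model = nn.DataParallel(VarLenGenerator().cuda())
out = model(torch.randn(30, 8, device="cuda"))  # RuntimeError: shape mismatch in gather
```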