Question about: Packed RNN with DataParallel

Hi,

I am building an OCR model in PyTorch and running into an issue with DataParallel and Packed RNN sequences.

(1) My code seems pretty straightforward:

def forward(self, input, input_lengths):
    # pack, run the LSTM, then unpack back to a padded T x B x F tensor
    packed_input = nn.utils.rnn.pack_padded_sequence(input, input_lengths)
    packed_output, _ = self.lstm(packed_input)
    output, _ = nn.utils.rnn.pad_packed_sequence(packed_output)
    return output

It works fine on a single GPU, but when I try to use DataParallel to run on multiple GPUs I run into a problem: the unpacked tensors on each GPU have differing numbers of time steps, depending on the maximum value in the chunk of the input_lengths array that got passed to that GPU.

I am resolving the problem by doing this:

# max_T is the maximum sequence length across the whole batch; pad this GPU's
# output up to max_T so DataParallel can gather equally-sized tensors
padded_output = Variable(torch.zeros(max_T, output.size(1), output.size(2)))
padded_output[:input_lengths.max(), :, :] = output
return padded_output

I just want to check with folks who are more familiar with PyTorch to see if this is the appropriate way to resolve this problem, or if there is a better way?
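(For reference: more recent PyTorch releases give pad_packed_sequence a total_length argument that covers exactly this DataParallel situation, making the manual padding unnecessary. A minimal sketch, assuming a PyTorch version with that argument; max_T here is the padded length of the whole batch, and since it is a plain int, DataParallel passes the same value to every replica.)

def forward(self, input, input_lengths, max_T):
    # pack, run the LSTM, then unpack to exactly max_T time steps so every
    # replica returns a tensor of the same size for the gather step
    packed_input = nn.utils.rnn.pack_padded_sequence(input, input_lengths)
    packed_output, _ = self.lstm(packed_input)
    output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, total_length=max_T)
    return output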

(2) Another issue, which is easily corrected by a small change to DataParallel, is that my model takes an image tensor of dimension BCHW and outputs LSTM outputs of dimension TBF. That is, when I scatter inputs to the various GPUs I want to split on the batch dimension of my input tensor (dim 0), and when I gather outputs from the various GPUs I want to stack them on the batch dimension of my output tensor (dim 1).

I have modified DataParallel locally so that I can pass in different in_dim and out_dim parameters as opposed to a single dim parameter. Is there any interest in making such a change in the official repository?
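(One workaround that avoids patching DataParallel, sketched here under my own assumptions: keep dim=0 for both scatter and gather, and transpose inside a small wrapper so the output is batch-first at the gather point. BatchFirstWrapper is an illustrative name, not something from the library, and it assumes the wrapped model already pads its outputs to a common max_T as in the fix above.)

class BatchFirstWrapper(nn.Module):
    # illustrative wrapper: takes B x C x H x W images, returns B x T x F so that
    # stock DataParallel can scatter and gather on dim 0 for both input and output
    def __init__(self, ocr_model):
        super().__init__()
        self.ocr_model = ocr_model

    def forward(self, images, image_lengths):
        output = self.ocr_model(images, image_lengths)  # T x B x F
        return output.transpose(0, 1).contiguous()      # B x T x F

# usage sketch (lengths handling is the separate issue discussed below):
# parallel_model = nn.DataParallel(BatchFirstWrapper(model)).cuda()
# output = parallel_model(images, lengths).transpose(0, 1)  # back to T x B x F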

Regards,
Stephen


I’m afraid that packed sequences aren’t compatible with DataParallel! It should raise an error, because I think that right now it computes something invalid. A fix would be to do pack_padded_sequence inside the DataParallel part (so instead of wrapping self.lstm you’d wrap a module that does pack + self.lstm + pad).

Hi @apaszke, Thanks for the quick reply.

Sorry for not being clearer; I think I am already doing what you are describing. I have a custom class that inherits from nn.Module, and it is this custom class that I wrap with DataParallel. Then, inside the forward() method of my custom class, I do the pack + lstm + unpack steps.

But the problem is that if I pass in a 100x2xF tensor with input_lengths = [100, 50], GPU 1 unpacks its LSTM output into a 100x1xF tensor while GPU 2 unpacks its LSTM output into a 50x1xF tensor, so when DataParallel tries to gather them along the batch dimension it complains about the sizes not matching. (Hence the fix in my code that pads the output tensors on all GPUs; in this example I would pad GPU 2’s output to be 100x1xF.)

I see. Yeah this solution should work correctly.

@apaszke @stephenrawls

I have a naive question here: in order to make the above forward function work on the GPU, the whole model and the input have to be moved to the GPU by calling .cuda(), right? But since pack_padded_sequence needs a sequence-lengths parameter (which is typically a list object if on the CPU), do I have to convert this list into a CUDA tensor?

I’m having issues when doing this conversion, i.e. calling LongTensor(input_lengths).cuda() and passing it to pack_padded_sequence.

Hi @ecolss. Yes to your first question: to move to the GPU you need to call model = model.cuda() and input = input.cuda().

No to your second question: the pack_padded_sequence() function expects a Python iterable living on the CPU, so you should not call .cuda() on the input_lengths param.

Stephen

@stephenrawls

Oh yes, I just found the root cause of my problem. Thanks!

Also, the list of lengths has the same length as the batch, so you can pass it as an input to the special module, and DataParallel should be able to slice it properly.
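(A minimal sketch of that pattern, under my own assumptions: a batch-first layout so that stock DataParallel with dim=0 can slice both the input and the lengths, lengths passed as a 1-D tensor, and pad_packed_sequence’s total_length used so every replica unpacks to the same number of time steps. All names are illustrative.)

import torch
import torch.nn as nn

class PackedLSTM(nn.Module):
    # illustrative module: pack + LSTM + unpack all happen inside the
    # data-parallel region, as suggested earlier in the thread
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, input, input_lengths):
        # input: B x T x F chunk for this replica; T is the padded length of
        # the whole batch, since DataParallel only splits the batch dimension
        total_length = input.size(1)
        # DataParallel moves the scattered lengths chunk onto the GPU, but
        # pack_padded_sequence wants the lengths on the CPU
        lengths = input_lengths.cpu()
        packed = nn.utils.rnn.pack_padded_sequence(input, lengths, batch_first=True)
        packed_out, _ = self.lstm(packed)
        # total_length makes every replica unpack to the same number of time
        # steps, so the gather along the batch dimension succeeds
        output, _ = nn.utils.rnn.pad_packed_sequence(
            packed_out, batch_first=True, total_length=total_length)
        return output

model = nn.DataParallel(PackedLSTM(128, 256)).cuda()
input = torch.randn(8, 100, 128).cuda()
# one length per batch element, sorted in decreasing order so each chunk
# handed to pack_padded_sequence is also sorted
lengths = torch.tensor([100, 90, 80, 75, 60, 50, 40, 30])
output = model(input, lengths)  # 8 x 100 x 256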

@stephenrawls Hello, I used the same trick with variable sequence lengths, thank you for the solution. But my question is: why don’t you place padded_output on the GPU with .cuda()? Without that I can’t get it to work.
Also, I wanted to ask what loss function you use. I use CrossEntropy with zero weight for the 0-th class, because that is the class used for padding, and it seems to work.
@apaszke Hello to you too. Will someone improve the pack_padded_sequence() function so it works on multiple GPUs without such tricks? Maybe someone could add a good example of variable-sequence-length RNN LMs in multi-GPU mode; I have seen many questions about this.
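(For what it’s worth, a minimal sketch of the loss setup described above; num_classes is an illustrative value and class 0 is assumed to be the padding label, as in the post.)

import torch
import torch.nn as nn

num_classes = 80  # illustrative size; class 0 assumed to be the padding label

# variant described above: give the padding class zero weight
weights = torch.ones(num_classes)
weights[0] = 0.0
criterion = nn.CrossEntropyLoss(weight=weights)

# a closely related alternative: tell the loss to skip padding targets
criterion = nn.CrossEntropyLoss(ignore_index=0)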


I have a bit of confusion here.
Can I create a variable at run-time and call .cuda() on it while working on multiple GPUs?

For the above code to work on multiple GPUs:

  1. Do I need to call h_0.cuda() and c_0.cuda() in the forward function? (See the sketch at the end of this post.)
  2. Can I just call .cuda() on the wcembeds variable, which is created at run-time, and expect it to work?

From the above discussion, my lengths tensor, which is not a Variable, should not be converted with .cuda(). This means:

  3. inputs -> inputs.cuda()
  4. lengths -> lengths (no cuda transformation?)
  5. Do I need to convert my original_indices tensor with .cuda()?

Sorry for being very naive. I will have to spend some time understanding how stuff works on the GPU, but for the time being could you guys help?

The above code is for creating character level embeddings for each word in the input. The concatenation of the last hidden vector of the forward and backward RNN gives the resultant embedding for each word.

Also,
self.char_embed = nn.Embedding(self.cvocab_size, self.cembed_dim, padding_idx=self.padding_idx)

I hope I am clear.
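(Not an authoritative answer, but a minimal sketch of how these pieces could fit together on a recent PyTorch: the initial hidden states are created on the input’s device instead of via a bare .cuda(), the lengths stay on the CPU, and each word embedding is the concatenation of the final forward and backward hidden states. All names here are illustrative, not the poster’s actual code.)

import torch
import torch.nn as nn

class CharWordEmbedder(nn.Module):
    # illustrative: embeds each word from its characters with a bidirectional LSTM
    def __init__(self, cvocab_size, cembed_dim, hidden_dim, padding_idx=0):
        super().__init__()
        self.char_embed = nn.Embedding(cvocab_size, cembed_dim, padding_idx=padding_idx)
        self.lstm = nn.LSTM(cembed_dim, hidden_dim, bidirectional=True)

    def forward(self, chars, char_lengths):
        # chars: T x B LongTensor of character ids, already on this replica's GPU;
        # char_lengths: sorted in decreasing order, kept (or moved back) on the CPU
        lengths = char_lengths.cpu() if torch.is_tensor(char_lengths) else char_lengths
        embeds = self.char_embed(chars)  # T x B x cembed_dim
        # allocate the initial states on the same device as the input, rather than
        # calling .cuda(), so each DataParallel replica gets states on its own GPU
        h_0 = embeds.new_zeros(2, embeds.size(1), self.lstm.hidden_size)
        c_0 = embeds.new_zeros(2, embeds.size(1), self.lstm.hidden_size)
        packed = nn.utils.rnn.pack_padded_sequence(embeds, lengths)
        _, (h_n, _) = self.lstm(packed, (h_0, c_0))
        # h_n: (num_layers * 2) x B x hidden_dim; concatenate the final forward
        # and backward states to get one embedding per word
        return torch.cat([h_n[-2], h_n[-1]], dim=1)  # B x (2 * hidden_dim)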

@apaszke Hi, do you happen to know the source code that we can reference please?

PyTorch 1.5 has completely fixed this issue, seamlessly, with no more gerrymandering required. I confirmed this today in a project I’m working on involving bidirectional GRUs on speech MFCCs.