Applying nn.DataParallel gives a wrong-size matrix error from cuBLAS when calling the forward method

I have refactored the PyTorch sequence-to-sequence tutorial and am attempting to use nn.DataParallel to run the data on my two GPUs.

I run into a problem when I attempt to pass the encoder output to the decoder during training. For some reason, when I pass the encoder output into the decoder, the tensor dimensions change, leading to a wrong-size matrix error.

Below is the section of my training module where everything goes wrong.

encoder_hidden = encoder.module.initHidden()

encoder_optimizer.zero_grad()
decoder_optimizer.zero_grad()

input_length = input_tensor.size(0)
target_length = target_tensor.size(0)

encoder_outputs = torch.zeros(max_length, encoder.module.hidden_size, device=util.device())
loss = 0

for ei in range(input_length):

    (encoder_output, encoder_hidden) = encoder(input_tensor[ei], encoder_hidden)

    encoder_outputs[ei] = encoder_output[0, 0]

decoder_input = torch.tensor([[util.SOS_TOKEN]], device=util.device())

decoder_hidden = encoder_hidden

print('decoder_hidden.unsqueeze(0).size() (after encoding batches): ', decoder_hidden.unsqueeze(0).size())
print('encoder_outputs.unsqueeze(0).size() (after encoding batches): ', encoder_outputs.unsqueeze(0).size())

use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

if use_teacher_forcing:

    for di in range(target_length):

        (decoder_output, decoder_hidden, decoder_attention) = decoder(decoder_input, decoder_hidden, encoder_outputs)

The print statements prior to the for loop report the correct matrix dimensions. Then, when I call forward and print the same statement as the first expression of the method, the dimensions of the matrices have changed, and I get a dimension error from a bmm operation. This makes me believe that something in the PyTorch framework, and not my code, is changing these dimensions, but I do not know why.
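For context, the bmm in the decoder multiplies attn_weights.unsqueeze(0) by encoder_outputs.unsqueeze(0). Below is a minimal, self-contained sketch of the shape requirement, with sizes assumed from my max_length of 10 and hidden_size of 256 rather than taken from an actual run:

import torch

# Shapes assumed from my setup: max_length = 10, hidden_size = 256.
attn_weights = torch.zeros(1, 1, 10)       # attn_weights.unsqueeze(0)
encoder_outputs = torch.zeros(1, 10, 256)  # encoder_outputs.unsqueeze(0)

# Inner dimensions match (10 and 10), so this works and yields [1, 1, 256].
print(torch.bmm(attn_weights, encoder_outputs).size())

# With the halved tensor I see inside forward, the inner dimensions no
# longer match (10 vs. 5), so bmm raises a size-mismatch RuntimeError.
halved_outputs = torch.zeros(1, 5, 256)
try:
    torch.bmm(attn_weights, halved_outputs)
except RuntimeError as e:
    print(e)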

Here is the AttnDecoderRNN class.

class AttnDecoderRNN(nn.Module):

    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = util.MAX_LENGTH

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        print('encoder_outputs.unsqueeze.size (AttnDecoderRNN.forward check 1): ', encoder_outputs.unsqueeze(0).size())
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1
            )

        print('attn_weights.unsqueeze.size: ', attn_weights.unsqueeze(0).size())
        print('encoder_outputs.unsqueeze.size (AttnDecoderRNN.forward check 2): ', encoder_outputs.unsqueeze(0).size())

        attn_applied = torch.bmm(
            attn_weights.unsqueeze(0),
            encoder_outputs.unsqueeze(0)
            )

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        (output, hidden) = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)

        return (output, hidden, attn_weights)

Below is the main training method where the nn.DataParallel module is applied to my encoder and decoder.

encoder = model.EncoderRNN(input_lang.n_words, util.hidden_size)
decoder = model.AttnDecoderRNN(util.hidden_size, output_lang.n_words, dropout_p=0.1)

encoder = encoder.cuda()
decoder = decoder.cuda()

if torch.cuda.is_available():
    if torch.cuda.device_count() > 1:
        print('TRUTH BOMB')
        encoder = nn.DataParallel(encoder, device_ids=util.devices)
        decoder = nn.DataParallel(decoder, device_ids=util.devices)

I must be applying the DataParallel module incorrectly. Any help would be greatly appreciated.
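For reference, this is my understanding of how DataParallel is normally used, as a minimal sketch based on the documentation rather than on my actual models:

import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super(TinyModel, self).__init__()
        self.fc = nn.Linear(256, 256)

    def forward(self, x):
        # DataParallel splits x along dim 0 across the visible GPUs, runs
        # each chunk on its own replica, then gathers the outputs on the
        # default device.
        return self.fc(x)

model = TinyModel()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

batch = torch.zeros(10, 256).cuda()
print(model(batch).size())  # torch.Size([10, 256]) after gathering

My encoder and decoder, by contrast, are called one step at a time, so the tensors I pass to forward are not batched along dim 0 in the way DataParallel seems to expect.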

Edit: The dimensions of encoder_outputs are [1, 10, 256] after the encoding batches, but when I print it in forward after passing the values to the decoder, it has dimensions [1, 5, 256], so I am apparently losing half of my elements. After doing some digging into the PyTorch code, does this error arise from the _worker function within parallel_apply truncating my tensor because I am passing an array of tensors without the array itself being on the devices? Or is it failing to join these threads?
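If I understand the scatter step correctly, the halving would come from nn.DataParallel splitting every tensor argument along dim 0 before calling forward on each replica. A minimal sketch of that behaviour, with the shape and device ids assumed from my setup (max_length of 10, hidden_size of 256, GPUs 0 and 1):

import torch
import torch.nn as nn

# Assumed shape: [max_length, hidden_size] = [10, 256], as in my training loop.
encoder_outputs = torch.zeros(10, 256, device='cuda:0')

# nn.parallel.scatter splits a tensor along dim 0 across the given GPUs;
# as far as I can tell, this is what DataParallel does to forward arguments.
chunks = nn.parallel.scatter(encoder_outputs, [0, 1])
for chunk in chunks:
    print(chunk.size())  # torch.Size([5, 256]) per device, matching what I see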

Edit 2: I tried forcing the dimensions with view, both before and after the forward call on the AttnDecoderRNN, and this resulted in runtime errors from TensorShape.cpp involving chunking. This low-level error suggests to me that I have DataParallel set up incorrectly and that, because of it, the Python modules and the backend are not interacting properly.