Batched seq2seq in PyTorch

I am trying to implement a batched seq2seq model in PyTorch, after understanding and implementing the single-example version. However, I am not sure whether my implementation is correct, because after a few epochs of training all it outputs is the padding character. Specifically, these are the changes I made from the tutorial:

  • Input is now a transposed matrix of shape sequence length x batch_size, where each column is a sequence.
  • Encoder is changed to handle batch inputs like this:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size,batch_size, n_layers=1):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        self.embedding = nn.Embedding(input_size, hidden_size, padToken)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, self.batch_size, -1) # sequence length x batch size x hidden_size
        output = embedded
        for i in range(self.n_layers):
            output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        result = Variable(torch.zeros(1, self.batch_size, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result
  • I found transferring AttentionDecoder into batched mode to be tricky. After considerable tinkering with matrix sizes, this is what I came up with:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, batch_size=5, n_layers=1, dropout_p=0.1, max_length=25):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout_p = dropout_p
        self.max_length = max_length
        self.batch_size = batch_size

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input)
        embedded = self.dropout(embedded) # 1xbatch_size x hidden_size
        dt = torch.cat((embedded, hidden), 2) # 1 x batch_size x hidden*2
        attn_weights = self.attn(dt[0])
        attn_weights = F.softmax(attn_weights)
        encoder_outputs = encoder_outputs.transpose(0,1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(1), # bmm - matmul
                                 encoder_outputs) # should be batch_size x 1 x hidden_size

        embedded = embedded.transpose(0,1)
        output = torch.cat((embedded, attn_applied), 2) # batch_size x 1 x hidden*2
        output = self.attn_combine(output.squeeze(1)).unsqueeze(1) # batch_size x 1 x hidden
        output = output.transpose(0,1)
        for i in range(self.n_layers):
            output = F.relu(output)
            output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]))
        return output, hidden, attn_weights

    def initHidden(self):
        result = Variable(torch.zeros(1, self.batch_size, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result
  • I am trying to implement scheduled sampling while training (using teacher forcing with probability 0.5). Teacher forcing is simple to implement: we feed the next real target as the input. But when I use the generated output as the next input instead, I do it like this:
# Without teacher forcing: use its own predictions as the next input
        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.data.topk(1)
            nis = []
            for nit in topi:
                ni = nit[0]
                nis.append(ni)

            decoder_input = Variable(torch.LongTensor([nis]))
            decoder_input = decoder_input.cuda() if use_cuda else decoder_input
            for ct in range(batch_size):
                loss += criterion(decoder_output[ct], target_variables[di][ct])
            loss = loss / batch_size

Is my implementation correct so far? While training, I find that the generated output quickly degenerates into pad tokens and nothing else. I have looked at related implementations, but each has one or two limitations that keep me from following it directly.

  • @spro’s batched seq2seq tutorial
    • LuongAttnDecoderRNN is not batched; it runs one step at a time.
    • While the notebook mentions scheduled sampling, only teacher forcing is implemented; the branch without teacher forcing is not.
    • The tutorial used PackedSequence (also this discussion), but I could not understand why it runs the encoding step only once. Is passing one timestep of each sequence in the batch to nn.GRU, step by step, the same as passing all timesteps of each sequence in a single call?

It would be great if someone could help me understand this implementation. Also, if I had used PackedSequence, how would I have had to modify the above?
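
On the question of one call versus one timestep at a time, here is a minimal sketch that compares the two on a toy nn.GRU and then runs the same batch through pack_padded_sequence. The sizes, the lengths and the variable names are made up for illustration and are not taken from the tutorial; it is meant as a quick sanity check, not as a drop-in encoder.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
seq_len, batch_size, hidden_size = 5, 3, 8  # toy sizes, chosen arbitrarily
gru = nn.GRU(hidden_size, hidden_size)
inputs = torch.randn(seq_len, batch_size, hidden_size)  # seq_len x batch x hidden
h0 = torch.zeros(1, batch_size, hidden_size)

# (a) Whole sequence in a single call.
out_full, h_full = gru(inputs, h0)

# (b) One timestep at a time, carrying the hidden state forward.
h = h0
steps = []
for t in range(seq_len):
    out_t, h = gru(inputs[t:t + 1], h)  # input slice is 1 x batch x hidden
    steps.append(out_t)
out_steps = torch.cat(steps, dim=0)

# (c) Packed: timesteps beyond each sequence's length are skipped entirely.
lengths = torch.tensor([5, 3, 2])  # hypothetical lengths, sorted in descending order
packed = pack_padded_sequence(inputs, lengths)
packed_out, h_packed = gru(packed, h0)
out_packed, out_lengths = pad_packed_sequence(packed_out)

print(torch.allclose(out_full, out_steps, atol=1e-6))
print(torch.allclose(h_packed[0, 1], out_full[2, 1], atol=1e-6))
```

In this sketch (a) and (b) should agree up to floating-point error, which suggests the per-step loop in the encoder is not needed. With packing, the final hidden state of each sequence is taken at its last real timestep rather than after the padding.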


@koustuvsinha have you figured out whether this implementation is correct? I am running into the same issue while attempting to implement scheduled sampling. If you’ve figured this out, do you have any good resources you could point to?

The implementation looks incorrect to me. The main problem seems to be that the loss always treats the padded positions of the target sequence as correct targets, so the model learns to predict the pad token (which, at best, is not very useful). This may be related to the behavior you observed at test time.

Personally, I think the problem can be tackled by optimizing a masked loss. PyTorch doesn’t seem to have a native one yet, so it may require a workaround. This discussion seems related: “How can i compute seq2seq loss using mask?”
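
As a rough sketch of what such a masked loss could look like for a single decoder step: one option is the ignore_index argument that nn.NLLLoss accepts, another is to multiply per-token losses by an explicit mask. The pad index PAD = 0, the toy sizes and the random log-probabilities below are assumptions for illustration only, not values from the code above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD = 0  # hypothetical pad index; use whatever index your vocabulary assigns

torch.manual_seed(0)
batch_size, vocab_size = 4, 7  # toy sizes
# Stand-in for one decoder step's output: batch_size x vocab_size log-probabilities.
log_probs = F.log_softmax(torch.randn(batch_size, vocab_size), dim=1)
targets = torch.tensor([3, PAD, 5, PAD])  # two real tokens, two pad positions

# Option 1: NLLLoss skips targets equal to ignore_index and averages over the rest.
criterion = nn.NLLLoss(ignore_index=PAD)
loss_ignore = criterion(log_probs, targets)

# Option 2: compute per-token losses, zero out the pad positions, and
# normalise by the number of real tokens.
per_token = F.nll_loss(log_probs, targets, reduction="none")
mask = (targets != PAD).float()
loss_masked = (per_token * mask).sum() / mask.sum()

print(loss_ignore.item(), loss_masked.item())
```

Both options compute the loss over the whole batch in one call, so the inner loop over batch elements is not needed. If every target in a step is a pad, mask.sum() is zero, so that case would need separate handling.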

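As for how to obtain the mask, I’m not sure whether this has been addressed, or whether there is a better/easier way, but what I’ve been doing is something like this (N is the batch size, K the maximum sequence length, and lens holds the N sequence lengths):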

cumsum = torch.Tensor(N, K).fill_(1).cumsum(-1) - 1
mask = cumsum < lens.view(N, 1).expand_as(cumsum)

Not sure if there is a better/easier/more standard way of creating the mask?
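
One alternative that I believe gives the same mask is to compare a row of positions against the lengths with broadcasting. The helper name length_mask and the example lengths are made up for illustration.

```python
import torch

def length_mask(lens, max_len=None):
    """Return an N x K bool mask that is True at positions before each length."""
    if max_len is None:
        max_len = int(lens.max())
    positions = torch.arange(max_len, device=lens.device)  # shape (K,)
    # (1, K) < (N, 1) broadcasts to (N, K)
    return positions.unsqueeze(0) < lens.unsqueeze(1)

lens = torch.tensor([3, 1, 2])  # hypothetical sequence lengths
mask = length_mask(lens, max_len=4)
print(mask)
```

The result can be cast with .float() to weight per-token losses. If the targets are laid out as sequence length x batch_size, transpose the mask before applying it.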