I am trying to implement a batched seq2seq model in PyTorch, after understanding and implementing the single-example version from the tutorial. However, I am not sure whether my implementation is correct, because after a few epochs of training all it outputs is the padding character. Specifically, these are the changes I made from the tutorial:
Input is now a transposed matrix of shape sequence x batch_size, where each column is a sequence.
The encoder is changed to handle batch inputs like this:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, batch_size, n_layers=1):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        self.embedding = nn.Embedding(input_size, hidden_size, padToken)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, self.batch_size, 1)  # sequence length x batch size x hidden_size
        output = embedded
        for i in range(self.n_layers):
            output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        result = Variable(torch.zeros(1, self.batch_size, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result
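As a quick shape check on the encoder step above: the comment in forward expects a tensor of sequence length x batch size x hidden_size, but `view(1, self.batch_size, 1)` can only succeed when hidden_size is 1, so a `-1` in the last position is presumably what was intended. The sketch below assumes a recent PyTorch release (plain tensors instead of Variable); the sizes and the `pad_token` value are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
vocab_size, hidden_size, batch_size, pad_token = 20, 8, 5, 0

# The third positional argument of nn.Embedding is padding_idx.
embedding = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token)
gru = nn.GRU(hidden_size, hidden_size)

# One time step of a batch: a vector of batch_size token indices.
step_input = torch.randint(1, vocab_size, (batch_size,))
hidden = torch.zeros(1, batch_size, hidden_size)

# -1 lets view() infer the last dimension (hidden_size).
embedded = embedding(step_input).view(1, batch_size, -1)
output, hidden = gru(embedded, hidden)

print(embedded.shape)  # torch.Size([1, 5, 8])
print(output.shape)    # torch.Size([1, 5, 8])
```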
I found converting the AttentionDecoder to batched mode tricky. After considerable tinkering with matrix sizes, this is what I came up with:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, batch_size=5, n_layers=1, dropout_p=0.1, max_length=25):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout_p = dropout_p
        self.max_length = max_length
        self.batch_size = batch_size
        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input)
        embedded = self.dropout(embedded)  # 1 x batch_size x hidden_size
        dt = torch.cat((embedded, hidden), 2)
        attn_weights = self.attn(dt[0])
        attn_weights = F.softmax(attn_weights)
        encoder_outputs = encoder_outputs.transpose(0, 1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(1),  # bmm  matmul
                                 encoder_outputs)  # should be batch_size x 1 x hidden_size
        embedded = embedded.transpose(0, 1)
        output = torch.cat((embedded, attn_applied), 2)  # batch_size x 1 x hidden*2
        output = self.attn_combine(output.squeeze(1)).unsqueeze(1)  # batch_size x 1 x hidden
        output = output.transpose(0, 1)
        for i in range(self.n_layers):
            output = F.relu(output)
            output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]))
        return output, hidden, attn_weights

    def initHidden(self):
        result = Variable(torch.zeros(1, self.batch_size, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result
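The attention step in the decoder above can be checked in isolation. This is only a shape sketch with random tensors and hypothetical sizes, assuming a recent PyTorch release. Two things it makes explicit: softmax is given `dim=1` so that it normalises over source positions (calling it without `dim` is deprecated), and `encoder_outputs` has to be padded to `max_length` along its first dimension for the `bmm` shapes to line up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, hidden_size, max_length = 5, 8, 25  # hypothetical sizes

embedded = torch.randn(1, batch_size, hidden_size)   # one decoder step
hidden = torch.randn(1, batch_size, hidden_size)
encoder_outputs = torch.randn(max_length, batch_size, hidden_size)
attn = nn.Linear(hidden_size * 2, max_length)

# Scores over source positions for each batch element: batch x max_length.
scores = attn(torch.cat((embedded, hidden), 2)[0])
attn_weights = F.softmax(scores, dim=1)  # each row sums to 1

# bmm: (batch x 1 x max_length) @ (batch x max_length x hidden)
context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs.transpose(0, 1))

print(context.shape)  # torch.Size([5, 1, 8])
```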
I am trying to implement scheduled sampling during training (using teacher forcing with probability 0.5). Teacher forcing is simple to implement: we feed the next real output as the input. But when the generated output is used as the next input, I do it like this:
# Without teacher forcing: use its own predictions as the next input
for di in range(max_length):
    decoder_output, decoder_hidden, decoder_attention = decoder(
        decoder_input, decoder_hidden, encoder_outputs)
    topv, topi = decoder_output.data.topk(1)
    nis = []
    for nit in topi:
        ni = nit[0]
        nis.append(ni)
    decoder_input = Variable(torch.LongTensor([nis]))
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input
    for ct in range(batch_size):
        loss += criterion(decoder_output[ct], target_variables[di][ct])
loss = loss / batch_size
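For comparison, the same step can be written without the Python loops over the batch. This is a sketch, not a diagnosis: it assumes a recent PyTorch release, uses random log-probabilities in place of a real decoder, and the sizes and `pad_token` value are hypothetical. It also passes `ignore_index` to `nn.NLLLoss` so that padded target positions do not contribute to the loss, which is one possible (unconfirmed) reason a model drifts toward emitting only padding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, output_size, pad_token = 5, 20, 0  # hypothetical sizes

# Stand-in for one decoder step's log-probabilities: batch x output_size.
decoder_output = F.log_softmax(torch.randn(batch_size, output_size), dim=1)
targets = torch.tensor([3, 7, pad_token, 2, pad_token])  # targets for this step

# Greedy choice for the whole batch at once, reshaped to 1 x batch.
topi = decoder_output.argmax(dim=1)
decoder_input = topi.unsqueeze(0).detach()

# NLLLoss takes the whole batch; ignore_index skips padded targets
# and the mean is taken over the remaining positions only.
criterion = nn.NLLLoss(ignore_index=pad_token)
loss = criterion(decoder_output, targets)
```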
Is my implementation correct so far? During training, the generated output quickly degenerates into pad tokens and nothing else. I have looked at related implementations, but each has one or two limitations that keep me from following it:

@spro’s batched seq2seq tutorial

LuongAttnDecoderRNN is not batched; it runs one step at a time.
While the notebook mentions scheduled sampling, only teacher forcing is implemented, not the case without it.
The tutorial uses PackedSequence (also this discussion), but I could not understand why it runs the encoding step only once. Is passing one timestep of each sequence in the batch to nn.GRU, row by row, the same as passing all timesteps of each sequence at once?
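The last question can be tested directly. The sketch below, with hypothetical sizes and random inputs on a recent PyTorch release, runs an `nn.GRU` once over a whole sequence and then again one timestep at a time while carrying the hidden state forward, and compares the two results.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch_size, hidden_size = 6, 4, 8  # hypothetical sizes

gru = nn.GRU(hidden_size, hidden_size)
inputs = torch.randn(seq_len, batch_size, hidden_size)
h0 = torch.zeros(1, batch_size, hidden_size)

# One call over all time steps.
full_out, full_h = gru(inputs, h0)

# One call per time step, carrying the hidden state forward.
h = h0
step_outs = []
for t in range(seq_len):
    out, h = gru(inputs[t:t + 1], h)  # slice keeps the leading time dimension
    step_outs.append(out)
step_out = torch.cat(step_outs, 0)

# True when both ways agree up to floating-point tolerance.
same = (torch.allclose(full_out, step_out, atol=1e-5)
        and torch.allclose(full_h, h, atol=1e-5))
print(same)
```

Because the encoder has all of its inputs available up front, a single call over the full sequence is enough. The decoder differs when it feeds back its own predictions, since each input depends on the previous step's output.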

It would be great if someone could help me understand this implementation. Also, if I had used PackedSequence, how would I have modified the above?
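For context on the PackedSequence question, here is a minimal sketch of how a padded batch is usually passed through a GRU with `pack_padded_sequence` and `pad_packed_sequence`. It is not taken from either tutorial; the vocabulary size, `pad_token` value, and the toy batch are hypothetical, and the columns are assumed to be sorted by decreasing length.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

vocab_size, hidden_size, pad_token = 20, 8, 0  # hypothetical sizes

# seq_len x batch; each column is a sequence, padded with pad_token.
batch = torch.tensor([[4, 5, 6],
                      [7, 8, 9],
                      [3, 2, 0],
                      [1, 0, 0]])
lengths = [4, 3, 2]  # true length of each column, in decreasing order

embedding = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token)
gru = nn.GRU(hidden_size, hidden_size)

# Packing tells the GRU to skip the padded positions.
packed = pack_padded_sequence(embedding(batch), lengths)
packed_out, hidden = gru(packed)  # initial hidden state defaults to zeros

# Back to a padded seq_len x batch x hidden tensor; padded steps are zero.
encoder_outputs, out_lengths = pad_packed_sequence(packed_out)

print(encoder_outputs.shape)  # torch.Size([4, 3, 8])
```

With this approach `hidden` holds each sequence's state at its last real token rather than at a padded step. If a fixed first dimension is needed, as with the `max_length`-sized attention layer above, `pad_packed_sequence` accepts a `total_length` argument.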