GRU and LSTM generation of sequences

Dear PyTorch community,

I have several questions about best practices for using recurrent networks in PyTorch to generate sequences.

First, if I want to build a decoder network, should I use nn.GRU (or nn.LSTM) instead of nn.LSTMCell (nn.GRUCell)? In my experience, computation with nn.LSTMCell is dramatically slower (up to 100 times) than with nn.LSTM. Maybe this is related to the cuDNN optimisation of the LSTM (and GRU) modules? Is there any way to speed up LSTMCell computations?
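To make the comparison concrete, here is a minimal sketch of the two variants I mean. The sizes are arbitrary and it runs on CPU only to stay self-contained; the weights of the cell are copied from the fused module just to show that both compute the same recurrence. As far as I understand, the difference is then in how the work is dispatched (one fused call versus a Python loop with one small call per time step), not in the math.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, input_size, hidden_size = 4, 10, 8, 16
x = torch.randn(batch, seq_len, input_size)

# Variant A: one call over the whole sequence (eligible for cuDNN on GPU).
gru = nn.GRU(input_size, hidden_size, batch_first=True)

# Variant B: a cell that is stepped manually in a Python loop.
cell = nn.GRUCell(input_size, hidden_size)

# Share the parameters so both variants compute the same function.
with torch.no_grad():
    cell.weight_ih.copy_(gru.weight_ih_l0)
    cell.weight_hh.copy_(gru.weight_hh_l0)
    cell.bias_ih.copy_(gru.bias_ih_l0)
    cell.bias_hh.copy_(gru.bias_hh_l0)

    out_fused, h_fused = gru(x)  # out_fused: (batch, seq_len, hidden_size)

    h = torch.zeros(batch, hidden_size)
    steps = []
    for t in range(seq_len):
        h = cell(x[:, t, :], h)  # one time step per call
        steps.append(h)
    out_loop = torch.stack(steps, dim=1)  # (batch, seq_len, hidden_size)

same = torch.allclose(out_fused, out_loop, atol=1e-5)
```

Note that nn.GRU can also be called with a sequence of length 1, so stepping manually does not strictly require switching to nn.GRUCell.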

I am trying to build an autoencoder that accepts sequences of variable length. My autoencoder looks like this:

import torch.nn as nn

class SimpleAutoencoder(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=3):
        super(SimpleAutoencoder, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.gru_encoder = nn.GRU(input_size, hidden_size, n_layers, batch_first=True)
        self.gru_decoder = nn.GRU(input_size, hidden_size, n_layers, batch_first=True)
        self.h2o = nn.Linear(hidden_size, input_size)  # Hidden to output

    def encode(self, input):
        # Passing None initialises the hidden state with zeros.
        output, hidden = self.gru_encoder(input, None)
        return output, hidden

    def decode(self, input, hidden):
        # hidden is the encoder's final state (the latent representation).
        output, hidden = self.gru_decoder(input, hidden)
        return output, hidden

    def h2o_apply(self, input):
        # Project decoder hidden states back to the input (symbol) space.
        return self.h2o(input)

My training loop looks like:

one_hot_batch = list(map(lambda x:Variable(torch.FloatTensor(x)),one_hot_batch))

packed_one_hot_batch = pack_padded_sequence(pad_sequence(one_hot_batch,batch_first=True).cuda(),batch_lens, batch_first=True)

_, latent = vae.encode(packed_one_hot_batch)
outputs, _ = vae.decode(packed_one_hot_batch, latent)
packed = pad_packed_sequence(outputs,batch_first=True)

for string, length, index in zip(*packed, range(batch_size)):
    decoded_string_without_sos_symbol = vae.h2o_apply(string[1:length])
    loss += criterion(decoded_string_without_sos_symbol,
loss /= len(batch)

As far as I understand, training this way is teacher forcing, because at the decoding stage the network is fed the real inputs (outputs, _ = vae.decode(packed_one_hot_batch, latent)). For my task, however, this means that at test time the network generates sequences well only if I feed it the real symbols (as in training). If I feed the output of the previous step back in as the next input, the network generates rubbish (just an infinite repetition of one specific symbol).
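For reference, the free-running generation I mean at test time looks roughly like the sketch below. The function name greedy_decode, the sos_index argument, the greedy argmax choice, and the toy sizes are all assumptions made for illustration; decoder and h2o stand in for vae.gru_decoder and vae.h2o.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def greedy_decode(decoder, h2o, hidden, sos_index, max_len):
    """Free-running decoding: each step is fed the previous prediction."""
    batch_size = hidden.size(1)
    vocab_size = h2o.out_features
    # Start every sequence from a one-hot SOS symbol: (batch, 1, vocab).
    step_input = torch.zeros(batch_size, 1, vocab_size, device=hidden.device)
    step_input[:, 0, sos_index] = 1.0
    symbols = []
    with torch.no_grad():
        for _ in range(max_len):
            # nn.GRU accepts a length-1 sequence, so it can be stepped manually.
            out, hidden = decoder(step_input, hidden)  # out: (batch, 1, hidden)
            logits = h2o(out[:, 0, :])                 # (batch, vocab)
            idx = logits.argmax(dim=-1)                # greedy choice (assumption)
            symbols.append(idx)
            # The prediction becomes the next input, as a one-hot vector.
            step_input = F.one_hot(idx, vocab_size).float().unsqueeze(1)
    return torch.stack(symbols, dim=1)                 # (batch, max_len)

# Toy usage with hypothetical sizes and an untrained decoder.
torch.manual_seed(0)
vocab_size, hidden_size, n_layers, batch_size = 6, 16, 2, 3
decoder = nn.GRU(vocab_size, hidden_size, n_layers, batch_first=True)
h2o = nn.Linear(hidden_size, vocab_size)
latent = torch.zeros(n_layers, batch_size, hidden_size)
generated = greedy_decode(decoder, h2o, latent, sos_index=0, max_len=7)
```

During training the decoder never sees its own predictions as inputs, so this loop puts it in a regime it was not trained on, which might explain the repeated symbol.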

I also tried another approach: I generated “fake” inputs (just ones), so that the model has to generate from the hidden state alone.

one_hot_batch_fake = list(map(lambda x:torch.ones_like(x).cuda(),one_hot_batch))
packed_one_hot_batch_fake = pack_padded_sequence(
    pad_sequence(one_hot_batch_fake, batch_first=True).cuda(), batch_lens, batch_first=True)

_, latent = vae.encode(packed_one_hot_batch)
outputs, _ = vae.decode(packed_one_hot_batch_fake, latent)
packed = pad_packed_sequence(outputs,batch_first=True)

It works, but poorly: the quality of reconstruction is very low. So my second question is: what is the right way to generate sequences from a latent representation?

I suppose a good idea is to apply teacher forcing with some probability. But how can one use the nn.GRU layer so that the output of the previous step becomes the input for the next step?
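To make the question concrete, here is a minimal sketch of what I have in mind, not a tested recipe. It steps nn.GRU one time step at a time and, at each step, feeds either the real symbol or the model's own prediction. The name decode_mixed, the teacher_forcing_prob parameter, and the toy sizes are hypothetical, and it uses plain padded tensors instead of packed sequences, so padding positions would still need to be masked out of the loss.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def decode_mixed(decoder, h2o, targets, hidden, teacher_forcing_prob=0.5):
    """targets: one-hot (batch, seq_len, vocab), position 0 is the SOS symbol."""
    batch_size, seq_len, vocab_size = targets.shape
    step_input = targets[:, :1, :]  # SOS as the first input: (batch, 1, vocab)
    all_logits = []
    for t in range(1, seq_len):
        out, hidden = decoder(step_input, hidden)  # single time step
        logits = h2o(out[:, 0, :])                 # (batch, vocab)
        all_logits.append(logits)
        if random.random() < teacher_forcing_prob:
            # Teacher forcing: feed the real symbol at position t.
            step_input = targets[:, t:t + 1, :]
        else:
            # Free running: feed the model's own prediction (no gradient
            # flows through argmax).
            pred = logits.argmax(dim=-1)
            step_input = F.one_hot(pred, vocab_size).to(targets.dtype).unsqueeze(1)
    return torch.stack(all_logits, dim=1)          # (batch, seq_len - 1, vocab)

# Toy usage with hypothetical sizes.
torch.manual_seed(0)
random.seed(0)
vocab_size, hidden_size, n_layers, batch_size, seq_len = 6, 16, 2, 3, 6
decoder = nn.GRU(vocab_size, hidden_size, n_layers, batch_first=True)
h2o = nn.Linear(hidden_size, vocab_size)
idx = torch.randint(0, vocab_size, (batch_size, seq_len))
targets = F.one_hot(idx, vocab_size).float()
latent = torch.zeros(n_layers, batch_size, hidden_size)

logits = decode_mixed(decoder, h2o, targets, latent, teacher_forcing_prob=0.5)
# Predict symbols 1..seq_len-1; the SOS position is not a target.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), idx[:, 1:].reshape(-1))
loss.backward()
```

The per-step loop gives up the single fused call over the whole sequence, so I would expect it to be slower than passing a packed batch to nn.GRU at once.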