The PyTorch tutorials do a great job of illustrating a bare-bones RNN by defining the input and hidden layers and manually feeding the hidden state back into the network at each step. That flexibility makes teacher forcing very easy to implement.
Question 1: How do you perform teacher forcing when using the native nn.RNN() module (since the entire sequence is fed at once)? A simple example RNN network would be:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import autograd

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size,
                 embedding_dim,
                 batch_sz,
                 hidden_size=128,
                 nlayers=1,
                 num_directions=1,
                 dropout=0.1):
        super(SimpleRNN, self).__init__()
        self.batch_sz = batch_sz
        self.hidden_size = hidden_size
        self.nlayers = nlayers
        self.encoder = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_size, nlayers, dropout=dropout)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def init_hidden(self, batch_sz=None):
        # -- defaults to the training batch size; pass 1 for single-word input
        batch_sz = self.batch_sz if batch_sz is None else batch_sz
        return autograd.Variable(torch.zeros(self.nlayers, batch_sz, self.hidden_size)).cuda()

    def forward(self, inputs, hidden):
        # -- encoder returns:
        # -- [batch_sz, seq_len, embed_dim]
        encoded = self.encoder(inputs)
        # -- rnn expects [seq_len, batch_sz, embed_dim], so swap the first two axes
        # -- (view() would only reinterpret memory and mix up the sequences)
        # -- rnn returns:
        # -- output.size() = [seq_len, batch_sz, hidden_sz]
        # -- hidden.size() = [nlayers, batch_sz, hidden_sz]
        output, hidden = self.rnn(encoded.transpose(0, 1), hidden)
        # -- decoder returns:
        # -- output.size() = [batch_sz, seq_len, vocab_size]
        output = F.log_softmax(self.decoder(output.transpose(0, 1)))
        return output, hidden
I can call the network with:
model = SimpleRNN(vocab_size, embedding_dim, batch_sz).cuda()
x_data, y_data = get_sequence_data(train_batches[0])
output, hidden = model(x_data, model.init_hidden())
Just for completeness, here are my shapes of x_data, output, and hidden:
print(x_data.size(), output.size(), hidden.size())
torch.Size([32, 80]) torch.Size([32, 80, 4773]) torch.Size([1, 32, 128])
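For Question 1, here is a minimal sketch of what I understand teacher forcing to look like when the whole sequence goes through nn.RNN in one call: the ground-truth sequence is the input, and the same sequence shifted by one word is the target, so every time step sees the true previous word rather than the model's own prediction. The toy sizes, the random `tokens` batch, and the use of `batch_first=True` are assumptions made for illustration, not part of the network above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration.
vocab_size, embedding_dim, hidden_size = 50, 16, 32
batch_sz, seq_len = 4, 10

embed = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
decoder = nn.Linear(hidden_size, vocab_size)

# Random token ids standing in for a real batch: [batch_sz, seq_len + 1].
tokens = torch.randint(0, vocab_size, (batch_sz, seq_len + 1))
inputs = tokens[:, :-1]   # ground-truth words 0 .. T-1 are fed in
targets = tokens[:, 1:]   # the model must predict words 1 .. T

# One call covers every time step; step t receives the true word t as input.
output, hidden = rnn(embed(inputs))          # output: [batch_sz, seq_len, hidden_size]
logits = decoder(output)                     # [batch_sz, seq_len, vocab_size]
log_probs = F.log_softmax(logits, dim=-1)    # normalise over the vocabulary axis

# Flatten batch and time so nll_loss sees [N, vocab_size] against [N].
loss = F.nll_loss(log_probs.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```

If this reading is right, feeding the full sequence is teacher forcing with a ratio of 1; mixing in the model's own predictions would seem to require stepping through the sequence one word at a time.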
Question 2: Would it be possible to use this SimpleRNN network to then generate a sequence word-by-word, by first feeding it a <GO_TOKEN> and iterating until an <END_TOKEN> is reached? I ask because when I run this:
x_data = autograd.Variable(torch.LongTensor([[word2idx['<GO>']]]), volatile=True).cuda()
output, hidden = model(x_data, model.init_hidden(1))
print(output, output.sum())
I get an output of all 0s, and the output.sum() = 0. I get this even after training the network and backpropagating the loss. Any ideas why?
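One possible explanation, which I have not confirmed: if log_softmax is given no dim, older PyTorch versions appear to normalise a 3-D input over its first axis. With a single <GO> token that axis has length 1, and the log-softmax over one element is always log(1) = 0. The sketch below reproduces that effect in isolation and then shows a greedy word-by-word generation loop that normalises over the vocabulary axis instead. The layer sizes, `GO_IDX`, `END_IDX`, and `MAX_LEN` are hypothetical stand-ins, and the layers are untrained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Part 1: log_softmax over an axis of length 1 is exactly zero everywhere.
logits = torch.randn(1, 1, 7)                 # [batch_sz=1, seq_len=1, vocab_size=7]
over_batch = F.log_softmax(logits, dim=0)     # normalises over the batch axis (length 1)
over_vocab = F.log_softmax(logits, dim=-1)    # normalises over the vocabulary axis
all_zero = bool((over_batch == 0).all())
prob_sum = over_vocab.exp().sum().item()

# Part 2: greedy generation with hypothetical toy layers and token ids.
vocab_size, embedding_dim, hidden_size = 20, 8, 16
GO_IDX, END_IDX, MAX_LEN = 0, 1, 15           # assumed ids for <GO> and <END>

embed = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
decoder = nn.Linear(hidden_size, vocab_size)

def generate(max_len=MAX_LEN):
    """Feed <GO>, then feed each predicted word back in until <END> or max_len."""
    words = []
    word = torch.tensor([[GO_IDX]])           # [batch_sz=1, seq_len=1]
    hidden = torch.zeros(1, 1, hidden_size)   # [nlayers, batch_sz, hidden_size]
    with torch.no_grad():                     # no gradients needed for inference
        for _ in range(max_len):
            output, hidden = rnn(embed(word), hidden)
            log_probs = F.log_softmax(decoder(output), dim=-1)
            word = log_probs.argmax(dim=-1)   # most likely next word, shape [1, 1]
            if word.item() == END_IDX:
                break
            words.append(word.item())
    return words

generated = generate()
```

The `max_len` cap is there because an untrained or poorly trained model may never emit the end token.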
Question 3: If it is not terribly inefficient, is it possible to train the SimpleRNN network above word-by-word, analogous to the PyTorch tutorial shown here (albeit there they’re training character-by-character)?
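To make Question 3 concrete, this is roughly the loop I have in mind: nn.RNN is called on one word per step, the hidden state is carried forward by hand, and a coin flip decides whether the next input is the true word or the model's own guess. It is only a sketch; the toy sizes, the random `tokens` batch, the SGD optimizer, and the `teacher_forcing_ratio` value are assumptions for illustration.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
random.seed(0)

vocab_size, embedding_dim, hidden_size = 30, 8, 16   # hypothetical toy sizes
batch_sz, seq_len = 4, 6
teacher_forcing_ratio = 0.5                          # assumed value

embed = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
decoder = nn.Linear(hidden_size, vocab_size)
params = list(embed.parameters()) + list(rnn.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

# Random token ids standing in for a real batch: [batch_sz, seq_len + 1].
tokens = torch.randint(0, vocab_size, (batch_sz, seq_len + 1))

optimizer.zero_grad()
hidden = torch.zeros(1, batch_sz, hidden_size)       # [nlayers, batch_sz, hidden_size]
word = tokens[:, :1]                                 # first word of each sequence, [batch_sz, 1]
loss = 0.0
for t in range(seq_len):
    output, hidden = rnn(embed(word), hidden)        # a single time step
    log_probs = F.log_softmax(decoder(output[:, -1]), dim=-1)   # [batch_sz, vocab_size]
    target = tokens[:, t + 1]
    loss = loss + F.nll_loss(log_probs, target)
    if random.random() < teacher_forcing_ratio:
        word = target.unsqueeze(1)                   # teacher forcing: feed the true word
    else:
        # feed the model's own prediction, detached from the graph
        word = log_probs.argmax(dim=-1, keepdim=True).detach()
loss = loss / seq_len                                # average over time steps
loss.backward()
optimizer.step()
```

This issues one small RNN call per word instead of one call per sequence, which is where I expect the inefficiency to come from.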