Sorry, it’s a little tough to comment on StackOverflow, so I’m going to discuss the answer here.
As for PyTorch, the GRU is set up with batch_first=True (note that the default is actually False, per https://pytorch.org/docs/stable/nn.html#gru), so this line should be right:
batch_size, sequence_len, hidden_size = output.shape
I totally agree that .view() is somewhat dangerous, but it was the fastest hack to get the tensor into the shape I need.
As for this line:
output = output.contiguous().view(batch_size * sequence_len, hidden_size)
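As an aside, here’s a minimal sketch of that reshape with toy dimensions (not the notebook’s actual hyperparams); .reshape() is a slightly safer spelling, since it only copies when the memory layout forces it to, whereas .view() raises on a non-contiguous tensor:

import torch
import torch.nn as nn

# Toy dimensions, purely for illustration.
batch_size, sequence_len, input_size, hidden_size = 4, 7, 10, 16

gru = nn.GRU(input_size, hidden_size, batch_first=True)
x = torch.randn(batch_size, sequence_len, input_size)
output, h_n = gru(x)  # output: (batch, seq, hidden) with batch_first=True

# Flatten batch and time into one dimension before the classifier.
flat = output.contiguous().view(batch_size * sequence_len, hidden_size)
assert torch.equal(flat, output.reshape(batch_size * sequence_len, hidden_size))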
The hidden_size is what’s declared in the linear layer’s initialization, self.classifier = nn.Linear(hidden_size, vocab_size), which does make sense, since the feedforward layer should take the GRU’s hidden_size as input and emit vocabulary-size logits for the language-modelling task.
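To make the shape bookkeeping concrete, here’s a sketch with toy numbers (the real values live in the Hyperparams below):

import torch
import torch.nn as nn

batch_size, sequence_len, hidden_size, vocab_size = 4, 7, 16, 100

output = torch.randn(batch_size, sequence_len, hidden_size)  # stand-in for the GRU output
classifier = nn.Linear(hidden_size, vocab_size)              # i.e. self.classifier

# Flatten, classify, then compare against the flattened target tokens.
logits = classifier(output.reshape(-1, hidden_size))         # (batch*seq, vocab)
targets = torch.randint(vocab_size, (batch_size * sequence_len,))
loss = nn.CrossEntropyLoss()(logits, targets)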
My suspicion is that when I restructure the output for self.classifier(output), I might not be using the right shape, and that I should be permuting back to the right dimensions instead.
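If that’s the culprit, a permute-based alternative could look something like this sketch (same toy dimensions as above); nn.CrossEntropyLoss also accepts (N, C, d1)-shaped logits, so the batch/sequence structure never has to be flattened:

import torch
import torch.nn as nn

batch_size, sequence_len, hidden_size, vocab_size = 4, 7, 16, 100
output = torch.randn(batch_size, sequence_len, hidden_size)
classifier = nn.Linear(hidden_size, vocab_size)

# nn.Linear acts on the last dimension, so the 3D tensor goes in directly.
logits = classifier(output)                                  # (batch, seq, vocab)

# CrossEntropyLoss wants the class dimension second for >2D inputs,
# so permute (batch, seq, vocab) -> (batch, vocab, seq).
targets = torch.randint(vocab_size, (batch_size, sequence_len))
loss = nn.CrossEntropyLoss()(logits.permute(0, 2, 1), targets)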
BTW, I’ve updated the code at https://www.kaggle.com/alvations/gru-language-model-not-training-properly, and it looks like, with very careful tuning, I’m able to get the model to generate something sort of meaningful:
hyperparams = Hyperparams(embed_size=250, hidden_size=250, num_layers=1,
                          loss_func=nn.CrossEntropyLoss,
                          learning_rate=0.0003, optimizer=optim.Adam, batch_size=200)
dataloader, model, optimizer, criterion = initialize_data_model_optim_loss(hyperparams)
train(5000, dataloader, model, criterion, optimizer)
generate_example(model)
[out]:
the null hypothesis is never true . </s>
But it still baffles me why that should be the case, i.e. why the model only trains properly when the hyperparams are “suitable” and it’s trained long enough.