Is there a particular reason the PyTorch word language model example doesn't use an optimizer to train the model?
The weights in the source example are updated manually, like this:
for p in model.parameters():
    p.data.add_(p.grad, alpha=-lr)
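For context, this manual update is plain SGD without momentum or weight decay, so it should match what torch.optim.SGD does with the same learning rate. Below is a minimal sketch to check that, assuming a toy nn.Linear model in place of the example's RNN model; the model, the random data, and the lr value are illustrative only and not taken from the example.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
lr = 0.1  # illustrative value, not the example's default

# Toy stand-in for the language model (assumption: a small linear layer).
model_manual = nn.Linear(4, 2)
model_optim = copy.deepcopy(model_manual)  # identical starting weights

# Random inputs and targets, for illustration only.
x = torch.randn(8, 4)
y = torch.randn(8, 2)
loss_fn = nn.MSELoss()

# 1) Manual update, as in the loop above.
model_manual.zero_grad()
loss_fn(model_manual(x), y).backward()
with torch.no_grad():
    for p in model_manual.parameters():
        p.add_(p.grad, alpha=-lr)

# 2) The same step taken by torch.optim.SGD (no momentum, no weight decay).
optimizer = torch.optim.SGD(model_optim.parameters(), lr=lr)
optimizer.zero_grad()
loss_fn(model_optim(x), y).backward()
optimizer.step()

# Compare the resulting parameters of the two models.
same = all(
    torch.allclose(a, b)
    for a, b in zip(model_manual.parameters(), model_optim.parameters())
)
print(same)
```

If the two models end up with the same weights, the manual loop can be swapped for torch.optim.SGD without changing training. Adam and RMSprop rescale each step per parameter, so a learning rate that works for this manual update will not necessarily carry over to them.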
I have tried replacing this update with different optimizers, such as Adam and RMSprop, using different learning rates, but all of them result in high loss and perplexity. None of them helps the model train better.