Lost 7% accuracy in dependency parsing after updating to PyTorch 0.4

Hi all,
I implemented Dyer’s stack-LSTM dependency parser (with in-house modifications) in PyTorch 0.3.0.post4 and achieved results (UAS, LAS) very close to the paper.
Recently, I updated PyTorch to 0.4.1 and followed the migration guide it provides.
However, I got the following:

  1. UAS dropped by almost 7%.
  2. Training time per epoch was cut roughly in half (2x faster).

It’s very weird; I’ve checked my code many times over almost 4 days but found nothing suspicious.
I can’t share my code for now because it’s shared with other people, but I’ll write down the details here as much as possible.
The model has the following modules & features:

  1. consists of four 2-layer LSTMs (no truncation)
  2. in-house Tree-LSTMs
  3. layer normalization between layers
  4. dropout on all non-recurrent modules
  5. pre-trained embeddings
  6. uniform initialization (e.g. param.data.uniform_(-0.1, 0.1)) except for the layer norms
  7. Adadelta optimizer with weight decay
  8. a lot of split and cat operations
  9. KLDivLoss with label smoothing (sketched right after this list)
  10. a very dynamic computational graph, because batching runs over trees
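For reference, the label-smoothing loss (item 9) looks roughly like this; the names and the smoothing value here are placeholders, not my exact code:

    import torch
    import torch.nn.functional as F

    def label_smoothed_kldiv(logits, gold, n_classes, smoothing=0.1):
        # logits: (batch, n_classes), gold: (batch,) with gold class indices
        log_probs = F.log_softmax(logits, dim=1)
        with torch.no_grad():
            # smoothed one-hot target distribution
            target = torch.full_like(log_probs, smoothing / (n_classes - 1))
            target.scatter_(1, gold.unsqueeze(1), 1.0 - smoothing)
        # summed (not averaged) KL divergence
        return F.kl_div(log_probs, target, reduction="sum")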

What I changed for the migration to 0.4.1:

  1. Removed all Variable wrappers and volatile expressions & flags.
  2. Changed loss.data[0] to loss.item()
  3. Changed loss(size_average=False) to loss(reduction="sum")
  4. Changed pre-trained embedding loading to nn.Embedding.from_pretrained(torch.load(pre), freeze=True) (sketched just below)
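Concretely, items 2-4 now look roughly like this (the embedding tensor below is a random stand-in for the torch.load(pre) call I actually use):

    import torch
    import torch.nn as nn

    # item 3: nn.KLDivLoss(size_average=False)  ->  reduction="sum"
    criterion = nn.KLDivLoss(reduction="sum")

    # item 4: load a (vocab_size, dim) float tensor and keep it frozen
    pretrained = torch.randn(10000, 100)       # stand-in for torch.load(pre)
    emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

    # item 2, inside the training loop:
    #     loss = criterion(log_probs, smoothed_targets)
    #     running_loss += loss.item()          # was loss.data[0] in 0.3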

My first guess was that some disconnection occurred in the computational graph, since the training time dropped so dramatically, but I can’t find any other sign that this happened.
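To show what I mean (a toy snippet, not from my model): an accidental detach, or an old-style .data access, cuts the graph silently, so backward runs without any error and simply gets cheaper.

    import torch

    w = torch.randn(3, requires_grad=True)   # parameter that keeps its grad path
    x = torch.randn(3, requires_grad=True)   # parameter that silently loses it

    h = (x * 2).detach()                     # accidental detach / .data-style access
    loss = (w * h).sum()                     # the graph now only reaches back to w
    loss.backward()                          # no error, no warning

    print(w.grad)                            # a tensor
    print(x.grad)                            # None: x fell out of the graph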
Does anyone have any clues?

Many thanks.

Did you investigate your guess regarding the disconnections in the computation graph?
You could check if all parameters have valid gradients using print(model.layer.weight.grad).
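For example, a quick loop right after loss.backward() (assuming model is your top-level nn.Module):

    def check_grads(model):
        # call this right after loss.backward()
        for name, p in model.named_parameters():
            if p.grad is None:
                print(name, "has no grad at all")
            elif (p.grad == 0).all():
                print(name, "has an all-zero grad")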

Yes, all layers have grads. But this can’t tell whether some intermediate tensors lost their connection to the graph.
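The closest thing I can do is probe specific intermediate tensors by hand; h below is just a placeholder for some tensor created in the forward pass:

    def probe(h, name="tensor"):
        # a tensor that is still attached to the graph has a grad_fn
        print(name, "requires_grad:", h.requires_grad, "grad_fn:", h.grad_fn)
        if h.requires_grad:
            # fires during backward only if gradient actually flows back to h
            h.register_hook(lambda g: print(name, "received grad, norm:", g.norm().item()))
        return h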