Training time too high when TensorFlow code converted to PyTorch

Recreating new parameters in the forward pass (as done in the decoder) wouldn't make sense, since these parameters won't be registered in the module (and thus won't be seen by the optimizer or trained), and their repeated initialization would also add a performance penalty that could be avoided.
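Here is a minimal sketch of what I mean (the `Decoder` module and its shapes are hypothetical, not taken from your code): the parameter is registered once in `__init__`, instead of being recreated in `forward`:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Register the parameter once in __init__ so it shows up in
        # model.parameters() and is picked up by the optimizer.
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Anti-pattern to avoid: creating the parameter here instead,
        # e.g. `weight = nn.Parameter(torch.randn(...))`, would reinitialize
        # it in every forward pass, so it would never be trained, and the
        # repeated init would slow down each iteration.
        return x @ self.weight.t()

model = Decoder(16, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # sees self.weight
```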
However, I would first recommend making sure the models are actually equivalent, as described in your cross-post, since I'm unsure what the status of that debugging effort is.