I was developing a new model on machines running PyTorch 0.4.1 and then moved the code to machines running 1.0.0, where the models performed significantly worse. I controlled for the changes to the default initializations in nn.Linear and nn.Conv2d by forcing all layers to initialize with the 0.4.1 defaults. I also pulled the 0.4.1 version of Adam into my codebase so that all models used the same optimizer.
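For concreteness, this is roughly how I forced the old defaults. It assumes the 0.4.1 scheme was uniform in ±1/sqrt(fan_in) for both weights and biases, which is what I read off the 0.4.1 source; corrections welcome if I misread it:

```python
import math
import torch.nn as nn

def init_like_0_4_1(module):
    # Re-apply (what I believe was) the 0.4.1 default init:
    # uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in)] for weight and bias.
    if isinstance(module, nn.Linear):
        fan_in = module.weight.size(1)
    elif isinstance(module, nn.Conv2d):
        fan_in = module.in_channels
        for k in module.kernel_size:
            fan_in *= k
    else:
        return
    stdv = 1.0 / math.sqrt(fan_in)
    module.weight.data.uniform_(-stdv, stdv)
    if module.bias is not None:
        module.bias.data.uniform_(-stdv, stdv)

# applied with model.apply(init_like_0_4_1) right after construction
```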
The performance difference seems to be related to numerical stability, particularly in F.log_softmax and nn.CrossEntropyLoss (which, as I understand it, is implemented as F.log_softmax followed by F.nll_loss). I believe this for the following reasons.
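A quick sanity check of that assumption, in case my understanding of the implementation is wrong; if the two losses share a code path, any instability in F.log_softmax should show up in both:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))

# If cross_entropy == nll_loss(log_softmax(...)), this difference is ~0.
a = F.cross_entropy(logits, targets)
b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print((a - b).abs().item())
```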
I set up a pair of conda environments on the same machine that differed only in their PyTorch version, and ran two models with identical code in each env. Within a given env, the only difference between the two models was that one cast the “logit tensors” to float64 before computing losses based on F.log_softmax and nn.CrossEntropyLoss, while the other kept them in float32. After computing the losses, I always cast back to float32 before summing the losses and doing the backward pass. I.e., I ran four models: (0.4.1, float32), (0.4.1, float64), (1.0.0, float32), and (1.0.0, float64). Both 0.4.1 models performed the same across all values I monitor during training. The float64 1.0.0 model significantly outperformed the float32 1.0.0 model, and both 0.4.1 models significantly outperformed the float64 1.0.0 model. The gap between the 1.0.0 and 0.4.1 models was BIG.
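A minimal sketch of the casting setup, with a toy model standing in for mine, just to show where the casts happen:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for my actual model/data; only the cast placement matters.
model = nn.Linear(32, 10)
inputs = torch.randn(16, 32)
targets = torch.randint(0, 10, (16,))

logits = model(inputs)                     # float32 "logit tensor"
logits = logits.to(torch.float64)          # the float32 variant skips this cast
loss = F.cross_entropy(logits, targets)    # loss computed in float64
loss = loss.to(torch.float32)              # cast back before summing/backward
loss.backward()
```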
I am running more tests to see how controlling the magnitude/variance of the inputs to F.log_softmax and nn.CrossEntropyLoss differentially affects performance across PyTorch versions, and so far these seem to support my belief that numerical instability is the main cause of the differences across versions.
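For the record, one of the probes I'm using: scale the logits and compare float32 F.log_softmax against a float64 reference at each magnitude, once per env:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 1000)

# Larger logit magnitudes should stress the numerics harder; comparing the
# printed errors across the two envs is what I'm actually interested in.
for scale in [1, 10, 100, 1000]:
    xs = x * scale
    out32 = F.log_softmax(xs, dim=1)
    out64 = F.log_softmax(xs.double(), dim=1)
    err = (out32.double() - out64).abs().max().item()
    print("scale=%5d  max abs error vs float64: %.3e" % (scale, err))
```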
What changes were made to the numerical implementation of these functions between 0.4.1 and 1.0.0? Do you have any advice for more conclusively determining what’s causing the performance differences across PyTorch versions?
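In case it helps, here is the kind of standalone probe I was planning to run unchanged in both envs to isolate the loss computation itself (the file name is arbitrary; pinning the inputs on disk keeps any RNG differences between versions from leaking in):

```python
import os
import torch
import torch.nn.functional as F

# Generate the fixed inputs once; both envs then load the same file.
if not os.path.exists("probe.pt"):
    torch.save({"logits": torch.randn(64, 1000),
                "targets": torch.randint(0, 1000, (64,))}, "probe.pt")
data = torch.load("probe.pt")

logits = data["logits"].clone().requires_grad_(True)
targets = data["targets"]

loss = F.cross_entropy(logits, targets)
loss.backward()

# Diff these printed values across the two environments.
print(torch.__version__)
print("loss      :", repr(loss.item()))
print("grad norm :", repr(logits.grad.norm().item()))
```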