Latency difference between models from two different training runs

I have a set of PyTorch classification models. I trained them with the same parameters on two different datasets:

  • A first time around 5 months ago with torch==1.9.1
  • A second time now with torch==1.12.1

I tested inference for both sets of models inside the same environment with torch==1.12.1, and somehow the newly trained models have double the latency of the older ones (15 ms vs. 30 ms).
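
Roughly how I measure latency (simplified sketch; `model` and `example_input` stand in for the real model and data):

    import time
    import torch

    def measure_latency_ms(model, example_input, n_warmup=10, n_runs=100):
        # Warm up, then average wall-clock time over repeated forward passes on CPU.
        model.eval()
        with torch.no_grad():
            for _ in range(n_warmup):
                model(example_input)
            start = time.perf_counter()
            for _ in range(n_runs):
                model(example_input)
            return (time.perf_counter() - start) / n_runs * 1000.0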

It doesn’t seem to be caused by the version I train with, as I retrained with version 1.9.1 and got a slower model there too.

Specifically, I looked into one of the models, which combines 1D convolutional, LSTM, and linear layers. Profiling it for both training runs, I saw that it is mainly the convolution and LSTM operations that got much slower. The weights have comparable means, but the older ones have a larger standard deviation (2 to 10x).
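
The per-operator breakdown and weight statistics come from something like this (simplified; `model` and `example_input` are again placeholders):

    import torch
    from torch.profiler import profile, ProfilerActivity

    def profile_forward(model, example_input):
        # Per-operator CPU time for a single forward pass.
        model.eval()
        with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
            model(example_input)
        print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

    def weight_stats(model):
        # Mean and standard deviation of each parameter tensor.
        for name, p in model.named_parameters():
            print(f"{name}: mean={p.mean().item():.3e}, std={p.std().item():.3e}")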

Another strange thing I noticed: when I retrain on the new dataset for a single epoch, I get the same latency as with the previous training, even though the training code hasn’t changed between the two.

Is there anything I might be missing, or something I forgot to upgrade when upgrading the torch packages?

I don’t think you’ve missed anything, and even if you had, it wouldn’t explain why retraining with 1.9.1 now also yields slower results.
If you are running inference on the CPU, check whether torch.set_flush_denormal(True) makes a difference.
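
Weights with a much smaller standard deviation are more likely to contain values in the denormal (subnormal) float range, and many CPUs handle those through a much slower path, which would fit the slowdown you see in the conv and LSTM kernels. A minimal sketch of what to try (the denormal count check is only an illustration):

    import torch

    # Ask the CPU to flush denormals to zero; returns False if the platform
    # does not support this setting.
    print("flush-denormal enabled:", torch.set_flush_denormal(True))

    def count_denormals(model):
        # Count parameter values smaller than the smallest normal float32.
        tiny = torch.finfo(torch.float32).tiny
        total = sum(p.numel() for p in model.parameters())
        denormal = sum(((p.detach().abs() > 0) & (p.detach().abs() < tiny)).sum().item()
                       for p in model.parameters())
        return denormal, total

If the denormal count is high for the newly trained models and near zero for the old ones, that would point to this being the cause, and flushing denormals before inference should bring the latency back down.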