Latency difference between models from two different training runs

I have a set of PyTorch classification models. I trained them with the same parameters on two different datasets:

  • A first time around 5 months ago with torch==1.9.1
  • A second time now with torch==1.12.1

I tested inference for both sets of models inside the same environment with torch==1.12.1, and somehow the newly trained models have double the latency of the older ones (15 ms vs. 30 ms).
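
Roughly how I measure latency (simplified sketch; `model` and `example_input` stand in for the real model and data):

    import time
    import torch

    def measure_latency_ms(model, example_input, n_warmup=10, n_runs=100):
        # Warm up, then average wall-clock time over repeated forward passes on CPU.
        model.eval()
        with torch.no_grad():
            for _ in range(n_warmup):
                model(example_input)
            start = time.perf_counter()
            for _ in range(n_runs):
                model(example_input)
            return (time.perf_counter() - start) / n_runs * 1000.0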

It doesn’t seem to be caused by the version I train with, as I retrained with version 1.9.1 and got a slower model there too.

Specifically, I looked into one of the models, which combines 1D convolutional, LSTM, and linear layers. Profiling it for both training runs, I saw that it is mainly the convolution and LSTM operations that got much slower. The weights have comparable means, but the older ones have a larger standard deviation (2 to 10x).
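
The per-operator breakdown and weight statistics come from something like this (simplified; `model` and `example_input` are again placeholders):

    import torch
    from torch.profiler import profile, ProfilerActivity

    def profile_forward(model, example_input):
        # Per-operator CPU time for a single forward pass.
        model.eval()
        with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
            model(example_input)
        print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

    def weight_stats(model):
        # Mean and standard deviation of each parameter tensor.
        for name, p in model.named_parameters():
            print(f"{name}: mean={p.mean().item():.3e}, std={p.std().item():.3e}")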

Another strange thing I noticed: when I retrain on the new dataset for a single epoch, I get the same latency as with the previous training, even though the training code hasn’t changed between the two.

Is there anything I might be missing, or something I forgot to upgrade when upgrading the torch packages?

I don’t think you’ve missed anything, and even if you had, it wouldn’t explain why retraining with 1.9.1 now also yields slower results.
If you are running inference on the CPU, check whether torch.set_flush_denormal(True) makes a difference.
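
Weights with a much smaller standard deviation are more likely to contain values in the denormal (subnormal) float range, and many CPUs handle those through a much slower path, which would fit the slowdown you see in the conv and LSTM kernels. A minimal sketch of what to try (the denormal count check is only an illustration):

    import torch

    # Ask the CPU to flush denormals to zero; returns False if the platform
    # does not support this setting.
    print("flush-denormal enabled:", torch.set_flush_denormal(True))

    def count_denormals(model):
        # Count parameter values smaller than the smallest normal float32.
        tiny = torch.finfo(torch.float32).tiny
        total = sum(p.numel() for p in model.parameters())
        denormal = sum(((p.detach().abs() > 0) & (p.detach().abs() < tiny)).sum().item()
                       for p in model.parameters())
        return denormal, total

If the denormal count is high for the newly trained models and near zero for the old ones, that would point to this being the cause, and flushing denormals before inference should bring the latency back down.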