I’m confused to see that two weight checkpoints from the same network can have very different inference times on CPU. I’ve seen cases where one checkpoint makes inference consistently 2X slower than another checkpoint. I have not seen this kind of inconsistency on GPU.
I have prepared a minimal working example for which w2.pth makes inference 1.5X slower than w1.pth.
From my investigation, it seems that checkpoints taken earlier in training are faster at inference. Does that make sense? Are there some magic optimization tricks that I’m not aware of?
All the best,
Thanks for the nice script to reproduce this issue.
The time difference is most likely due to denormal values (nonzero float values extremely close to zero, also called subnormals), which are slower to process on many CPUs.
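To make "denormal" concrete, here is a small standard-library sketch that checks whether a Python float becomes a denormal once it is stored as float32. The helper name `is_float32_denormal` is made up for illustration; it only inspects the IEEE 754 bit pattern and is not part of PyTorch.

```python
import struct


def is_float32_denormal(x: float) -> bool:
    """Return True if x, rounded to float32, is a nonzero denormal.

    A float32 denormal has all exponent bits equal to zero and a
    nonzero mantissa. Hypothetical helper, for illustration only.
    """
    # Round x to float32 and reinterpret the 4 bytes as an unsigned int.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits
    return exponent == 0 and mantissa != 0


for value in (1.0, 1.2e-38, 1e-40, 0.0):
    print(value, is_float32_denormal(value))
```

The smallest normal float32 is roughly 1.18e-38, so weights that have decayed below that magnitude during training without reaching exactly zero fall into this range.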
If you set torch.set_flush_denormal(True), you’ll see that both checkpoints are faster to process and now take approximately the same time.
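A rough sketch of how one might check a checkpoint for denormals and then enable flushing. `count_denormals` is a hypothetical helper, not a PyTorch function, and the demo dictionary stands in for a real state_dict; for the files from this thread you would pass something like `torch.load("w2.pth", map_location="cpu")`, assuming the file holds a plain state_dict.

```python
import torch


def count_denormals(state_dict):
    """Count nonzero floating-point entries whose magnitude is below
    the smallest normal value of their dtype (hypothetical helper)."""
    total = 0
    for tensor in state_dict.values():
        # Skip non-tensor entries and integer/bool tensors.
        if not torch.is_tensor(tensor) or not tensor.is_floating_point():
            continue
        tiny = torch.finfo(tensor.dtype).tiny  # smallest positive normal
        magnitude = tensor.detach().abs()
        total += int(((magnitude > 0) & (magnitude < tiny)).sum())
    return total


# Synthetic stand-in for a checkpoint: 1e-40 is a float32 denormal.
demo = {"weight": torch.tensor([1e-40, 0.0, 0.5])}
n_denormal = count_denormals(demo)
print("denormal entries:", n_denormal)

# Count first, then enable flushing: once flushing is on, denormals are
# treated as zero. The call returns True only if the CPU supports it.
supported = torch.set_flush_denormal(True)
print("flush-to-zero enabled:", supported)
```

If the second checkpoint reports many more denormal entries than the first, that is consistent with the slowdown described above; timing inference before and after the `torch.set_flush_denormal(True)` call is the direct way to confirm it.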
Thanks for your answer. That’s the kind of obscure quirk I was hoping to learn about. I assume this is not documented. Are there other such undocumented quirks?
I’m not sure if it’s documented. I only know of its existence because I also spent some time debugging this issue.
That’s hard to tell, but please post anything you find suspicious.