I’m puzzled that two weight checkpoints from the same network can have very different inference times on CPU. I’ve seen cases where one checkpoint makes inference consistently 2X slower than another checkpoint, yet I have never seen this kind of inconsistency on GPU.
From my investigation, it seems that checkpoints taken earlier in training are faster to run inference with. Does that make sense? Is there some magic optimization trick that I’m not aware of?
Thanks for the nice script to reproduce this issue.
You are most likely seeing the time difference due to denormal values (also called subnormal floats: nonzero values very close to zero), which many x86 CPUs process through a slow microcode path instead of the fast hardware path. Presumably the later checkpoints contain more near-zero weights, so they trigger this path more often.
If you set torch.set_flush_denormal(True), you’ll see that both checkpoints are faster to process and take approximately the same time now.
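To make "denormal" concrete: a float is denormal when it is nonzero but smaller in magnitude than the smallest normal value for its type, so the exponent field is zero and the hardware loses the implicit leading bit. A quick stdlib-only sketch:

```python
import sys

# Smallest *normal* positive double (~2.225e-308); any positive value
# below this threshold is denormal (subnormal).
smallest_normal = sys.float_info.min

# Dividing further down stays nonzero, but the result is subnormal.
denormal = smallest_normal / 2**10

assert denormal > 0.0                 # still representable, not flushed to zero
assert denormal < smallest_normal     # below the normal range -> denormal
```

The same idea applies to float32 weights: the normal threshold there is roughly 1.18e-38, and weights that decay below it during training become denormal.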
Thanks for your answer. That’s the kind of obscure quirk I was hoping to learn about. I assume this is not documented. Are there other such undocumented quirks?