A UNet with very small weights takes a long time for inference on CPU

We have trained a UNet network.
Inference on a GPU takes 6 milliseconds per image; the same image on a CPU takes 4.5 seconds.
We noticed that the trained network has many very small weights (on the order of 1e-42).
After manually setting all of those weights to 0, the network runs in 290 milliseconds on the CPU, with the same predictive performance.
You can view the network here with https://netron.app/
Any idea how to fix this without manually setting the weights to 0?
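For context, values around 1e-42 are below the smallest normal float32 (about 1.18e-38), so they are stored as IEEE-754 subnormals, which many CPUs handle on a much slower microcode path. A minimal pure-Python check (the helper `is_subnormal_f32` is just for illustration) confirms this:

```python
import struct

def is_subnormal_f32(x: float) -> bool:
    """Return True if x, stored as IEEE-754 float32, is subnormal:
    exponent bits all zero but mantissa non-zero."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return exponent == 0 and mantissa != 0

print(is_subnormal_f32(1e-42))  # True: weights of this magnitude are subnormal
print(is_subnormal_f32(1e-3))   # False: a normal-range weight
```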

I assume one of the workloads is using the CPU? (You mention the GPU in both cases.)
Assuming the slower case is executed on the CPU, you could use torch.set_flush_denormal(True) to disable denormal numbers on the CPU, which should avoid the (potentially slower) denormal code path.
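A minimal sketch of how this could look before running inference (the model/input names are placeholders; the call returns True only on CPUs that support flushing denormals, e.g. x86 with SSE3):

```python
import torch

# Enable flush-to-zero / denormals-are-zero for CPU float operations.
# Subnormal values are then treated as zero, avoiding the slow path.
supported = torch.set_flush_denormal(True)
print(supported)  # True if the CPU supports flushing denormal numbers

# Hypothetical inference loop afterwards:
# model.eval()
# with torch.no_grad():
#     output = model(image)
```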

Yes, the slower case was on the CPU (I have edited the typo). Using torch.set_flush_denormal(True) solves the slow-inference issue. Thank you for your support!