I am using libtorch on Windows from my C++ code to run a model in eval mode, and everything works fine. What I am trying to do now is add FP16 support and measure the performance boost on various NVIDIA cards. Basically, I convert the entire model to FP16, something like
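A minimal sketch of what I mean (model path and input shape are placeholders; this assumes a TorchScript module loaded via `torch::jit::load`):

```cpp
#include <torch/script.h>

int main() {
    // Load a TorchScript model ("model.pt" is a placeholder path).
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.to(torch::kCUDA);
    module.to(torch::kHalf);  // converts all parameters and buffers to FP16
    module.eval();

    // Inference only, so disable gradient tracking.
    torch::NoGradGuard no_grad;

    // The input must be converted to FP16 as well, or forward() will
    // fail with a dtype mismatch.
    auto input = torch::randn(
        {1, 3, 224, 224},
        torch::TensorOptions().device(torch::kCUDA).dtype(torch::kHalf));

    auto output = module.forward({input}).toTensor();
}
```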
The code runs just fine, and I can see that the tensors inside the model really are "HalfTensors". What surprises me, however, is that I get exactly the same eval times as with FP32, so there is no performance difference even on cards with Tensor Cores like the RTX 20xx series.
As there is not much libtorch C++ code out there, does anybody have an idea why I do not see even a minimal performance difference when running FP16? Does it depend on the model? Most of my layers are Conv2d, BatchNorm, etc.
The speedup depends on the model and, potentially, on other bottlenecks.
E.g., even if the model itself gets a speedup, the run might be bottlenecked by another part of the code, which would hide the speedup.
To check for Tensor Core usage, you could use PyProf (Python only) or nsys.
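For the nsys route, a possible invocation would look something like this (`./my_inference_app` is a placeholder for your binary; on Volta/Turing, Tensor Core cuDNN/cuBLAS kernels typically have substrings like `884` or `1688` in their names):

```shell
# Trace CUDA, cuDNN, and cuBLAS activity during inference
# (./my_inference_app is a placeholder binary name)
nsys profile -t cuda,cudnn,cublas -o fp16_report ./my_inference_app

# Summarize the captured kernels; look for Tensor Core kernel names
# (e.g. containing "884" or "1688") in the GPU kernel statistics
nsys stats fp16_report.nsys-rep
```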
Thanks @ptrblck for your prompt response. I am actually using FP16 for inference, since that is the time-critical part. I do see a reduction in memory footprint (less than the expected 2x), but inference time is actually about 2x longer. Could this be related to the GPU's mixed-precision support?