I am using libtorch on Windows from inside my C++ code to run a model in eval mode, and everything runs just fine. What I am trying to do now is add FP16 support and check the performance boost on various NVIDIA cards. What I basically do is convert the entire model to FP16, something like this:
and I also convert the input data:

inputgpu = torch::autograd::make_variable(inputcpu, /*requires_grad=*/false).to(device, torch::kHalf);
I can run the code just fine, and I can see that the tensors inside the model really are HalfTensors. What surprises me, however, is that I measure exactly the same eval times as with FP32, so there is no performance difference at all, even on cards with tensor cores like the RTX 20xx series.
As there is not much libtorch C++ code out there, does anybody have an idea why I do not see even a minimal performance difference when running FP16? Does it depend on the model? Most layers I have are Conv2d, BatchNorm, etc.