Libtorch FP16 question

Hi all,

I am using libtorch on Windows from inside my C++ code to run a model in eval mode, and everything runs just fine. What I am trying to do now is add FP16 support and check the performance boost on various NVIDIA cards. What I basically do is convert the entire model to FP16, something like

network->to(device, torch::kHalf);

and also the input data:

inputgpu = torch::autograd::make_variable(inputcpu, false).to(device, torch::kHalf);
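For context, here is a minimal self-contained sketch of what I am doing (the Sequential model and the input shape are just placeholders, not my real network):

#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Device device(torch::kCUDA);

  // Placeholder model with the kinds of layers I use (Conv2d, BatchNorm, ReLU).
  auto network = torch::nn::Sequential(
      torch::nn::Conv2d(torch::nn::Conv2dOptions(3, 64, 3).padding(1)),
      torch::nn::BatchNorm2d(64),
      torch::nn::ReLU());

  network->to(device, torch::kHalf);   // cast parameters and buffers to FP16
  network->eval();

  torch::NoGradGuard no_grad;          // inference only, no autograd graph
  auto inputcpu = torch::randn({1, 3, 512, 512});
  auto inputgpu = inputcpu.to(device, torch::kHalf);   // FP16 input on the GPU

  auto output = network->forward(inputgpu);
  std::cout << output.scalar_type() << std::endl;      // prints Half
}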

The code runs just fine and I can see that the tensors inside the model really are HalfTensors. What surprises me, however, is that I get exactly the same eval times as with FP32 - there is no performance difference even on cards with Tensor Cores like the RTX 20xx series.

As there is not much libtorch C++ code out there, does anybody have an idea why I do not see at least a minimal performance difference when running FP16? Does it depend on the model? Most layers I have are Conv2d, BatchNorm, etc.

Thanks.
A.

Hi Alex,

Did you find an answer to your question? I am facing the same issue right now.

Thanks,
-Omar

The speedup depends on the model and, of course, on other potential bottlenecks.
E.g. even if your model gets a speedup, your training might be bottlenecked by another part of the code, which will hide the speedup.
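To rule that out, you could time only the forward pass, e.g. with a rough sketch like the following (the model and input names are placeholders; the warm-up iterations and the device synchronization before reading the clock matter, since CUDA calls are asynchronous):

#include <torch/torch.h>
#include <chrono>

// Rough sketch: measure only the forward pass of an already converted model.
template <typename Model>
double time_forward_ms(Model& model, const torch::Tensor& input, int iters = 50) {
  torch::NoGradGuard no_grad;
  for (int i = 0; i < 10; ++i) {
    model->forward(input);                 // warm-up, lets cuDNN select algorithms
  }
  torch::cuda::synchronize();              // wait for all queued GPU work
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < iters; ++i) {
    model->forward(input);
  }
  torch::cuda::synchronize();              // synchronize again before stopping the timer
  auto end = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count() / iters;
}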

To check for TensorCore usage, you could use PyProf (only Python) or nsys.

Thanks @ptrblck for your prompt response. I am actually using FP16 for inference, since that is the time-critical part. I can see a reduction in memory footprint (not the expected 2x, but less), but inference time is actually longer (by around 2x). Could this be related to the GPU's support for mixed precision?

This shouldn’t be the case. Could you post the model definition so that we could have a look?

It is a standard U-Net architecture comprising Conv2d, ConvTranspose2d, in-place ReLU, and MaxPool2d for downsampling.

It is basically the implementation on this GitHub page.
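Roughly, one encoder stage looks like this (an illustrative sketch with made-up channel sizes, not the exact code from that repository):

#include <torch/torch.h>

// Illustrative sketch of one U-Net encoder stage.
struct DownBlockImpl : torch::nn::Module {
  DownBlockImpl(int64_t in_ch, int64_t out_ch)
      : conv1(torch::nn::Conv2dOptions(in_ch, out_ch, 3).padding(1)),
        conv2(torch::nn::Conv2dOptions(out_ch, out_ch, 3).padding(1)),
        pool(torch::nn::MaxPool2dOptions(2)) {
    register_module("conv1", conv1);
    register_module("conv2", conv2);
    register_module("pool", pool);
  }

  torch::Tensor forward(torch::Tensor x) {
    x = torch::relu_(conv1->forward(x));   // in-place ReLU
    x = torch::relu_(conv2->forward(x));
    return pool->forward(x);               // 2x2 max pooling for downsampling
  }

  torch::nn::Conv2d conv1, conv2;
  torch::nn::MaxPool2d pool;
};
TORCH_MODULE(DownBlock);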