Libtorch FP16 question

Hi all,

I am using libtorch on Windows from inside my C++ code to run a model in eval mode, and everything runs just fine. What I am trying to do now is add FP16 support and check the performance boost on various NVIDIA cards. What I basically do is convert the entire model to FP16, something like

network->to(device, torch::kHalf);

and also the input data:

inputgpu = torch::autograd::make_variable(inputcpu, false).to(device, torch::kHalf);
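For context, here is a minimal self-contained sketch of what I am doing (the Sequential model and the input shape are just placeholders, not my real network):

#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Device device(torch::kCUDA);

  // Placeholder model with the kinds of layers I use (Conv2d, BatchNorm, ReLU).
  auto network = torch::nn::Sequential(
      torch::nn::Conv2d(torch::nn::Conv2dOptions(3, 64, 3).padding(1)),
      torch::nn::BatchNorm2d(64),
      torch::nn::ReLU());

  network->to(device, torch::kHalf);   // cast parameters and buffers to FP16
  network->eval();

  torch::NoGradGuard no_grad;          // inference only, no autograd graph
  auto inputcpu = torch::randn({1, 3, 512, 512});
  auto inputgpu = inputcpu.to(device, torch::kHalf);   // FP16 input on the GPU

  auto output = network->forward(inputgpu);
  std::cout << output.scalar_type() << std::endl;      // prints Half
}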

The code runs just fine and I can see that the tensors inside the model really are HalfTensors. What surprises me, however, is that I get exactly the same eval times as with FP32 - there is no performance difference even on cards with Tensor Cores like the RTX 20xx series.

As there is not much libtorch C++ code out there, does anybody have an idea why I do not see at least a minimal performance difference when running FP16? Does it depend on the model? Most layers I have are Conv2d, BatchNorm, etc.

Thanks.
A.

Hi Alex,

Did you find an answer to your question? I am facing the same issue right now.

Thanks,
-Omar

The speedup depends on the model and, of course, on other potential bottlenecks.
E.g. even if your model gets a speedup, your training might be bottlenecked by another part of the code, which will hide the speedup.
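To rule that out, you could time only the forward pass, e.g. with a rough sketch like the following (the model and input names are placeholders; the warm-up iterations and the device synchronization before reading the clock matter, since CUDA calls are asynchronous):

#include <torch/torch.h>
#include <chrono>

// Rough sketch: measure only the forward pass of an already converted model.
template <typename Model>
double time_forward_ms(Model& model, const torch::Tensor& input, int iters = 50) {
  torch::NoGradGuard no_grad;
  for (int i = 0; i < 10; ++i) {
    model->forward(input);                 // warm-up, lets cuDNN select algorithms
  }
  torch::cuda::synchronize();              // wait for all queued GPU work
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < iters; ++i) {
    model->forward(input);
  }
  torch::cuda::synchronize();              // synchronize again before stopping the timer
  auto end = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count() / iters;
}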

To check for TensorCore usage, you could use PyProf (only Python) or nsys.

Thanks @ptrblck for your prompt response. I am actually using FP16 for inference, since that is the time-critical part. I can see a reduction in memory footprint (not the expected 2x, but less), but inference time is actually longer (by around 2x). Could this be related to the GPU's support for mixed precision?

This shouldn’t be the case. Could you post the model definition so that we could have a look?

It is a standard U-Net architecture comprising Conv2d, ConvTranspose2d, in-place ReLU, and MaxPool2d for downsampling.

It is basically the implementation on this GitHub page.
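Roughly, one encoder stage looks like this (an illustrative sketch with made-up channel sizes, not the exact code from that repository):

#include <torch/torch.h>

// Illustrative sketch of one U-Net encoder stage.
struct DownBlockImpl : torch::nn::Module {
  DownBlockImpl(int64_t in_ch, int64_t out_ch)
      : conv1(torch::nn::Conv2dOptions(in_ch, out_ch, 3).padding(1)),
        conv2(torch::nn::Conv2dOptions(out_ch, out_ch, 3).padding(1)),
        pool(torch::nn::MaxPool2dOptions(2)) {
    register_module("conv1", conv1);
    register_module("conv2", conv2);
    register_module("pool", pool);
  }

  torch::Tensor forward(torch::Tensor x) {
    x = torch::relu_(conv1->forward(x));   // in-place ReLU
    x = torch::relu_(conv2->forward(x));
    return pool->forward(x);               // 2x2 max pooling for downsampling
  }

  torch::nn::Conv2d conv1, conv2;
  torch::nn::MaxPool2d pool;
};
TORCH_MODULE(DownBlock);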