Deployment in FP16? Calling model.half() only reduces memory by 7%

I’m using the standard ResNet50 model from torchvision with an added 512x128 FC layer. Calling model.half() on it only brings GPU usage down from 940 MB to 870 MB. Shouldn’t there be a more significant reduction in memory usage?
I call torch.cuda.empty_cache() after initializing the model. I’ve set torch.backends.cudnn.benchmark = True as well.

Same behavior on Tesla T4 and 2080Ti.


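import torch
import torchvision

model = torchvision.models.resnet50(pretrained=False, progress=False)
# Replace the classifier head; model.fc is still the original Linear layer here,
# so its in_features can be read before it is overwritten.
model.fc = torch.nn.Sequential(
    torch.nn.Linear(model.fc.in_features, 512),
    torch.nn.Linear(512, 128),
    torch.nn.Linear(128, 1),
)
## section to load weights ##
if use_fp16:
    model = model.half()  # cast parameters and buffers to FP16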

As far as I know, you should use TensorRT to get a real speedup and reduce GPU memory usage.
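
For context on why model.half() alone moves the total so little, here is a rough back-of-envelope estimate of how much of the reported usage can be weights at all. It is only a sketch: the 25,557,032 figure is the commonly cited parameter count for torchvision's stock ResNet50 and is an assumption here (check it with sum(p.numel() for p in model.parameters())), the helper names are made up for illustration, and BatchNorm buffers are ignored.

```python
# Back-of-envelope estimate of weight memory; needs neither a GPU nor PyTorch.

# Assumption: commonly cited parameter count of torchvision's stock ResNet50
# (1000-class head). Verify with sum(p.numel() for p in model.parameters()).
RESNET50_PARAMS = 25_557_032


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def weight_mib(num_params, bytes_per_param):
    """Size of the parameters alone, in MiB."""
    return num_params * bytes_per_param / 2**20


# Swap the original 2048 -> 1000 head for the 2048 -> 512 -> 128 -> 1 head
# from the question.
backbone_params = RESNET50_PARAMS - linear_params(2048, 1000)
head_params = (
    linear_params(2048, 512) + linear_params(512, 128) + linear_params(128, 1)
)
total_params = backbone_params + head_params

fp32_mib = weight_mib(total_params, 4)  # float32: 4 bytes per parameter
fp16_mib = weight_mib(total_params, 2)  # float16: 2 bytes per parameter

print(f"parameters:      {total_params:,}")
print(f"FP32 weights:    {fp32_mib:.1f} MiB")
print(f"FP16 weights:    {fp16_mib:.1f} MiB")
print(f"saved by half(): {fp32_mib - fp16_mib:.1f} MiB")
```

Under these assumptions the weights come to somewhat under 100 MiB in FP32, so halving them can save at most about half of that. That is the same order of magnitude as the 70 MB drop in the question. The remaining usage is most likely the CUDA context, cuDNN workspaces, and memory held by the caching allocator, none of which model.half() shrinks directly.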