Deployment in FP16? Calling model.half() only reduces memory by 7%

I’m using the standard ResNet50 from torchvision with an added FC head (in_features → 512 → 128 → 1). Calling model.half() only brings GPU memory usage down from 940 MB to 870 MB. Shouldn’t there be a more significant reduction?
I call torch.cuda.empty_cache() after initializing the model, and I’ve set torch.backends.cudnn.benchmark = True as well.

Same behavior on a Tesla T4 and a 2080 Ti.

model:

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=False, progress=False)
model.fc = torch.nn.Sequential(
    torch.nn.Linear(model.fc.in_features, 512),
    torch.nn.Linear(512, 128),
    torch.nn.Linear(128, 1),
)

## section to load weights ##

if use_fp16:
    model.half()
model.to(compute_device)  # e.g. compute_device = torch.device("cuda")
model.eval()
torch.cuda.empty_cache()
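
For reference, a minimal sketch of how to check what the weights themselves occupy, as opposed to what nvidia-smi reports for the whole process (param_bytes is just an ad-hoc helper; numbers assume a stock ResNet50 on a CUDA device):

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=False).cuda()

# Bytes actually held by the parameters (not the full process footprint)
def param_bytes(m):
    return sum(p.numel() * p.element_size() for p in m.parameters())

print(f"FP32 weights: {param_bytes(model) / 1024**2:.1f} MB")  # ~97 MB
model.half()
print(f"FP16 weights: {param_bytes(model) / 1024**2:.1f} MB")  # ~49 MB

# What the caching allocator holds vs. what nvidia-smi shows
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MB")

The gap between memory_allocated() and the nvidia-smi figure is the CUDA context plus cached blocks, which model.half() doesn’t touch.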

As far as I know, model.half() only halves the parameters, and ResNet50’s ~25 M weights are only about 100 MB in FP32, so a drop of roughly 50–70 MB is about what you’d expect. Most of the reported 940 MB is the CUDA context and cuDNN workspaces, which don’t shrink. To get a real speedup and lower GPU memory at deployment, you should use TensorRT.
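
If it helps, a rough sketch of one common route (file names and input shape are placeholders): export the FP32 model to ONNX and let trtexec build the FP16 engine, which converts the weights and selects FP16 kernels for you.

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=False).eval().cuda()
dummy = torch.randn(1, 3, 224, 224, device="cuda")

# Export a static-shape ONNX graph that TensorRT can consume
torch.onnx.export(model, dummy, "resnet50.onnx", opset_version=13,
                  input_names=["input"], output_names=["output"])

# Then build an FP16 engine offline, e.g.:
#   trtexec --onnx=resnet50.onnx --fp16 --saveEngine=resnet50_fp16.plan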
