.half() or the use of mixed precision increases model size

When I apply .half() to my model, it seems to increase GPU memory usage when I look at it in nvidia-smi. It goes from ~4.5GB to ~6GB.

I was under the impression that applying .half() would shrink my GPU usage, allowing me to run more models in parallel or at the same time.

I was running some tests to see if .half() would speed up my model (it does), but it increases GPU memory in both single-process and multiprocess modes.

import torch
import torch.multiprocessing as mp

def loadModel(model, multiProcessFlag=True, PATH=BASE_MODEL_NAME, half=False):
    if multiProcessFlag:
        # Use the spawn start method so CUDA works in child processes.
        mp.set_start_method("spawn", force=True)
    model = Net()
    model.share_memory()
    model.load_state_dict(torch.load(PATH))
    model.eval()
    model.to(device)
    if half:
        model.half()
    return model
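For reference, this is roughly how I am checking memory from inside PyTorch (a minimal sketch, reusing Net and device from above). Note that torch.cuda.memory_allocated() counts only live tensors, while nvidia-smi also includes blocks held by PyTorch's caching allocator, so the two numbers can differ:

# Minimal sketch of a memory check, reusing Net and device from above.
# torch.cuda.memory_allocated() counts only live tensors; nvidia-smi also
# shows blocks held by PyTorch's caching allocator, so the numbers differ.
import torch

model = Net().to(device).eval()
print(f"fp32: {torch.cuda.memory_allocated(device) / 2**20:.1f} MiB allocated")

model.half()
torch.cuda.empty_cache()  # return cached blocks so nvidia-smi drops too
print(f"fp16: {torch.cuda.memory_allocated(device) / 2**20:.1f} MiB allocated")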

Mixed precision also increases the memory usage when I run inference on the model (even if I do not call .half()).

So the model loads at its normal size, but the first time inference is run on it with mixed precision (inside "with autocast"), GPU memory increases as well.
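For concreteness, the inference path looks roughly like this (a sketch; inputs stands in for my real batch):

# Sketch of the mixed-precision inference call, assuming model and
# device from above; `inputs` is a placeholder for a real input batch.
import torch
from torch.cuda.amp import autocast

with torch.no_grad():
    with autocast():
        outputs = model(inputs)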

So the workaround I found is to save the model after calling .half(); when I reload that saved model, it is smaller.
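Roughly what I do (a sketch; HALF_MODEL_PATH is just a placeholder name for the converted checkpoint):

# Sketch of the workaround, reusing Net, device, and BASE_MODEL_NAME from
# above; HALF_MODEL_PATH is a hypothetical path for the fp16 checkpoint.
import torch

model = Net()
model.load_state_dict(torch.load(BASE_MODEL_NAME))
model.half()
torch.save(model.state_dict(), HALF_MODEL_PATH)

# Reloading the fp16 checkpoint into an fp16 model keeps it at half size.
model = Net().half()
model.load_state_dict(torch.load(HALF_MODEL_PATH))
model.to(device).eval()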
