What does PyTorch do when calling tensor.half()

I want to understand how PyTorch does FP16 inference. Say I have a pretrained FP32 model and I run FP16 inference by calling model.half(). Then, when I run y = model(x), does PyTorch simply compute in FP16, or are there hidden optimizations applied to make sure there is no significant drop in accuracy?


If you only do model.half() and then forward, PyTorch will just convert all the model weights to half precision and run the forward pass with them.
If you want something smarter (that keeps single-precision buffers for some ops for stability), you can check out NVIDIA's amp package.
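As a minimal sketch of the plain model.half() path (using a torchvision resnet18 as a stand-in model, not anything from the original posts), the input also has to be cast to half, and the output comes back in float16:

```python
import torch
import torchvision

# Stand-in pretrained FP32 model; any nn.Module behaves the same way.
model = torchvision.models.resnet18(pretrained=True).cuda().eval()

# .half() converts all parameters and buffers to float16 in place.
model.half()

# The input must also be float16, otherwise convs/matmuls will
# complain about mismatched dtypes.
x = torch.randn(1, 3, 224, 224, device="cuda").half()

with torch.no_grad():
    y = model(x)

print(y.dtype)  # torch.float16
```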


Thank you for the clarification.

Can nvidia-apex only be used for inference? I want to deploy my model with mixed precision, but I find that when I run it, my memory usage is not reduced.

Furthermore, when I run FP16 inference by calling model.half(), the memory usage is not reduced either.

No, you can use apex for both training and inference.

If you are checking the used memory via nvidia-smi, note that you might see the cached memory as well. torch.cuda.memory_allocated() might give you a better number.
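A small sketch of what that comparison could look like (again with a stand-in resnet18, which is an assumption, not the poster's model); torch.cuda.memory_allocated() only counts memory actually occupied by tensors, while nvidia-smi also includes the caching allocator's reserved blocks:

```python
import torch
import torchvision

def allocated_mb():
    # Memory occupied by live tensors, excluding the cached blocks
    # that nvidia-smi reports as "used".
    return torch.cuda.memory_allocated() / 1024**2

model = torchvision.models.resnet18(pretrained=True).cuda().eval()
x = torch.randn(8, 3, 224, 224, device="cuda")

with torch.no_grad():
    model(x)
print(f"FP32 allocated: {allocated_mb():.1f} MB")

model.half()
x = x.half()
with torch.no_grad():
    model(x)
print(f"FP16 allocated: {allocated_mb():.1f} MB")
```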


May I ask how I can deploy my model with FP16 to get a significant performance boost?
Is TensorRT helpful?

Depending on your model, you might just call model.half() on it.
However, IIRC you could export the vanilla FP32 model and pass it to TensorRT, which should use the specified precision.
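One common route for the TensorRT path is to export the FP32 model to ONNX first and let TensorRT apply the FP16 precision on its side. A minimal sketch, assuming the model is exportable and using a hypothetical output filename:

```python
import torch
import torchvision

# Export the vanilla FP32 model; FP16 can then be requested in the
# TensorRT builder instead of calling .half() in PyTorch.
model = torchvision.models.resnet18(pretrained=True).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "model_fp32.onnx",   # hypothetical path
    input_names=["input"],
    output_names=["output"],
    opset_version=11,
)
```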


Hi!
Does that mean the activations will also be computed in FP16?