What does PyTorch do when calling tensor.half()

I want to understand how PyTorch does FP16 inference. Say I have a pretrained FP32 model and I run FP16 inference by calling model.half(). Then when I run y = model(x), does PyTorch simply calculate in FP16 format, or are there hidden optimizations applied to make sure there is no significant drop in accuracy?

If you only do model.half() and then forward, PyTorch will just convert all the model weights to half precision and run the forward pass with that.
If you want something smarter (something that keeps single-precision buffers for some ops for numerical stability), you can check out NVIDIA's apex amp package.
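To illustrate the first point, here is a minimal sketch (the small Sequential network is a hypothetical stand-in for a pretrained FP32 model): model.half() converts every parameter and buffer in place, and the forward pass then runs in FP16, so inputs must be cast to the same dtype.

```python
import torch
import torch.nn as nn

# Hypothetical small model standing in for a pretrained FP32 network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

model.half()  # converts every parameter and buffer in place to torch.float16
model.eval()

# All parameters are now FP16.
print(all(p.dtype == torch.float16 for p in model.parameters()))  # True

# The forward pass itself runs in FP16 too, so the input must match the dtype.
if torch.cuda.is_available():
    model.cuda()
    x = torch.randn(8, 16, device="cuda").half()
    with torch.no_grad():
        y = model(x)
    print(y.dtype)  # torch.float16
```

There is no hidden re-casting: if you feed an FP32 input to the halved model, PyTorch raises a dtype mismatch error rather than silently converting.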

Thank you for the clarification.

Can nvidia-apex only be used for inference, or for training as well? I want to deploy my model with mixed precision, but I find that when I run it, the GPU memory usage is not reduced.

Furthermore, when I run FP16 inference by calling model.half(), the memory is not reduced either.

No, you can use apex for both training and inference.
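As a rough sketch of what apex.amp usage looks like for training (apex is a separate install from NVIDIA's GitHub, so the import and the GPU are guarded; the tiny Linear model and SGD optimizer are just placeholders):

```python
import torch
import torch.nn as nn

# apex is installed separately (https://github.com/NVIDIA/apex), so guard it.
try:
    from apex import amp
    HAVE_APEX = torch.cuda.is_available()
except ImportError:
    HAVE_APEX = False

model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

if HAVE_APEX:
    model = model.cuda()
    # O1 patches selected ops to run in FP16 while keeping FP32 master weights.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
    x = torch.randn(8, 16, device="cuda")
    loss = model(x).sum()
    # Loss scaling guards small FP16 gradients against underflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
else:
    print("apex (or a CUDA device) not available; skipping the amp demo")
```

For pure inference the O1-style patching applies too, but calling model.half() directly is the simpler route if your model is numerically stable in FP16.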

If you are checking the used memory via nvidia-smi, note that you might see the cached memory as well. torch.cuda.memory_allocated() might give you a better number.


May I ask how I can deploy my model with FP16 to get a significant performance boost?
Is TensorRT helpful?

Depending on your model, you might just call model.half() on it.
However, IIRC you could export the vanilla FP32 model and pass it to TensorRT, which should use the specified precision.

Does that mean the activations will also be computed in FP16?