I want to run inference at 16-bit precision (for both the model parameters and the input data). For example, I want to convert a number such as 1.123456789 to a lower-precision value (1.123300000, for example).
I wanted to ask whether the following approach is correct for reducing the weights of the linear layers in a model to 16 bits:
for layer in net_copy.modules():
    if type(layer) == nn.Linear:
        layer.weight = nn.Parameter(layer.weight.half().float())
PyTorch won’t work with half-precision weights, so I converted to half and back to float again. When I print the weights, it’s hard for me to tell whether this approach is behaving as I want it to, and the performance of the network is identical before and after this transformation.
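Rather than eyeballing printed weights, the round trip can be checked programmatically. The following is only a minimal sketch: the small `net` is a hypothetical stand-in for the real model, and the rounding is done in place with `copy_` instead of replacing the `nn.Parameter`. It checks that a second half/float round trip changes nothing, and it measures how far the weights and the outputs moved.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the real model (an assumption, not the asker's network).
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4)).eval()
net_copy = copy.deepcopy(net)

# Round only the nn.Linear weights to FP16-representable values, in place.
with torch.no_grad():
    for layer in net_copy.modules():
        if isinstance(layer, nn.Linear):
            layer.weight.copy_(layer.weight.half().float())

# Pair each original weight with its rounded counterpart.
linear_pairs = [
    (a.weight, b.weight)
    for a, b in zip(net.modules(), net_copy.modules())
    if isinstance(a, nn.Linear)
]

# If the rounding worked, another half/float round trip is a no-op.
all_representable = all(torch.equal(r, r.half().float()) for _, r in linear_pairs)

# Largest absolute change introduced by the rounding.
max_weight_change = max((o - r).abs().max().item() for o, r in linear_pairs)

# Largest absolute change in the outputs on random inputs.
x = torch.randn(32, 8)
with torch.no_grad():
    max_output_change = (net(x) - net_copy(x)).abs().max().item()

print(all_representable, max_weight_change, max_output_change)
```

FP16 keeps 11 significant bits, so each weight moves by at most a factor of about 2^-11 of its magnitude. That would be consistent with the accuracy looking unchanged.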
What kind of issue are you seeing? The current master and nightly binaries support mixed precision training as described here, so you might take a look at it, if you are using a GPU.
This is expected, as all operations would still be performed in FP32. FP16 operations, such as matrix multiplications and convolutions, can be accelerated on the GPU using Tensor Cores. If you just round the values, you won’t get any speedup.
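To illustrate the point about FP32, here is a rough sketch, again using a hypothetical stand-in `net` and rounding all parameters rather than only the linear weights. The rounded parameters and the output are still `float32` tensors. The GPU branch assumes a CUDA device and a PyTorch version that provides `torch.autocast`; it is skipped otherwise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model and input (assumptions for illustration only).
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4)).eval()
x = torch.randn(2, 8)

with torch.no_grad():
    # Rounding changes the stored values, not the dtype of the tensors.
    for p in net.parameters():
        p.copy_(p.half().float())
    out_fp32 = net(x)

print({p.dtype for p in net.parameters()})  # still float32
print(out_fp32.dtype)                       # still float32

# Only with a GPU: let autocast run eligible ops (e.g. linear layers) in FP16.
if torch.cuda.is_available():
    with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
        out_amp = net.cuda()(x.cuda())
    print(out_amp.dtype)
```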
I validated that the rounding is working properly by printing in binary, so I’m OK with that. I would just expect performance to deteriorate because of the lost precision. Maybe it is very very insignificant…
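For reference, the binary check can also be done with the standard library alone. This is a small sketch using the `struct` module's half-precision format code `"e"`; the helper names are made up for illustration. A value that has been rounded to FP16 and stored back as FP32 should have the lowest 13 mantissa bits set to zero (for values in FP16's normal range), because FP32 has 23 mantissa bits and FP16 has 10.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_bits(x: float) -> str:
    """16-bit pattern of x after rounding to FP16 (sign, 5 exponent, 10 mantissa bits)."""
    (raw,) = struct.unpack("<H", struct.pack("<e", x))
    return format(raw, "016b")


def fp32_bits(x: float) -> str:
    """32-bit pattern of x stored as FP32 (sign, 8 exponent, 23 mantissa bits)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    return format(raw, "032b")


value = 1.123456789
rounded = round_to_fp16(value)

print(rounded)             # 1.123046875
print(fp16_bits(value))    # 0011110001111110
print(fp32_bits(value))    # low mantissa bits are populated
print(fp32_bits(rounded))  # last 13 bits are all zero
```

Note that the nearest FP16 value to 1.123456789 is 1.123046875, so the rounding is in binary rather than to a fixed number of decimal digits.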
I might have misunderstood what you meant by “performance”.
Note that I was talking about the computation speedup using FP16 values on the GPU, while you seem to mean the model accuracy?
Yes, I’m not that concerned with speed.
Does PyTorch support INT16 quantization-aware training and static quantization, as it does for INT8?
I don’t think so based on this blog post and these docs.