Will quantization be supported for GPUs anytime soon? I have a project where inference speed is a major concern, and I would love to use quantization to speed it up.
I see the CPU quantization tutorial in the docs was written about six months ago, so I'm curious whether GPU support is on the developers' radar at all and whether we can expect it eventually, or even in the near future.
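For anyone landing here who hasn't read that tutorial: the int8 scheme it covers is asymmetric affine quantization, which can be sketched in plain Python. This is only an illustration of the quantize/dequantize math, not PyTorch's actual implementation, and the helper names (`choose_qparams`, `quantize`, `dequantize`) are mine:

```python
# Illustrative sketch of asymmetric affine int8 quantization,
# the scheme PyTorch's CPU quantization tutorial is based on.
# Helper names are hypothetical, not part of the PyTorch API.

def choose_qparams(xs, qmin=-128, qmax=127):
    """Pick a scale and zero point covering the observed float range."""
    mn, mx = min(xs), max(xs)
    mn, mx = min(mn, 0.0), max(mx, 0.0)  # the range must include 0.0
    scale = (mx - mn) / (qmax - qmin) or 1.0  # avoid scale == 0
    zero_point = int(round(qmin - mn / scale))
    return scale, max(qmin, min(qmax, zero_point))

def quantize(xs, scale, zero_point, qmin=-128, qmax=127):
    """float -> int8: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, int(round(x / scale)) + zero_point))
            for x in xs]

def dequantize(qs, scale, zero_point):
    """int8 -> float: x is approximately (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in qs]

xs = [-1.5, -0.2, 0.0, 0.7, 2.3]
scale, zp = choose_qparams(xs)
roundtrip = dequantize(quantize(xs, scale, zp), scale, zp)
# each value is recovered to within half a quantization step
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(xs, roundtrip))
```

The speedup on CPU comes from doing the matmuls in int8 with a dequantize at the end; the sketch just shows why the accuracy loss is bounded by the scale.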
Is there a particular reason it is not a high priority? I am still a student, but I was under the impression that inference with large models is typically done on GPUs, where quantization would be very beneficial.
I'm not sure if there is a voting process, but we (as a company) use PyTorch in production, and the inference speed of our custom BERT model is critical for us. In my opinion, inference speed is going to be essential for wider adoption of PyTorch in production and commercial applications, and this feature would be a huge step forward. My two cents.
I am looking to contribute some work in the area of quantization for multiple architectures, including FPGAs and GPUs. Are there any suggested guides on how to get started with contributing to PyTorch? Cheers!