Quantization-aware training extremely slow on GPU

That should not be the case; fake_quantize is supported on the GPU. To get more insight, can you compare the per-batch run time on GPU with and without quantization-aware training?
Also, can you share the qconfig you used for quantization-aware training? A sketch of the kind of comparison I mean is below.
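
Here is a minimal sketch of such a comparison, assuming the eager-mode API under `torch.ao.quantization` (older releases expose the same functions under `torch.quantization`). `MyModel`, the random data, and the hyperparameters are placeholders; swap in your actual model and data loader. It times the training step for the float model and for the `prepare_qat`-prepared model, and prints the qconfig so you can paste it here.

```python
import copy
import time

import torch
import torch.nn as nn
import torch.ao.quantization as tq


class MyModel(nn.Module):
    # Hypothetical stand-in for the real model.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, 3, padding=1)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(64 * 32 * 32, 10)

    def forward(self, x):
        x = self.relu(self.conv(x))
        return self.fc(x.flatten(1))


def time_per_batch(model, device, n_batches=20, batch_size=32):
    # Average wall-clock time of one full training step (fwd + bwd + update).
    model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    x = torch.randn(batch_size, 3, 32, 32, device=device)
    y = torch.randint(0, 10, (batch_size,), device=device)

    def step():
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

    for _ in range(3):          # warm-up: kernel launches, observer init
        step()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_batches):
        step()
    torch.cuda.synchronize()    # make sure all GPU work is done before stopping the clock
    return (time.perf_counter() - start) / n_batches


device = "cuda"
float_model = MyModel()
print("float model:  %.4f s/batch" % time_per_batch(copy.deepcopy(float_model), device))

qat_model = copy.deepcopy(float_model)
qat_model.qconfig = tq.get_default_qat_qconfig("fbgemm")
print("qconfig:", qat_model.qconfig)   # please include this output in your reply
tq.prepare_qat(qat_model, inplace=True)
print("QAT model:    %.4f s/batch" % time_per_batch(qat_model, device))
```

If the QAT model is only modestly slower per batch, the slowdown is expected observer/fake-quantize overhead; a much larger gap would suggest something in the setup (e.g. an unusual qconfig) is worth looking at.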