High memory consumption during quantization-aware training

Hi @Georgios_Georgiadis, one known problem is that fake_quantize modules are currently implemented as additional nodes in the computation graph, so their outputs (the fake-quantized versions of the weights and activations) are extra tensors that must be kept around during training, which adds to the memory overhead. We plan to reduce this overhead in the future by adding fused fake_quant kernels for common layers such as conv and linear.
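
If it helps to see where those extra nodes come from, here is a minimal sketch using the eager-mode QAT API (exact import paths vary a bit across versions, e.g. `torch.quantization` vs. `torch.ao.quantization`); `TinyNet` is just a made-up example model:

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq


class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(16, 10)

    def forward(self, x):
        x = self.relu(self.conv(x))
        x = x.mean(dim=(2, 3))  # global average pool
        return self.fc(x)


model = TinyNet().train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
prepared = tq.prepare_qat(model)

# Each fake_quant module below is an extra node in the computation graph;
# its output (a fake-quantized copy of a weight or activation) has to stay
# alive for the backward pass, which is where the extra memory goes.
fake_quants = [name for name, m in prepared.named_modules()
               if isinstance(m, tq.FakeQuantizeBase)]
print(f"{len(fake_quants)} fake_quant modules inserted:")
print("\n".join(fake_quants))
```

You should see `weight_fake_quant` and `activation_post_process` entries for the prepared layers; each one produces an extra tensor per forward pass on top of what the float model would store.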