FP16 or FP32 accumulate during AMP forward?

When training on the GPU with automatic mixed precision, does the forward pass use FP16 accumulation or FP32 accumulation? I’m asking because consumer GPUs have their FP32-accumulate tensor core throughput intentionally halved, and NVIDIA claims that FP16 accumulation is okay for inference but not for training, so my guess is that it’s good enough for the forward pass but not for the backward pass. I’d like to better assess whether to get something like a Titan RTX or a 3090.

FP32 accumulate is used during mixed-precision training.
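For context, here is a minimal sketch of a standard `torch.cuda.amp` training step (the toy model, optimizer, and random data are placeholders, not from this thread). Inside `autocast`, matmuls and convolutions run as FP16 tensor core ops, but cuBLAS/cuDNN accumulate in FP32 and the parameters themselves stay FP32:

```python
# Minimal torch.cuda.amp training-step sketch; toy model/optimizer/data are
# placeholders just to make it self-contained and runnable on a CUDA device.
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = torch.nn.CrossEntropyLoss()
scaler = GradScaler()

for step in range(10):
    data = torch.randn(64, 128, device="cuda")
    target = torch.randint(0, 10, (64,), device="cuda")

    optimizer.zero_grad()
    with autocast():                      # forward: eligible ops run in FP16
        output = model(data)              # FP16 GEMM, FP32 accumulation inside cuBLAS
        loss = criterion(output, target)  # loss-like ops stay in FP32
    scaler.scale(loss).backward()         # backward uses the same FP16/FP32-accumulate GEMMs
    scaler.step(optimizer)                # unscales grads and updates the FP32 parameters
    scaler.update()
```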


Thanks :pray:
What about inference?

amp can also be used for inference and would also apply the FP32 accumulation.
You could try FP16 accumulation via a custom CUDA extension and check if you see a speedup.
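For inference, a minimal sketch under `autocast` (toy model, illustrative only); the FP16 GEMM still accumulates in FP32 through cuBLAS:

```python
# Minimal AMP inference sketch; the toy model and random input are placeholders.
import torch
from torch.cuda.amp import autocast

model = torch.nn.Linear(128, 10).cuda().eval()
data = torch.randn(64, 128, device="cuda")

with torch.no_grad(), autocast():  # FP16 ops, FP32 accumulation, no autograd
    output = model(data)

print(output.dtype)  # torch.float16 - outputs of FP16-eligible autocast ops are half
```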

custom CUDA extensions = ?

I googled for existing PyTorch solutions but could only find some mentions of FP16 accumulation for inference in TensorFlow, for example here: https://blog.tensorflow.org/2019/06/high-performance-inference-with-TensorRT.html and https://www.slideshare.net/VadimSolovey/gan-training-with-tensorflow-and-tensor-cores.
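For what it’s worth, “custom CUDA extensions” here means kernels you compile and load yourself through `torch.utils.cpp_extension`. Below is a rough sketch (all names are made up) of a naive matmul whose accumulator is FP16. Note that a naive kernel like this does not use tensor cores, so it only illustrates the extension mechanism and the FP16 accumulator; a kernel that could actually show the speedup would need a tensor core path (WMMA, or cuBLAS with an FP16 compute type).

```python
# Illustrative sketch only: a naive FP16-accumulate matmul kernel compiled at
# runtime with torch.utils.cpp_extension.load_inline. All names are made up.
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
#include <cuda_fp16.h>

__global__ void matmul_fp16_acc_kernel(const __half* a, const __half* b, __half* c,
                                       int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        __half acc = __float2half(0.f);  // FP16 accumulator (the point of the sketch)
        for (int k = 0; k < K; ++k) {
            acc = __hadd(acc, __hmul(a[row * K + k], b[k * N + col]));
        }
        c[row * N + col] = acc;
    }
}

at::Tensor matmul_fp16_acc(at::Tensor a, at::Tensor b) {
    TORCH_CHECK(a.dtype() == at::kHalf && b.dtype() == at::kHalf, "expected FP16 inputs");
    const int M = a.size(0), K = a.size(1), N = b.size(1);
    auto c = at::empty({M, N}, a.options());
    const dim3 block(16, 16);
    const dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    // Note: a production kernel should launch on at::cuda::getCurrentCUDAStream().
    matmul_fp16_acc_kernel<<<grid, block>>>(
        reinterpret_cast<const __half*>(a.data_ptr<at::Half>()),
        reinterpret_cast<const __half*>(b.data_ptr<at::Half>()),
        reinterpret_cast<__half*>(c.data_ptr<at::Half>()),
        M, N, K);
    return c;
}
"""

cpp_src = "at::Tensor matmul_fp16_acc(at::Tensor a, at::Tensor b);"

ext = load_inline(name="fp16_acc_sketch", cpp_sources=cpp_src, cuda_sources=cuda_src,
                  functions=["matmul_fp16_acc"], verbose=True)

a = torch.randn(256, 256, device="cuda", dtype=torch.half)
b = torch.randn(256, 256, device="cuda", dtype=torch.half)
print(ext.matmul_fp16_acc(a, b).shape)  # torch.Size([256, 256])
```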