I definitely haven't found anything about this point in the usual sources of information.
I do use the torch.cuda.amp.autocast mixed-precision trick, with a very relevant 6x performance boost (as expected from the NVIDIA documentation, thanks to Tensor Cores!), but because of the topology of my model, I really need gradient scaling for my loss function.
Has anyone ever faced this problem? Is there a shareable solution, or should I consider a custom implementation?
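For context, the gradient-scaling logic I'm after is roughly what torch.cuda.amp.GradScaler does in Python. A minimal sketch of that algorithm in plain C++ (doubles stand in for tensors; all names here are hypothetical, not a libtorch API):

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch of dynamic loss scaling: scale the loss up before
// backward() so small fp16 gradients don't underflow, unscale the gradients
// before the optimizer step, skip the step on inf/nan, and adapt the scale.
struct LossScaler {
    double scale = 65536.0;        // initial scale (2^16, PyTorch's default)
    double growth_factor = 2.0;    // grow the scale after a run of good steps
    double backoff_factor = 0.5;   // shrink the scale on inf/nan gradients
    int growth_interval = 2000;    // good steps required before growing
    int good_steps = 0;

    // Multiply the loss by the current scale before calling backward().
    double scale_loss(double loss) const { return loss * scale; }

    // Unscale the gradients in place; return false (caller skips
    // optimizer.step()) if any gradient is non-finite, and update the scale.
    bool unscale_and_check(std::vector<double>& grads) {
        bool found_inf = false;
        for (double& g : grads) {
            g /= scale;
            if (!std::isfinite(g)) found_inf = true;
        }
        if (found_inf) {
            scale *= backoff_factor;   // back off after an overflow
            good_steps = 0;
            return false;
        }
        if (++good_steps >= growth_interval) {
            scale *= growth_factor;    // try a larger scale again
            good_steps = 0;
        }
        return true;
    }
};
```

The key point is that the scale update is driven purely by the finiteness check, which is why it has to wrap both the loss and the gradients.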
I've done a C++ implementation of the GradScaler class, following the Python implementation.
You can find the gradscaler.hpp and its gradscaler_test.hpp files in my GitHub gist:
Unfortunately, this is not tested on multi-GPU systems, but it passes the tests on a single GPU without problem.
Just be careful with the scale() method, which supports both a Tensor and an iterable of Tensors in a generic way. The method supports all std-like containers that support back insertion (std::back_inserter), building the N-dimensional output recursively.
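To illustrate the generic-container pattern I mean (a hypothetical sketch, with plain doubles standing in for torch::Tensor): one overload handles the scalar case, and a template overload handles any std-like container via std::back_inserter, calling itself recursively so nested containers also work.

```cpp
#include <algorithm>
#include <iterator>
#include <list>
#include <vector>

// Scalar case: a double stands in for a single torch::Tensor here.
double scale_value(double v, double factor) { return v * factor; }

// Container case: works for any std-like container supporting back
// insertion (std::vector, std::list, ...), recursing on each element so
// an N-dimensional nesting of containers is rebuilt with the same shape.
template <template <class, class...> class C, class T, class... Args>
C<T, Args...> scale_value(const C<T, Args...>& in, double factor) {
    C<T, Args...> out;
    std::transform(in.begin(), in.end(), std::back_inserter(out),
                   [factor](const T& x) { return scale_value(x, factor); });
    return out;
}
```

The recursion bottoms out at the scalar overload, so a std::vector<std::vector<double>> comes back as a container of the same nesting depth.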
For better compatibility, I use the c10 namespace tools where possible (optional, variant, ...).