Try updating to the latest PyTorch release with the latest CUDA runtime, then check the profiles again.
In newer PyTorch releases a few optimizers accept the `foreach` argument, which could speed up the `step()` call.
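As a rough sketch of what that looks like, here is `foreach=True` passed to `torch.optim.Adam`. The toy `Linear` model, the input shape, and the learning rate are placeholders I made up for illustration and are not taken from your setup. Whether your optimizer supports the argument depends on your PyTorch version, so check its docs first.

```python
import torch

# Hypothetical toy model, only here to give the optimizer some parameters.
model = torch.nn.Linear(16, 4)

# foreach=True requests the multi-tensor ("foreach") implementation,
# which batches the per-parameter updates instead of looping over them.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=True)

# Keep a copy of the weights so the update can be checked afterwards.
weight_before = model.weight.detach().clone()

x = torch.randn(8, 16)  # made-up input batch
loss = model(x).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The gain, if any, should show up in the profile as fewer and larger kernels inside `step()`, so it is worth comparing profiles with and without the flag.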
Also, I don’t understand what “for Hopper env” means.