Optimizer.step() is slower than backprop

Try updating to the latest PyTorch release with the latest CUDA runtime, then profile again.
In newer PyTorch releases several optimizers accept the foreach argument, which can speed up the step() call by batching the per-parameter updates into fewer kernel launches.
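
To see whether foreach helps on your setup, something like the rough sketch below might be a starting point. It assumes a PyTorch version whose Adam accepts foreach. The helper name time_step, the parameter count, and the tensor shapes are made up for illustration and are not taken from your model, so treat the numbers only as a relative comparison.

```python
import time
import torch

# Report the versions in use, since the suggestion depends on a recent release.
print("torch:", torch.__version__, "| CUDA runtime:", torch.version.cuda)

def time_step(foreach, n_params=100, size=32, iters=10):
    """Average wall-clock time of optimizer.step() (hypothetical helper)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Many small parameters: the case where per-parameter loops cost the most.
    params = [torch.nn.Parameter(torch.randn(size, size, device=device))
              for _ in range(n_params)]
    for p in params:
        p.grad = torch.randn_like(p)  # fake gradients, no backward pass needed

    opt = torch.optim.Adam(params, lr=1e-3, foreach=foreach)
    opt.step()  # warm-up so state allocation is not timed

    if device == "cuda":
        torch.cuda.synchronize()  # CUDA calls are asynchronous
    start = time.perf_counter()
    for _ in range(iters):
        opt.step()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

loop_time = time_step(foreach=False)
foreach_time = time_step(foreach=True)
print(f"for-loop step: {loop_time * 1e3:.3f} ms")
print(f"foreach step:  {foreach_time * 1e3:.3f} ms")
```

On a GPU the foreach variant would usually be expected to win when the model has many small parameters, but please verify that with your own profile rather than relying on this toy setup.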
Also, I don’t understand what “for Hopper env” means.