Optimizer.step() is slow than backprop

I am using Adam optimizer to train TD3 for offline RL setting. I saw significant increase in training time. Hence, I ran performance profiling for my code. What’s weird is backprop takes less time than update step. I ensured that I am properly using torch.cuda.synchronize. I’ve attached my code snippet for the same below.

Currently, using batch size of 256 for Hopper env and using pytorch 1.7.1.




Below is my code snippet for critic (temp1, temp2, temp3, temp4 are cuda events):

temp2.record()

critic_loss.backward()

temp3.record()
torch.cuda.synchronize()
critic_backprop_time = temp2.elapsed_time(temp3)
#print("critic_backprop_time", critic_backprop_time)
writer.add_scalar("critic_backprop_time", critic_backprop_time, self.total_it)

temp2.record()

self.critic_optimizer.step()

temp3.record()
torch.cuda.synchronize()
critic_step_time = temp2.elapsed_time(temp3)
#print("critic_step_time", critic_step_time)
writer.add_scalar("critic_step_time", critic_step_time, self.total_it)

Code snippet for actor:

temp4.record()

actor_loss.backward()

temp3.record()
torch.cuda.synchronize()
actor_backprop_time = temp4.elapsed_time(temp3)
#print("actor_backprop_time", actor_backprop_time)
writer.add_scalar("actor_backprop_time", actor_backprop_time, self.total_it)

temp4.record()

self.actor_optimizer.step()

temp3.record()
torch.cuda.synchronize()
actor_step_time = temp4.elapsed_time(temp3)
#print("actor_step_time", actor_step_time)
writer.add_scalar("actor_step_time", actor_step_time, self.total_it)

temp4.record()

Try to update to the latest PyTorch release with the latest CUDA runtime and check the profiles again.
In newer PyTorch releases a few optimizers accept the foreach argument, which could speed up the step() call.
Also, I don’t understand what “for Hopper env” means.

I updated to the latest pytorch version. After toggling foreach argument to True for Adam optimizer, it doesn’t speed up the step() call. The timing profile for step() is almost the same. For backward(), it performs slightly worse, leading to an increase of 2 minutes in total time for 10k iterations.

Hopper env stands for hopper mujoco environment.