Regarding optimizer.zero_grad

Hi,

there is no hard rule about when to use it. You should just make sure that the gradients accumulated when you call optimizer.step() are the ones you want.
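For example, you can deliberately let gradients add up over several backward calls before a single step(). A minimal sketch (the toy model, data and accum_steps value are just illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
accum_steps = 4

optimizer.zero_grad()
for i in range(accum_steps):
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = criterion(model(x), y) / accum_steps  # scale so the summed grads match a full batch
    loss.backward()                              # grads add up in each param.grad across iterations
optimizer.step()                                 # the update uses the accumulated gradients
```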
In general, you want to zero_grad() just before the backward.
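So the usual training loop looks like this (again a sketch with a toy model and random data just for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)

    optimizer.zero_grad()             # clear gradients left over from the previous step
    loss = criterion(model(x), y)
    loss.backward()                   # accumulate fresh gradients into .grad
    optimizer.step()                  # update parameters using exactly those gradients
```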
For more general problems, you can check this thread that discusses this at length: Why do we need to set the gradients manually to zero in pytorch?