Will gradients be accumulated or averaged if optim.zero_grad() is not called?

I am trying to implement multi-scale training, so I interpolate my image tensor to multiple scales, feed the scaled tensors to the model one by one, and call loss.backward() separately for each scale. I then call optim.step() only once for all the scales, and I call optim.zero_grad() only once, before the multi-scale tensors are fed into the model.
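The loop I have in mind looks roughly like this (a minimal sketch; the model, scales, loss, and tensor shapes here are made-up stand-ins, not my actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical minimal setup: a single conv layer standing in for the model.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1)
optim = torch.optim.SGD(model.parameters(), lr=0.01)

image = torch.randn(1, 3, 64, 64)   # dummy input image
target = torch.randn(1, 8, 64, 64)  # dummy dense target

scales = [0.5, 0.75, 1.0]

optim.zero_grad()                    # clear gradients once, up front
for s in scales:
    x = F.interpolate(image, scale_factor=s, mode='bilinear',
                      align_corners=False)
    t = F.interpolate(target, scale_factor=s, mode='bilinear',
                      align_corners=False)
    loss = F.mse_loss(model(x), t)
    loss.backward()                  # gradients accumulate across scales
optim.step()                         # single update from the accumulated grads
```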

I do not quite understand the mechanism of the optimizer here. Will the gradients be summed up with each backward() call, or will they be averaged? Do I need to reduce my learning rate, since the gradients are enlarged by being summed?

The gradients get accumulated (summed, not averaged) across successive backward() passes, which is why it is recommended to call zero_grad() between optimizer steps.
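You can see this directly with a tiny made-up example (a scalar parameter `w`, chosen only for illustration):

```python
import torch

# Demonstrate that backward() sums gradients into .grad rather than averaging.
w = torch.tensor([2.0], requires_grad=True)

loss1 = (w * 3).sum()            # d(loss1)/dw = 3
loss1.backward()
grad_after_first = w.grad.clone()

loss2 = (w * 5).sum()            # d(loss2)/dw = 5
loss2.backward()                 # adds onto the existing gradient

print(grad_after_first.item())   # 3.0
print(w.grad.item())             # 8.0 = 3 + 5, summed, not averaged
```

If you want averaging behaviour instead, divide each loss by the number of backward() calls (e.g. `loss = loss / len(scales)`) before calling backward().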

Thanks. One more question, please: do I need to reduce the learning rate if I follow this multi-backward method?