I am trying to implement multi-scale training: I interpolate my image tensor to multiple scales and feed the scaled tensors to the model one by one, calling loss.backward() separately for each scale. Then I call optim.step() only once for all the scales, and I call optim.zero_grad() only once, before any of the multi-scale tensors are fed into the model.
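For reference, here is a minimal sketch of the loop I mean. The model, scales, and loss are just placeholders to show the order of the zero_grad / backward / step calls, not my actual setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy model and data, only to illustrate the call order.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
optim = torch.optim.SGD(model.parameters(), lr=0.01)
image = torch.randn(2, 3, 64, 64)
target = torch.randn(2, 1, 64, 64)

optim.zero_grad()  # called once, before all scales
for scale in (0.5, 0.75, 1.0):
    x = F.interpolate(image, scale_factor=scale, mode='bilinear',
                      align_corners=False)
    t = F.interpolate(target, scale_factor=scale, mode='bilinear',
                      align_corners=False)
    loss = F.mse_loss(model(x), t)
    loss.backward()  # gradients accumulate in the parameters' .grad
optim.step()  # called once, after all scales
```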
I don't quite understand the mechanism of the optimizer here. Will the gradients be summed with each backward() call, or will they be averaged? Do I need to reduce my learning rate, since the gradients are enlarged by being summed up?