I am trying to implement multi-scale training: I interpolate my image tensor to multiple scales and feed the scaled tensors to the model one by one, calling loss.backward() separately for each scale. Then I call optim.step() only once for all the scales, and I call optim.zero_grad() only once, before any of the multi-scale tensors are fed into the model.
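For reference, here is a minimal sketch of the loop I mean. The model, scales, and loss are just placeholders to show the order of the zero_grad / backward / step calls, not my actual setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy model and data, only to illustrate the call order.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
optim = torch.optim.SGD(model.parameters(), lr=0.01)
image = torch.randn(2, 3, 64, 64)
target = torch.randn(2, 1, 64, 64)

optim.zero_grad()  # called once, before all scales
for scale in (0.5, 0.75, 1.0):
    x = F.interpolate(image, scale_factor=scale, mode='bilinear',
                      align_corners=False)
    t = F.interpolate(target, scale_factor=scale, mode='bilinear',
                      align_corners=False)
    loss = F.mse_loss(model(x), t)
    loss.backward()  # gradients accumulate in the parameters' .grad
optim.step()  # called once, after all scales
```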
I don't quite understand the mechanism of the optimizer here. Will the gradients be summed with each backward() call, or will they be averaged? Do I need to reduce my learning rate, since the gradients are enlarged by being summed up?