I have a training loop which will be running multiple forward passes for gradient accumulation; are there any special measures required for parametrizations (e.g. orthogonal) to ensure identical behaviour between a single forward pass at larger batch size vs multiple forward passes at smaller batch size?
For orthogonal I don’t think it should make a difference, but I wouldn’t be surprised if other parametrizations (especially custom ones) behaved differently when several forward passes are run before the optimizer step.
I’m asking because I’m unsure how PyTorch handles the parametrization process, especially compared to something like TensorFlow Addons, which will absolutely recompute the parametrization on every forward pass even if the parameters remain unchanged.
I haven’t tested this, but I believe that, with the exception of expected
floating-point round-off error, gradient accumulation will give you the
same result as a single larger batch even with parametrizations.
As I understand it, parametrizations are just an “automatic” framework
for interposing your parametrization function between the actual
trainable parameter and the rest of the network, and they behave no
differently, including with gradient accumulation, than if you had put
in the parametrization function by hand.
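
For what it's worth, here is a minimal sketch of how one could check this
numerically. It assumes a small `nn.Linear` layer with the built-in
`orthogonal` parametrization, equally sized micro-batches, and a mean-reduced
loss divided by the number of micro-batches. The helper `accumulated_grad`
and the toy data are made up for illustration, not taken from the original
training loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import parametrizations

torch.manual_seed(0)

# float64 keeps round-off small, so the comparison can be tight.
layer = parametrizations.orthogonal(nn.Linear(4, 4, bias=False).double())

x = torch.randn(8, 4, dtype=torch.float64)
y = torch.randn(8, 4, dtype=torch.float64)


def accumulated_grad(layer, x, y, num_chunks):
    """Hypothetical helper: accumulate gradients over `num_chunks` micro-batches."""
    layer.zero_grad()
    for xb, yb in zip(x.chunk(num_chunks), y.chunk(num_chunks)):
        # Mean loss per micro-batch, rescaled so the sum equals the full-batch mean.
        loss = F.mse_loss(layer(xb), yb) / num_chunks
        loss.backward()
    # Gradient of the underlying (unconstrained) trainable parameter.
    return layer.parametrizations.weight.original.grad.clone()


g_full = accumulated_grad(layer, x, y, num_chunks=1)   # one large batch
g_accum = accumulated_grad(layer, x, y, num_chunks=4)  # four micro-batches

grads_match = torch.allclose(g_full, g_accum)
print("gradients match:", grads_match)
```

Without caching, the parametrization is re-evaluated on every forward pass,
so each micro-batch builds its own graph back to the original parameter.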
Unfortunately, we’ve run into a second problem: when caching is enabled, the gradients are not retained between backward calls. So we’ve switched to manually computing the Q matrix on every forward pass, although this is significantly slower.
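
One possible way to evaluate the parametrization only once per optimizer step, sketched under the assumption that the layer can be applied functionally (here via `F.linear`): compute Q once, run the micro-batches against a detached leaf copy, then push the accumulated gradient back through the parametrization in a single backward call. This is not a built-in PyTorch feature, just a hand-rolled workaround; the names `Q_leaf`, `grad_once` and `grad_ref` and the toy data are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import parametrizations

torch.manual_seed(0)
layer = parametrizations.orthogonal(nn.Linear(4, 4, bias=False).double())
original = layer.parametrizations.weight.original

x = torch.randn(8, 4, dtype=torch.float64)
y = torch.randn(8, 4, dtype=torch.float64)
num_chunks = 4

# Evaluate the parametrization once for this optimizer step.
Q = layer.weight
# Detached leaf copy: each micro-batch backward accumulates into Q_leaf.grad
# and never touches the parametrization's own graph.
Q_leaf = Q.detach().requires_grad_()

for xb, yb in zip(x.chunk(num_chunks), y.chunk(num_chunks)):
    loss = F.mse_loss(F.linear(xb, Q_leaf), yb) / num_chunks
    loss.backward()

# One backward pass through the parametrization, fed the accumulated dL/dQ.
Q.backward(Q_leaf.grad)
grad_once = original.grad.clone()

# Reference: recompute Q on every forward pass (the slower approach).
layer.zero_grad()
for xb, yb in zip(x.chunk(num_chunks), y.chunk(num_chunks)):
    loss = F.mse_loss(layer(xb), yb) / num_chunks
    loss.backward()
grad_ref = original.grad.clone()

matches = torch.allclose(grad_once, grad_ref)
print("single-evaluation gradient matches reference:", matches)
```

This relies on the chain rule being linear in dL/dQ, so summing the micro-batch gradients with respect to Q before the final backward call gives the same result as backpropagating each one separately.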
One thing that comes to mind (at least for me) is to check whether torch.func.vmap (with the chunk_size option) gives consistent behavior across varying batch sizes. But if you are accumulating gradients strictly because a single forward pass at the full batch size runs out of memory, this may not be practical, since the chunked forward passes would all still have to be stored.
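
A minimal sketch of that check, assuming a toy per-sample function (`per_sample` and the fixed matrix `W` are made up for illustration and stand in for the real model):

```python
import torch
from torch.func import vmap

torch.manual_seed(0)
W = torch.randn(4, 4, dtype=torch.float64)
x = torch.randn(8, 4, dtype=torch.float64)


def per_sample(xi):
    # Hypothetical per-sample computation standing in for the model.
    return torch.tanh(W @ xi)


# Whole batch in one vectorized call.
out_full = vmap(per_sample)(x)
# Same computation, processed two samples at a time.
out_chunked = vmap(per_sample, chunk_size=2)(x)

same = torch.allclose(out_full, out_chunked)
print("chunked vmap matches full vmap:", same)
```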