I am interested in implementing an Exponential Moving Average that supports running `backward()` on it, in such a way that it can be applied to tensors with substantial graphs creating them.

The straightforward implementations create an expanding graph that includes all graphs that create the past versions of the averaged tensor, and running `backward()` quickly runs out of memory.
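
To make the problem concrete, here is a minimal sketch. The names `ema_grad`, `w`, and `value` are made up for illustration, and `w * (step + 1)` stands in for a large graph. Without detaching, every update keeps the graphs of all earlier values alive. Calling `detach()` on the previous average bounds the memory, but then the gradient only covers the latest value, so it is a truncated gradient rather than the exact one.

```python
import torch

def ema_grad(detach_history, decay=0.9, steps=3):
    """Return d(avg)/d(w) after `steps` EMA updates of values that depend on w."""
    w = torch.tensor(2.0, requires_grad=True)  # hypothetical trainable leaf
    avg = torch.zeros(())                      # running average starts at 0
    for step in range(steps):
        value = w * (step + 1)                 # stand-in for a large graph
        if detach_history:
            # Cut the previous average out of the graph: its graph can be freed,
            # but the gradient through past values is dropped.
            avg = avg.detach()
        avg = decay * avg + (1.0 - decay) * value
    avg.backward()
    return w.grad.item()

full_grad = ema_grad(detach_history=False)      # graph spans all 3 steps
truncated_grad = ema_grad(detach_history=True)  # graph spans the last step only
print(full_grad, truncated_grad)
```

The two results differ, which is exactly the trade-off in question: bounded memory versus an exact gradient through the history.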

I understand that this has already been implemented in batch normalization where the means and variances of the inputs to the layer are kept as exponential moving averages. However, it is written in C++ and I could not dig out the implementation in the code.

I think the idea is to keep all the gradients of the moving average tensor with respect to all the leaves together with the moving average (rather than with the leaf tensors); this would allow easy calculation of the gradients together with the moving average update.

However, the only way I see to implement this in Python is tricky and inefficient: subclassing `Tensor` to keep all the leaf gradients with it, and overriding `Tensor.backward()`.

Any better idea?

Batch normalization uses buffers for running_mean, running_var, i.e. they're updated without autograd.

If you're talking about trainable "momentum", with one step per batch, I think you'd have to approach this as a hyperparameter optimization / meta-learning task.
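
A small check of the buffer behaviour, as a sketch using `nn.BatchNorm1d` with its default settings (the input shape and seed are arbitrary). In training mode the output is normalized with the current batch statistics, while `running_mean` is a buffer with `requires_grad=False` that is only updated on the side.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)   # defaults: momentum=0.1, affine weight=1, bias=0
bn.train()

x = torch.randn(8, 3)
y = bn(x)

# The running statistics are buffers, not parameters, and carry no gradient.
buffer_names = {name for name, _ in bn.named_buffers()}
param_names = {name for name, _ in bn.named_parameters()}
print(buffer_names, param_names, bn.running_mean.requires_grad)

# In training mode the output uses the batch mean and the biased batch variance.
batch_mean = x.mean(dim=0)
batch_var = x.var(dim=0, unbiased=False)
expected = (x - batch_mean) / torch.sqrt(batch_var + bn.eps)

# The running mean started at 0 and moved a fraction `momentum` towards the batch mean.
expected_running_mean = bn.momentum * batch_mean
```

The running averages only enter the output after `bn.eval()`, where no gradient is needed for them.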

@googlebot Thanks for your response.

I do not quite understand you here:

Batch normalization uses buffers for running_mean, running_var, i.e. they're updated without autograd.

running_mean and running_var are used in calculating the layer output, and for this reason `backward()` needs their gradients.

As for my purpose, I want to use Exponential Moving Averages in my loss function similarly to the way they are used to calculate the layer outputs in Batch normalization.

I looked up the original Batch Normalization paper, and its authors mention moving averages in passing, for the purpose of tracking the accuracy of the model. They suggest using only the batch mean and variance for producing the layer output.

However, the torch implementation indicates that the mean and the variance are estimated using moving averages.

Are you saying that the BN code does not keep the gradients of the moving average buffers, and uses only the current batch to calculate the derivatives of the mean and variance?

Did I get this correctly?

This would add random noise to the output of `backward()`. Interesting. This requires batches to be large enough for the noise not to be too large.

These are calibration values, estimating long-term moments of layer inputs. If you make them trainable, it is no longer a normalization layer.

And to train a weighted averaging coefficient you need a sequence of at least two values in one batch.
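
One possible reading of that last point, as a sketch: the names `logit` and `seq` and the sigmoid parametrization of the coefficient are assumptions for illustration only. The averaging coefficient gets a gradient only when the average inside one graph mixes at least two values.

```python
import torch

logit = torch.zeros((), requires_grad=True)  # hypothetical trainable parameter
decay = torch.sigmoid(logit)                 # keeps the coefficient in (0, 1); 0.5 here

seq = torch.tensor([1.0, 2.0, 3.0])          # a short sequence within one batch

avg = seq[0]                                 # a single value does not depend on decay
for value in seq[1:]:
    avg = decay * avg + (1.0 - decay) * value

avg.backward()
decay_grad = logit.grad.item()               # nonzero because the average mixes several values
print(avg.item(), decay_grad)
```

With only `seq[0]` the average would not depend on `decay` at all, so there would be nothing to differentiate.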

This statement is wrong; that's not how reverse-mode autodiff works.