Divergence between sum of gradients and gradient of sums

Hello. Can somebody help me figure out whether this is normal behaviour of the model or not:

I have a model with a GRUCell in it. I’m using it in an RL setting, so I’m feeding it input data one sample at a time (no batches, no sequence tensors, just separate 1xN tensors for the input points in a loop).
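Roughly, the per-step usage looks like this (a minimal sketch with placeholder sizes and dummy inputs, not my actual sim code):

    import torch
    import torch.nn as nn

    # minimal sketch: step a GRUCell one observation at a time
    cell = nn.GRUCell(input_size=24, hidden_size=64)          # placeholder sizes
    hidden = torch.zeros(1, 64)
    observations = [torch.randn(1, 24) for _ in range(10)]    # stand-in for sim inputs

    for obs in observations:        # each obs is a 1xN tensor
        hidden = cell(obs, hidden)  # hidden state carried across steps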

And I have two identical (?) ways of calculating loss:

for i_episode in range(max_episodes):
    sim = Sim()
    sim.run(max_iters, model)
    loss = model.loss()
    loss.backward()
    model.reset_train_data()

    if i_episode % update_episode == 0 and i_episode != 0:
        optimizer.step()
        optimizer.zero_grad()

(That is, every training episode I calculate the loss across some sim iterations (<= max_iters), then backprop it, accumulating gradients; every update_episode episodes I apply them with the optimizer and zero them afterwards.)

The other way is this:

loss = torch.tensor([0.0])
for i_episode in range(max_episodes):
    sim = Sim()
    sim.run(max_iters, model)
    loss += model.loss()
    model.reset_train_data()

    if i_episode % update_episode == 0 and i_episode != 0:
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss = torch.tensor([0.0])

(Accumulate sum of losses across update_episode episodes, then backpropagate it)

These should give the same result, I suppose, but the resulting gradients differ ((grad0-grad1).abs().max() is 1.00000e-04 * 1.1635).
After 100-200 updates this causes serious divergence in the weights of the models trained the first and the second way.

It could be the result of rounding errors, but 10^-4 seems too large for that kind of error. Also, the first approach to calculating the gradient seems to have poor convergence, while the second converges better, but it has long autograd graph dependencies, which slows down calculations and sometimes causes stack overflows.
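(For reference, this is roughly how I compare the gradients of the two runs; model0 and model1 are the two identically initialized copies trained the first and the second way:)

    import torch

    def max_grad_diff(model0, model1):
        # largest absolute per-element gradient difference between two models
        # with identical architectures - this is the ~1.16e-4 figure above
        diffs = [(p0.grad - p1.grad).abs().max()
                 for p0, p1 in zip(model0.parameters(), model1.parameters())]
        return torch.stack(diffs).max()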

Any thoughts?
Thanks!

Hi,

From a quick look I would say it is one of these two:

  • Your model.reset_train_data actually changes some tensors in place that are used in the backward pass, or has some unexpected side effect.
  • If update_episode is large(ish) then yes, it can be numerical precision errors. It is expected that even the slightest difference will lead to completely different weights after training.

To check that I would:

  • Make sure the weights are the same before running the tests (see the sketch after this list). Even the slightest difference will give different gradients.
  • Check that it works for update_episode=1.
  • Check what the error is for update_episode=2: if it is already big, then it is probably the first. If the error increases when you increase update_episode, then it is most likely the second.
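To make sure both runs really start from the same point, something like this is enough (a minimal sketch; Model stands for whatever your model class is):

    import copy
    import torch

    torch.manual_seed(0)                 # fix any other source of randomness
    model_a = Model()                    # your model class (placeholder name)
    model_b = copy.deepcopy(model_a)     # bit-for-bit identical initial weights
    # or: model_b = Model(); model_b.load_state_dict(model_a.state_dict())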

Hi! Thanks for your reply!
My reset_train_data is pretty simple - it just creates new lists for storing the logprobs of actions, values and rewards, and also re-inits the hidden state of the GRUCell:

    def reset_train_data(self):
        self.hidden = torch.zeros(1, self.hidden_size)

        self.values = []
        self.logprobs = []
        self.rewards = []

Also, reset_train_data is called after each training episode in both cases, so in theory, if it affected backprop, it would affect both variants equally.

The max element difference is 0.0 when using update_episode=1, 1.00000e-09 * 1.3970 when update_episode=2, and it increases as update_episode grows.

But the more important question is why this affects convergence so drastically. Take a look at this graph:

As you can see, the two curves begin to diverge around episode 1800, and update_episode is only 10 (according to the measurements, the error is about 2e-9).

Given the error and how it changes with update_episode, it looks like numerical precision errors.
Do you fix your random seed? Does this trend of one run training well and the other not hold across many different random seeds?

Even with two different random seeds and the exact same code, especially in an RL setting, you can have wildly different behaviours unfortunately.

You’re right, when using a random seed the result is really unpredictable - sometimes sum-of-grads converges, sometimes grad-of-sums does. So it must be some weird combination of precision errors and weight initialization causing this effect on my fixed seed…

BTW, either this environment turns out to be much harder for RL than I expected, or I have some sneaky bug here. I developed a “snake”-like sim: a 40x40-square field with N random wall blocks (plus walls on the border), M “apples”, and a 3-segment snake controlled by the neural net. Every time step it receives 3x8 vectors of distances in 8 directions to the nearest wall, “apple”, and own segment (24 distances total). And it cannot learn properly, even when I disable grow-on-eating.
The best result I’ve got so far: the snake learns to avoid walls, but it is not crazy about the “apples”. When I add a 25th input to the net representing “satiation” (initialized to 100, every apple adds +100, every step decrements it by 1), training fails completely.
Well, not completely: if I enable “die-on-satiation-0”, then it fails. If I just penalize the net with a reward of -1 for each step at satiation=0, it has an amazing effect: while still learning to avoid walls (most deaths are caused by collisions), the total reward slowly rises. But once it learns to live more than 100 iterations, it begins to receive enormous penalties for “starvation” and the total reward drops to negative values. And again, all of this happens with grow-on-eating disabled!
(blue is iterations till death, orange is total reward; the graphs diverge at a value of about 100, when the snake learns to live long enough to starve)
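For context, the observation I feed the net is built roughly like this (a sketch, not my actual code; distance_to is a hypothetical helper that ray-casts from the head in one of the 8 directions until it hits the given cell type):

    import torch

    DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

    def make_observation(sim, satiation):
        # sketch of the 25-dim input: 8 distances to the nearest wall, apple and
        # own segment, plus the satiation counter (sim.distance_to is hypothetical)
        features = []
        for kind in ("wall", "apple", "self"):
            for d in DIRS:
                features.append(sim.distance_to(kind, d))
        features.append(satiation)
        return torch.tensor(features, dtype=torch.float32).unsqueeze(0)  # 1x25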

It is really confusing and I am still trying to figure out what I should tune in such cases (this RL task seemed very easy to me; I supposed even a non-recurrent net would learn the optimal behaviour in ~1000-5000 episodes…)
I am using the same actor-critic code as the PyTorch actor-critic example, so there should not be any bugs there.
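Concretely, model.loss() is structured in the spirit of that example, roughly like this (a sketch adapted to my stored lists; gamma and the exact tensor shapes are assumptions, the return/advantage computation mirrors the example):

    import torch
    import torch.nn.functional as F

    def loss(self, gamma=0.99, eps=1e-8):
        # actor-critic loss over one episode, using the self.logprobs /
        # self.values / self.rewards lists filled during sim.run()
        returns, R = [], 0.0
        for r in reversed(self.rewards):      # discounted returns
            R = r + gamma * R
            returns.insert(0, R)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + eps)

        policy_losses, value_losses = [], []
        for log_prob, value, R in zip(self.logprobs, self.values, returns):
            advantage = R - value.item()
            policy_losses.append(-log_prob * advantage)
            value_losses.append(F.smooth_l1_loss(value, R.expand_as(value)))

        return torch.stack(policy_losses).sum() + torch.stack(value_losses).sum()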

Hi,

I know that these kinds of applications tend to have very noisy behaviour. But I am not an expert in RL, so I’m not sure about your task in particular :confused:

Thanks for your help, it was great advice to check behavior on different seeds!

I’ll probably commit my code to a repo and later post a question about convergence with a link to it on this forum; maybe someone who is interested in RL will take a look.