[solved] Has anyone else encountered an unreliably repeatable issue of weights dropping to exactly zero?

I’m training an RL network, which is pretty unstable anyway. But I’ve fixed the random seeds for all the weight initializations, exploration probabilities, etc. Sometimes the weights of the network will drop to zero in a single step. It’s not easy to reproduce: it happens systematically, but on a different episode each time, even with the stochasticity fixed. It’s probably something stupid I’m doing somewhere, but I’m wondering if anyone else has encountered something similar?

(there are no NaNs, just… the weights drop to exactly zero, in a single optimization step :open_mouth: I mean, it’ll run for several thousand steps without being zero. Then bam! all zeros)

Are you using CUDA for your training?
If so, did you set torch.backends.cudnn.deterministic = True to get deterministic behavior?
It might help to debug this issue.
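For reference, a typical seeding setup looks something like this (just a sketch; `seed_everything` is an illustrative name, not a PyTorch API):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    # Fix every RNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels (may be slower).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)
```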

Also, would it be possible to post a small code snippet reproducing this error?
I’ve never seen this behavior before.

I am. At least, I tried on CPU, and the issue either occurs on a different episode (or possibly doesn’t occur at all; it’s unclear).

I’ll set deterministic to True, and see what happens.

Unfortunately, I haven’t managed to create a small reproducible snippet, since all my attempts to freeze a single training step that reproduces the error have failed to date, and the surrounding code is fairly complex.

Good heads-up on the deterministic=True. Thanks! :slight_smile:

Well, if I remember correctly, I’ve seen a similar issue before in one of my models.
It might be completely unrelated to your issue, but maybe it could give you some idea.

I created a CNN for a regression task, and apparently the training hyper-parameters (lr, momentum) and the weight init were quite bad for that model, so that after a while the weights all dropped to zero and only the bias tensor in the last layer determined the output.
The model was thus quite useless, since the output was a constant (the mean of all regression targets).
So the mean error looked good, but the input didn’t determine the output.

If I remember correctly, just the weights of the last layer were all zeros, the other weights were still “random”.

Ok, something to bear in mind. I’m not sure that’s exactly my situation, though it could be; in my case I have multiple layers, and they are all dropping to zero.

I’m still adding instrumentation, to try to figure out what’s going on.

Something curious I just found out, which points towards a NaN issue at the base followed by some other unexpected behavior on top: if I take .abs().max() over the weights, it stays non-NaN until it drops to zero at some point. However, .abs().sum() becomes NaN at some point. I reckon what’s happening is:

  • at least one weight is probably becoming NaN, combined with
  • somehow, NaN isn’t propagating through .max()

On the other hand, initial experimentation shows that .max() does propagate NaNs, so there’s something else going on. But I suspect a NaN is at the base of my actual issue somewhere.

What my code outputs at the time of crashing out:

weight_abs_max=0.809 nonzeros=2926 absavg=0.127
weights look strange weight_abs_max=0.441 nonzeros=2926 absavg=nan
weights_sum nan
non_zero_weights 2926

The checking code:

import math

abs_max = 0.0
non_zero_weights = 0
weights_sum = 0.0
for _p in params:
    # note: this is Python's built-in max, not a tensor op
    abs_max = max(abs_max, _p.abs().max().item())
    non_zero_weights += _p.nonzero().size(0)
    weights_sum += _p.abs().sum().item()
weights_avg = weights_sum / non_zero_weights
res_string = f'weight_abs_max={abs_max:.3f} nonzeros={non_zero_weights} absavg={weights_avg:.3f}'
if abs_max < 1e-8 or non_zero_weights == 0 or math.isnan(weights_avg):
    print('weights look strange', res_string)
    print('weights_sum', weights_sum)
    print('non_zero_weights', non_zero_weights)
    if self.dumper is not None:
        ...  # (dumper call omitted)
    raise Exception('weights abs max < 1e-8')

Edit: oh yes, and evidence that, normally, .max() does in fact propagate NaNs:
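Something along these lines (a minimal sketch, not my original experiment) shows the tensor-level reduction propagating the NaN:

```python
import torch

w = torch.tensor([0.1, float('nan'), 0.8])

# The tensor reductions both propagate the NaN.
print(w.abs().max())  # tensor(nan)
print(w.abs().sum())  # tensor(nan)
```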

Ah… the built-in Python max doesn’t propagate NaN. Interesting. I’m going to close this issue now. Thank you for the heads-up on the deterministic setting. Very useful :slight_smile:



(ooohhh, I think I finally figured out where the NaNs come from. I have a tensor where I only use some of the values, and the other values are undefined. I multiply this tensor by a binary mask, and add it to some other tensor. But… the undefined values can be NaN simply by random chance, and 0 * nan != 0; it still equals nan :stuck_out_tongue: )
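A minimal sketch of that failure mode, plus one way around it (using torch.where instead of a multiplicative mask is my suggestion, not from the original code):

```python
import torch

values = torch.tensor([1.0, 2.0, float('nan')])  # last slot is "undefined" junk
mask = torch.tensor([1.0, 1.0, 0.0])             # only the first two entries are wanted

masked = values * mask
print(masked)  # tensor([1., 2., nan]) -- 0 * nan == nan, so the junk leaks through

# Selecting with torch.where never multiplies the junk values at all.
safe = torch.where(mask.bool(), values, torch.zeros_like(values))
print(safe)    # tensor([1., 2., 0.])
```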


It’s good to hear you figured it out!

A bit off-topic, but @rasbt showed a nice way to check for NaN values, in case you haven’t used it:

import numpy as np
import torch

a = torch.from_numpy(np.array([1.0, 2.0, np.nan]))
print(a != a)
> tensor([ 0,  0,  1], dtype=torch.uint8)
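(For completeness: on newer PyTorch versions the comparison returns a bool tensor, and torch.isnan does the same check directly; a minimal sketch:)

```python
import torch

a = torch.tensor([1.0, 2.0, float('nan')])
print(torch.isnan(a))  # tensor([False, False,  True])
```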