Ok. Something to bear in mind. I’m not sure that is exactly my situation, although it could be: I have multiple layers, and they are all dropping to zero.
I’m still adding instrumentation, to try to figure out what’s going on.
Something that I just found out, that is curious, though points towards a nan issue at the base, followed by some other unexpected behavior on the top, is that if I take .abs().max() over the weights, this stays non nan, until it drops to zero at some point. However, .abs().sum() becomes nan at some point. I reckon what’s happening is:
- at least one weight is probably becoming nan, combined with
- somehow, nan isn’t propagating through .max()
On the other hand, initial experimentation shows that .max() does propagate nans, so there’s something else going on. But I suspect nan is at the base of my actual issue somewhere.
What my code outputs at the time of crashing out:
weight_abs_max=0.809 nonzeros=2926 absavg=0.127
...
weights look strange weight_abs_max=0.441 nonzeros=2926 absavg=nan
weights_sum nan
non_zero_weights 2926
The checking code:
abs_max = 0
non_zero_weights = 0
weights_sum = 0
for _p in params:
    abs_max = max(abs_max, _p.abs().max().item())
    _non_zero_weights = _p.nonzero().size()[0]
    non_zero_weights += _non_zero_weights
    weights_sum += _p.abs().sum().item()
weights_avg = weights_sum / non_zero_weights
res_string = f'weight_abs_max={abs_max:.3f} nonzeros={non_zero_weights} absavg={weights_avg:.3f}'
if abs_max < 1e-8 or non_zero_weights == 0 or weights_avg != weights_avg or math.isnan(weights_avg):
    print('weights look strange', res_string)
    print('weights_sum', weights_sum)
    print('non_zero_weights', non_zero_weights)
    if self.dumper is not None:
        self.dumper.dump(objects_to_dump)
    raise Exception('weights abs max < 1e-8')
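A possible explanation for the mismatch, offered as a guess rather than something confirmed in this thread: the loop above combines the per-parameter maxima with Python's built-in max(), not with the tensor's .max(). The built-in keeps its first argument unless the second one compares greater, and every comparison with nan is False, so max(abs_max, nan) returns abs_max and the nan is silently dropped, while the += on weights_sum still propagates it. The sketch below uses plain floats in place of the _p.abs().max().item() values; running_max, nan_aware_max and the numbers in per_param_max are made up for illustration.

```python
import math


def running_max(values):
    """Mimic the loop's `abs_max = max(abs_max, v)` accumulation."""
    abs_max = 0
    for v in values:
        # Built-in max keeps abs_max unless v > abs_max; nan > x is always False.
        abs_max = max(abs_max, v)
    return abs_max


def nan_aware_max(values):
    """Return nan as soon as any value is nan, otherwise the usual maximum."""
    abs_max = 0.0
    for v in values:
        if math.isnan(v):
            return float('nan')
        abs_max = max(abs_max, v)
    return abs_max


# Hypothetical per-parameter maxima; the second stands in for a tensor holding a nan.
per_param_max = [0.441, float('nan'), 0.127]

print('running max  :', running_max(per_param_max))    # nan is dropped
print('sum          :', sum(per_param_max))            # nan propagates
print('nan-aware max:', nan_aware_max(per_param_max))  # nan is reported
```

If this is what is happening, the reported weight_abs_max would be the maximum over only those parameters that contain no nan, which would fit it staying finite while absavg is nan, and it would fall to 0 once every parameter contains a nan.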
Edit: oh yes, and evidence that .max() normally does in fact propagate nans: