@albanD Please correct me if I am wrong. It seems like this is because of the way PyTorch's SGD differs from other frameworks (e.g. Caffe).

I am porting a network from Caffe and am trying to understand why, if I increase the `lr` (after a certain epoch), the network always becomes unstable (`inf` weights and `nan` loss).
~~It seems like PyTorch's SGD is more sensitive to `lr` changes because the `lr` is applied to the velocity instead of the gradients. Is there any particular reason for this choice?~~
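For context, the two momentum formulations differ in where `lr` enters the update. A minimal sketch of both rules (the function names and in-place style here are mine, just to make the contrast concrete):

```python
import torch

def pytorch_style_step(p, v, lr, momentum):
    # PyTorch's SGD: lr multiplies the whole velocity at apply time,
    # so raising lr instantly rescales the accumulated momentum too.
    v.mul_(momentum).add_(p.grad)        # v = momentum * v + g
    p.data.add_(v, alpha=-lr)            # p = p - lr * v

def caffe_style_step(p, v, lr, momentum):
    # Caffe/Sutskever SGD: lr scales only the fresh gradient, so an
    # lr change blends into the velocity gradually as momentum decays.
    v.mul_(momentum).add_(p.grad, alpha=-lr)  # v = momentum * v - lr * g
    p.data.add_(v)                            # p = p + v
```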
EDIT: I added an SGD that is more like other frameworks, and the network still becomes unstable if I increase the `lr`. Decreasing the `lr` is always fine.
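For reference, a Caffe-style SGD can be written as an `Optimizer` subclass roughly like this (a sketch, not necessarily the exact variant I tested; the class name and default momentum are illustrative):

```python
import torch
from torch.optim.optimizer import Optimizer

class CaffeStyleSGD(Optimizer):
    # Hypothetical Caffe-style SGD: v = momentum * v - lr * g; p = p + v
    def __init__(self, params, lr, momentum=0.9):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, momentum = group["lr"], group["momentum"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                # per-parameter velocity buffer, created lazily
                buf = self.state[p].setdefault("v", torch.zeros_like(p))
                # lr scales only the new gradient contribution
                buf.mul_(momentum).add_(p.grad, alpha=-lr)
                p.add_(buf)
```

Usage would be a drop-in replacement, e.g. `opt = CaffeStyleSGD(model.parameters(), lr=0.01, momentum=0.9)`.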