SGD with Nesterov momentum leads to gradient explosion

Hi everyone, I've found that my network (a modified Hourglass network) is difficult to train. With the plain SGD optimizer, training goes fine, but the gradients explode as soon as I set nesterov=True. The learning rate is 1e-4 in both cases.
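For reference, here is a minimal sketch of the two setups. The placeholder module and the momentum value of 0.9 are assumptions for illustration (PyTorch requires a nonzero momentum whenever nesterov=True); my actual model and training loop are omitted:

```python
import torch
import torch.nn as nn

# Placeholder module standing in for my modified Hourglass network
model = nn.Conv2d(3, 64, kernel_size=3, padding=1)

# Plain SGD -- trains fine:
opt_plain = torch.optim.SGD(model.parameters(), lr=1e-4)

# SGD with Nesterov momentum -- gradients explode.
# momentum=0.9 is an assumed value; nesterov=True needs momentum > 0.
opt_nesterov = torch.optim.SGD(
    model.parameters(), lr=1e-4, momentum=0.9, nesterov=True
)
```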

This is strange because SGD with Nesterov momentum is usually considered superior to plain SGD. Does anyone have an idea why my gradients are exploding?