People usually use a learning rate of 0.001 when training a network with Adam, but the appropriate learning rate for SGD varies widely (I have even seen it set to 30 when training an LSTM on the PTB dataset for a language modeling task).
Given that, is there a specific criterion or rule of thumb for setting the learning rate of SGD?
(For example, does it depend on the type of network or task?)
Or does it come down entirely to the user's experience?
It mostly comes down to experience, and it depends on the field and the problem you are solving. There are some methods that suggest a learning rate. In my experience, a good learning rate for SGD is higher than a good one for Adam.
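One technique of this kind is a learning rate range test: increase the learning rate exponentially over a short run, record the loss at each step, and stop once the loss blows up. Below is a minimal sketch in plain Python, and it may not be the method meant above. The toy linear regression problem, the names `lr_range_test` and `suggest_lr`, the sweep bounds, and the "stop at 4x the best loss" rule are all my assumptions for illustration. Picking the rate with the steepest loss drop is only one common heuristic.

```python
import math
import random


def lr_range_test(lr_min=1e-5, lr_max=10.0, num_steps=100, batch_size=32, seed=0):
    """Sweep the SGD learning rate exponentially and record (lr, loss) pairs."""
    rng = random.Random(seed)
    # Toy data (assumption): y = 3x + small Gaussian noise.
    data = []
    for _ in range(256):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))

    def full_loss(w):
        # Mean squared error over the whole dataset.
        total = 0.0
        for x, y in data:
            d = w * x - y
            total += d * d
        return total / len(data)

    w = 0.0  # single trainable weight
    best_loss = float("inf")
    history = []
    for step in range(num_steps):
        # Exponential schedule from lr_min up to lr_max.
        lr = lr_min * (lr_max / lr_min) ** (step / (num_steps - 1))
        loss = full_loss(w)
        # Stop once the loss diverges (the 4x threshold is an arbitrary choice).
        if not math.isfinite(loss) or loss > 4.0 * best_loss:
            break
        best_loss = min(best_loss, loss)
        history.append((lr, loss))
        # One plain SGD step on a random mini-batch.
        batch = rng.sample(data, batch_size)
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
    return history


def suggest_lr(history):
    """Return the learning rate whose step produced the largest loss drop."""
    best_lr, best_drop = history[0][0], 0.0
    for (lr, loss), (_, next_loss) in zip(history, history[1:]):
        drop = loss - next_loss
        if drop > best_drop:
            best_lr, best_drop = lr, drop
    return best_lr


if __name__ == "__main__":
    hist = lr_range_test()
    print("suggested SGD learning rate:", suggest_lr(hist))
```

On a real network you would run the same sweep with your own model, data loader, and optimizer, and look at the loss-versus-learning-rate curve yourself instead of trusting a single number. The result is specific to the model, batch size, and task, which fits the point above that there is no universal SGD value.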