Proper weight initialization?


I have a regression model that is highly sensitive to its weight initialization: sometimes it converges, while other times it looks like it's about to converge and then the loss explodes to NaN. My input features are scaled to [-1, 1] and my outputs are standardized. I'm using a tanh activation function with a kaiming_uniform weight initialization scheme.

How can I make training reliable, so I don't have to worry that a run will fail and force me to retrain from scratch? I should also note that I'm using the LBFGS optimizer with a lr of 0.8 and would like to keep the same level of computational performance. Any thoughts?
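For concreteness, here is a stdlib-only sketch (not my actual model) of the two uniform init bounds as I understand them, pushed through a deep tanh stack. The function names mirror the PyTorch initializers, and the width/depth values are just illustrative:

```python
import math
import random

def kaiming_uniform_bound(fan_in, gain=math.sqrt(2.0)):
    # kaiming_uniform_: W ~ U(-b, b) with b = gain * sqrt(3 / fan_in).
    # The gain sqrt(2) is derived for ReLU, not tanh.
    return gain * math.sqrt(3.0 / fan_in)

def xavier_uniform_bound(fan_in, fan_out, gain=5.0 / 3.0):
    # xavier_uniform_: b = gain * sqrt(6 / (fan_in + fan_out)),
    # with 5/3 being the gain usually recommended for tanh.
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

def forward_std(bound, width=64, depth=8, seed=0):
    # Push a [-1, 1]-scaled input through `depth` tanh layers whose
    # weights are U(-bound, bound); report the final activation std.
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        x = [math.tanh(sum(rng.uniform(-bound, bound) * xi for xi in x))
             for _ in range(width)]
    mean = sum(x) / width
    return math.sqrt(sum((v - mean) ** 2 for v in x) / width)

width = 64
kb = kaiming_uniform_bound(width)
xb = xavier_uniform_bound(width, width)
print("kaiming bound:", kb, "-> deep tanh activation std:", forward_std(kb))
print("xavier  bound:", xb, "-> deep tanh activation std:", forward_std(xb))
```

My understanding is that kaiming_uniform's default gain is derived for ReLU-family activations, so part of my question is whether the mismatch with tanh is what makes my runs init-sensitive.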
