I am running into a problem while training a fairly simple neural network in which each layer has a triangular weight matrix (alternating between lower- and upper-triangular) with a Softplus activation applied after every layer except the last. When I plot the sum of squared entries of each weight matrix's gradient over the training iterations, the curve turns out to be very noisy, with large variance.

Suspecting vanishing or exploding gradients, I tried gradient clipping with torch.nn.utils.clip_grad_norm_ and torch.nn.utils.clip_grad_value_, but it did not help. I also tried batch normalization, which only made the gradients noisier.
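For reference, this is roughly how I apply the clipping between backward() and the optimizer step (the max_norm and clip_value numbers here are just illustrative, not the exact values I tuned):

```python
import torch
from torch.nn.utils import clip_grad_norm_

# Toy model standing in for my network, just to show where clipping goes
model = torch.nn.Linear(4, 4)
x = torch.randn(8, 4)

loss = model(x).pow(2).mean()
loss.backward()

# Rescale all gradients so their global L2 norm is at most max_norm;
# clip_grad_norm_ returns the norm measured before clipping.
total_norm = clip_grad_norm_(model.parameters(), max_norm=1.0)
```

(torch.nn.utils.clip_grad_value_ is the element-wise variant; it clamps each gradient entry into [-clip_value, clip_value] instead of rescaling the whole vector.)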

I am implementing a model for probability density estimation that maps a given distribution to a normal distribution, so my loss is essentially a negative log-likelihood, which I am trying to minimize.

My custom forward function for every layer looks something like:

Each self.phi is an nn.Softplus() activation.
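A simplified sketch of one such layer (this is a paraphrase, not my exact code; the class name TriangularLayer and the masking trick for keeping the weight triangular are my own illustration):

```python
import torch
import torch.nn as nn

class TriangularLayer(nn.Module):
    """One layer with a (lower- or upper-) triangular weight matrix."""

    def __init__(self, dim, lower=True):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(dim))
        self.lower = lower
        self.phi = nn.Softplus()

    def forward(self, x, apply_activation=True):
        # Masking keeps only the triangular part, so the gradients of the
        # zeroed-out entries are zero as well.
        if self.lower:
            mask = torch.tril(torch.ones_like(self.W))
        else:
            mask = torch.triu(torch.ones_like(self.W))
        z = x @ (self.W * mask).T + self.b
        # The last layer of the network skips the activation.
        return self.phi(z) if apply_activation else z
```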
The network has 5 or 7 such layers, and I minimize the log-likelihood loss over the self.W parameters (which I believe is non-convex) using torch.optim.Adam. I monitor the mean of self.W.grad.abs() for every layer as training progresses, and the resulting curve is noisy; I am trying to figure out why.
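This is how I collect the per-layer gradient statistic after each backward pass (shown here on a stand-in nn.Sequential model rather than my actual network):

```python
import torch

# Stand-in for my network, just to demonstrate the logging
model = torch.nn.Sequential(
    torch.nn.Linear(4, 4),
    torch.nn.Softplus(),
    torch.nn.Linear(4, 1),
)
x = torch.randn(16, 4)

loss = model(x).pow(2).mean()
loss.backward()

# Mean absolute gradient per parameter tensor; these are the values
# I plot over training iterations.
grad_stats = {
    name: p.grad.abs().mean().item()
    for name, p in model.named_parameters()
    if p.grad is not None
}
```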