It seems your code tries to compute the gradients in the second backward
pass using “stale” intermediate forward activations: the parameters were already updated after the forward pass that produced them, so these activations no longer match the current parameters, which is wrong. This post explains it in more detail.
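As a rough illustration of the issue described above, here is a minimal sketch. The model, the two losses, and the helper names (`make_setup`, `stale_backward_fails`, `fixed_update`) are made up for this example and are not taken from your code. It assumes a recent PyTorch release, where autograd checks whether tensors saved for the backward pass were modified in place.

```python
import torch
import torch.nn as nn


def make_setup():
    # Hypothetical toy model and data, only for illustration.
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(16, 4)
    target = torch.randn(16, 1)
    return model, optimizer, x, target


def stale_backward_fails() -> bool:
    """Return True if the second backward pass is rejected by autograd."""
    model, optimizer, x, target = make_setup()
    out = model(x)  # one forward pass shared by both losses
    loss1 = out.pow(2).mean()
    loss2 = (out - target).abs().mean()

    loss1.backward(retain_graph=True)
    optimizer.step()  # updates the parameters in place

    try:
        # The graph still references the old parameters and activations.
        loss2.backward()
    except RuntimeError:
        return True
    return False


def fixed_update():
    """Two possible ways to avoid the stale graph."""
    model, optimizer, x, target = make_setup()

    # Option A: run all backward passes before the parameter update.
    optimizer.zero_grad()
    out = model(x)
    loss1 = out.pow(2).mean()
    loss2 = (out - target).abs().mean()
    (loss1 + loss2).backward()
    optimizer.step()

    # Option B: recompute the forward pass after each update.
    optimizer.zero_grad()
    loss1 = model(x).pow(2).mean()
    loss1.backward()
    optimizer.step()

    optimizer.zero_grad()
    loss2 = (model(x) - target).abs().mean()  # fresh activations
    loss2.backward()
    optimizer.step()
    return loss1.item(), loss2.item()


if __name__ == "__main__":
    print("second backward on stale graph raised:", stale_backward_fails())
    print("losses after fixed updates:", fixed_update())
```

Which option fits depends on what you want: Option A applies a single update from the combined gradients, while Option B applies two separate updates and pays for an extra forward pass.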