I am trying to fine-tune BART for a classification problem, but the plots of the gradients look weird. My first impression was an underflow or overflow issue, so I tried clipping the norm of the gradients, but that did not help.
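For context, here is a minimal sketch of what clipping by global norm does, written in plain Python rather than taken from my actual training code. The helper names `global_norm` and `clip_by_global_norm` and the sample values are made up for illustration. As far as I understand, `torch.nn.utils.clip_grad_norm_` applies the same kind of rescaling to the parameters' gradients.

```python
import math

def global_norm(grads):
    """L2 norm over all gradient values, treated as one flat vector."""
    return math.sqrt(sum(g * g for g in grads))

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads down so their global L2 norm is at most max_norm.

    Hypothetical helper for illustration only: the scale factor is
    max_norm / (norm + eps), and it is applied only when it is below 1.
    """
    norm = global_norm(grads)
    scale = min(1.0, max_norm / (norm + eps))
    return [g * scale for g in grads], norm

# Made-up gradient values: one exploding case, one vanishing case.
large = [30.0, 40.0]
small = [1e-9, 2e-9]

clipped_large, norm_large = clip_by_global_norm(large, max_norm=1.0)
clipped_small, norm_small = clip_by_global_norm(small, max_norm=1.0)

print(norm_large, global_norm(clipped_large))  # large gradients are scaled down
print(norm_small, global_norm(clipped_small))  # tiny gradients are left as they are
```

If this picture is right, clipping only ever shrinks gradients whose norm is above the threshold, so it could help with overflow-like spikes but would leave gradients that are already tiny untouched.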
Any suggestions?