How should I interpret these gradient histograms?

Hi. I’m currently working on a personal implementation of the Transformer architecture. The code I’ve written as here.

The problem that I’m facing is that I believe my model isn’t training properly and I’m not sure what kind of measures I should take to fix that. I’ve come to this conclusion after using Weights & Biases to visualize the model’s gradient histograms and they look something like this:

The gradients seem to quickly converge to zero. There is a portion of code that contains a feedforward neural network that uses ReLU activation, and I changed this to Leaky ReLU under the suspicion that dying ReLU’s may be the problem. However, using Leaky ReLU’s doesn’t help and just prolongs the zero-convergence.

Any feedback on what else I may try is appreciated.