Very slow backprop - backward pass 15x to 50x slower than forward

Hello everyone! I know this isn't the first post about slow backprop, but I've spent days trying to figure out why the backward pass of the model I assembled (made of several publicly available NNs) is 15x to 50x slower than the forward pass, and I'm hoping someone here can point me in the right direction.

I’m trying to build a siamese hierarchical attention network to embed text documents. The model gist:
Link to model Gist

From a high-level perspective, I'm encoding each sequence of characters into a word vector, each sequence of words into a sentence vector, and the sequence of sentence vectors into the document representation, then using CosineEmbeddingLoss for the loss.
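
To make the structure concrete, here is a much-simplified sketch of that hierarchy (plain GRUs, made-up sizes, no attention, so not the actual code from the gist):

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Chars -> word vectors -> sentence vectors -> one document vector."""
    def __init__(self, n_chars=128, char_dim=16, word_dim=32, sent_dim=64, doc_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_rnn = nn.GRU(char_dim, word_dim, batch_first=True)
        self.word_rnn = nn.GRU(word_dim, sent_dim, batch_first=True)
        self.sent_rnn = nn.GRU(sent_dim, doc_dim, batch_first=True)

    def forward(self, docs):
        # docs: (batch, n_sents, n_words, n_chars) of character ids
        b, s, w, c = docs.shape
        chars = self.char_emb(docs.view(b * s * w, c))
        _, word_vecs = self.char_rnn(chars)          # last hidden state = word vector
        _, sent_vecs = self.word_rnn(word_vecs.squeeze(0).view(b * s, w, -1))
        _, doc_vecs = self.sent_rnn(sent_vecs.squeeze(0).view(b, s, -1))
        return doc_vecs.squeeze(0)                   # (batch, doc_dim)

# Siamese setup: both documents share one encoder, and CosineEmbeddingLoss
# pulls matching pairs together and pushes mismatched pairs apart.
enc = HierarchicalEncoder()
loss_fn = nn.CosineEmbeddingLoss()
doc_a = torch.randint(0, 128, (2, 3, 4, 5))   # (batch, sents, words, chars)
doc_b = torch.randint(0, 128, (2, 3, 4, 5))
target = torch.tensor([1.0, -1.0])            # 1 = similar pair, -1 = dissimilar
loss = loss_fn(enc(doc_a), enc(doc_b), target)
loss.backward()
```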

On dummy data (documents with at most a few tokens and sentences) the profiler shows the backward pass about 15x slower than the forward (see the profiler log below, ordered by time). On real data (long documents) the backward pass is ~50x slower than the forward.

Profiler output, ordered by time:
Link to profiler report
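
In case it helps to reproduce the measurement, the forward/backward split can be isolated with `record_function` labels, roughly like this (a minimal sketch with a stand-in model, not my actual setup):

```python
import torch
import torch.nn as nn
from torch.autograd import profiler

# Stand-in model, just to show the measurement pattern.
model = nn.Sequential(nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
x = torch.randn(8, 64)

with profiler.profile() as prof:
    with profiler.record_function("forward"):
        out = model(x).sum()
    with profiler.record_function("backward"):
        out.backward()

# Sorting by total CPU time surfaces the most expensive ops.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```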