About gradients and gradient clipping on LSTM!

Hi there, I’m implementing a custom LSTM with 3 hidden layers using LSTMCells. I know that with RNNs we must be careful about exploding gradients, so we need to use gradient clipping, but does this apply to LSTMs too?

Also, how can I see the gradient values so I know which threshold to set for gradient clipping? And how do I know when the gradients are exploding?

Thanks!!

In “Sequence to Sequence Learning with Neural Networks” (which might be considered a bit old by now), the authors claim:

Although LSTMs tend to not suffer from the vanishing gradient problem, they can have exploding gradients. Thus we enforced a hard constraint on the norm of the gradient [10,25] by scaling it when its norm exceeded a threshold. …

So I would assume that LSTMs can also suffer from exploding gradients.

You could register a hook for each parameter and check the gradient norm, min/max values, etc. via:

for param in model.parameters():
    param.register_hook(your_check_method)
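The hook receives the gradient tensor of that parameter, so the check method can simply inspect it. A minimal sketch of what your_check_method could look like (the name and the threshold are just placeholders, not a recommendation):

def your_check_method(grad):
    # Returning None keeps the gradient unchanged; we only inspect it here
    norm = grad.norm().item()
    if norm > 100.0:  # placeholder threshold, tune for your model
        print(f'large gradient: norm={norm:.2f}, '
              f'min={grad.min().item():.4f}, max={grad.max().item():.4f}')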

You could use the same approach as used in the linked paper, but you might also find more recent papers with more “modern” approaches.
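For reference, the norm-based scaling described in the paper corresponds to torch.nn.utils.clip_grad_norm_ in PyTorch. A minimal sketch of where it goes in the training loop, assuming model, optimizer, criterion, inputs and targets already exist (max_norm=1.0 is just a common starting point):

loss = criterion(model(inputs), targets)
loss.backward()
# Rescale all gradients so that their global norm does not exceed max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()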

This code visualizes the vanishing gradient problem. If you want to see when and where the gradients are exploding, you can play around with the top value of plt.ylim:

plt.ylim(bottom = -0.001, top = 0.02)

import torch
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D


@torch.no_grad()
def plot_grad_flow(named_params, path):
    """Plot the mean and max absolute gradient of each (non-bias) parameter."""
    avg_grads, max_grads, layers = [], [], []
    plt.figure(figsize=(10, 20))

    for n, p in named_params:
        # Skip biases and parameters that have no gradient yet
        if p.requires_grad and p.grad is not None and 'bias' not in n:
            layers.append(n)
            # .item() moves the value to the CPU so matplotlib can plot it
            avg_grads.append(p.grad.abs().mean().item())
            max_grads.append(p.grad.abs().max().item())

    plt.bar(np.arange(len(max_grads)), max_grads, alpha=0.1, lw=1, color='c')
    plt.bar(np.arange(len(avg_grads)), avg_grads, alpha=0.1, lw=1, color='b')
    plt.hlines(0, 0, len(avg_grads) + 1, lw=2, color='k')
    plt.xticks(range(len(avg_grads)), layers, rotation='vertical')
    plt.xlim(left=0, right=len(avg_grads))
    plt.ylim(bottom=-0.001, top=0.02)  # Zoom into the lower gradient regions
    plt.xlabel('Layers')
    plt.ylabel('Gradient magnitude')
    plt.title('Gradient Flow')
    plt.grid(True)
    plt.legend(
        [Line2D([0], [0], color='c', lw=4),
         Line2D([0], [0], color='b', lw=4),
         Line2D([0], [0], color='k', lw=4)],
        ['max-gradient', 'mean-gradient', 'zero-gradient'])

    plt.savefig(path)
    plt.close()
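One possible way to call it (again assuming model, criterion, optimizer, inputs and targets exist) is right after backward(), once the .grad attributes are populated:

loss = criterion(model(inputs), targets)
loss.backward()
plot_grad_flow(model.named_parameters(), 'grad_flow.png')
optimizer.step()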

Output: (gradient flow bar plot)

You can try LayerNorm instead of clipping, though the output space becomes a bit odd, since the dimensions are no longer independent.
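If you go that route, one simple variant is to wrap nn.LSTMCell and normalize the hidden state between steps. This is just a sketch (the class name is hypothetical, and the canonical layer-normalized LSTM normalizes the gate pre-activations instead):

import torch.nn as nn

class LayerNormLSTMCell(nn.Module):
    # Hypothetical wrapper: applies LayerNorm to the hidden state only
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.ln = nn.LayerNorm(hidden_size)

    def forward(self, x, state=None):
        h, c = self.cell(x, state)
        return self.ln(h), c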