Notebook to demonstrate exploding/vanishing gradients for the RNN vs. the LSTM

I am trying to demonstrate to myself that the standard RNN architecture struggles with vanishing/exploding gradients over long sequences, whereas the LSTM is designed to mitigate this.

I have read the theory behind this and believe it to be true (the Cross Validated answer "Why do RNNs have a tendency to suffer from vanishing/exploding gradient?", the Pascanu et al. 2012 arXiv paper, and Nielsen's book at neuralnetworksanddeeplearning.com), but I want to build an experiment that demonstrates it.
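As I understand the core argument (following Pascanu et al.), backpropagation through time multiplies together the Jacobians of the recurrent transition, so the gradient of a loss at step $T$ with respect to the hidden state at an earlier step $t$ is

$$\frac{\partial \mathcal{L}_T}{\partial h_t} = \frac{\partial \mathcal{L}_T}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}},$$

and the norm of that product tends to shrink or blow up roughly exponentially in $T - t$. That is why I expect a longer seq_length to make the effect more visible.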

I have tried mocking up something here.

The main questions I have:

  1. My hypothesis is that an increased seq_length makes the gradient issues worse. However, I don't see much evidence for this, despite expecting it to be true based on my reading of the literature above (see the sketch at the end of this post for the kind of comparison I have in mind).

  2. Perhaps my design of the experiment is wrong. I am using the following code to extract the gradients:

if "flat_grads" not in globals().keys():
# if True:
    lstm.zero_grad()

    for ix, data_dict in enumerate(tqdm(all_dl, desc="Calculating Parameter Gradients")):
        x_d = data_dict["x_d"]
        y = data_dict["y"]

        # run forward pass
        y_hat = lstm(data_dict)
        loss = loss_fn(y.squeeze(), y_hat.squeeze())
        loss.backward()
    
    # get the gradients for all of the parameters
    lstm_grads = [pm.grad for pm in lstm.lstm.parameters()]
    flat_grads = np.concatenate([p.flatten() for p in lstm_grads])

I am expecting that loss.backward() and pm.grad are enough to obtain the gradient of the loss with respect to each parameter.
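As a sanity check of that assumption, here is a minimal standalone sketch (a toy linear model standing in for my LSTM, for illustration only) of the autograd behaviour I am relying on: successive loss.backward() calls accumulate into each parameter's .grad until zero_grad() is called.

import torch
import torch.nn as nn

# toy model and loss, purely illustrative
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()

model.zero_grad()
for _ in range(3):
    x = torch.randn(8, 4)
    y = torch.randn(8, 1)
    loss = loss_fn(model(x), y)
    loss.backward()  # gradients are *summed* into p.grad across these calls

# every parameter that contributed to the loss now has a .grad tensor
grads = [p.grad for p in model.parameters()]
print([g.shape for g in grads])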

Is this the correct way to go about this?
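For question 1, the kind of comparison I have in mind is roughly the following sketch (synthetic data and plain nn.RNN / nn.LSTM modules, not my actual dataset or model). It measures the norm of the gradient of a final-step loss with respect to the earliest input timestep, which is the quantity I would expect to decay faster for the RNN as seq_length grows:

import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 8, 32

def first_step_grad_norm(cell, seq_len):
    # synthetic input with requires_grad so we can read d(loss)/d(x_t) per timestep
    x = torch.randn(1, seq_len, input_size, requires_grad=True)
    out, _ = cell(x)
    # loss depends only on the final timestep's output
    loss = out[:, -1].pow(2).mean()
    loss.backward()
    # gradient norm at the earliest timestep: the signal that should shrink with seq_len
    return x.grad[:, 0].norm().item()

for seq_len in [10, 50, 100, 200]:
    rnn = nn.RNN(input_size, hidden_size, batch_first=True)
    lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
    print(seq_len,
          "RNN:", first_step_grad_norm(rnn, seq_len),
          "LSTM:", first_step_grad_norm(lstm, seq_len))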