I am trying to demonstrate to myself that the standard RNN architecture struggles to learn long-range dependencies, because gradients vanish (or explode) as they are backpropagated through long sequences.
I can read about the theory behind this and believe it to be true (e.g. the Cross Validated answer "Why do RNNs have a tendency to suffer from vanishing/exploding gradient?", the Pascanu et al. 2012 arXiv paper, and Nielsen's book at neuralnetworksanddeeplearning.com). But I want to create an experiment to demonstrate that it is true.
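My rough mental model of the mechanism: the gradient reaching the early time steps is a product of one Jacobian per step, so its norm shrinks (or blows up) roughly geometrically with sequence length. A toy illustration of just that repeated product, using a made-up recurrent weight matrix rather than anything from my actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 32
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # made-up recurrent weight

# Each BPTT step multiplies the upstream gradient by (roughly) W.T,
# so the gradient that reaches step 0 shrinks/grows geometrically in T.
for T in [5, 20, 50, 100]:
    grad = np.ones(hidden_size)  # stand-in for dLoss/dh_T
    for _ in range(T):
        grad = W.T @ grad  # linear part of one backward step (ignoring the tanh derivative)
    print(T, np.linalg.norm(grad))
```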
I have tried mocking up something here.
The main questions I have:

- My hypothesis is that an increased `seq_length` makes the gradient issues worse; however, I don't see much evidence for this, despite expecting it to be true based on my reading of the literature above (see the toy comparison sketched after the code below).
- Perhaps my design of the experiment is wrong. I am using the following code to extract the gradients:
if "flat_grads" not in globals().keys():
# if True:
lstm.zero_grad()
for ix, data_dict in enumerate(tqdm(all_dl, desc="Calculating Parameter Gradients")):
x_d = data_dict["x_d"]
y = data_dict["y"]
# run forward pass
y_hat = lstm(data_dict)
loss = loss_fn(y.squeeze(), y_hat.squeeze())
loss.backward()
# get the gradients for all of the parameters
lstm_grads = [pm.grad for pm in lstm.lstm.parameters()]
flat_grads = np.concatenate([p.flatten() for p in lstm_grads])
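For reference, the kind of comparison I am hoping my experiment will show is something like this toy check with a plain `nn.RNN` (a made-up stand-in, not my actual model or data): backprop a loss on the final output of random sequences of different lengths and compare how much gradient survives back to the first time step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=32, batch_first=True)  # toy vanilla RNN, made-up sizes
head = nn.Linear(32, 1)

for seq_length in [5, 20, 50, 100]:
    x = torch.randn(16, seq_length, 8, requires_grad=True)  # random batch, purely illustrative
    out, _ = rnn(x)
    loss = head(out[:, -1]).pow(2).mean()  # loss depends only on the final time step
    loss.backward()
    # how much gradient survives back to the very first time step's input
    print(seq_length, x.grad[:, 0].norm().item())
```

If the vanishing-gradient story is right, the printed norm should drop sharply as `seq_length` grows.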
I am expecting that `loss.backward()` and `pm.grad` are enough to calculate the gradients of each parameter.
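Part of that expectation is that repeated `backward()` calls accumulate into `param.grad` until I zero it, which is why I only call `zero_grad()` once before the loop. A minimal sanity check of that behaviour on a throwaway tensor (nothing to do with my model):

```python
import torch

w = torch.ones(3, requires_grad=True)
w.sum().backward()
print(w.grad)          # tensor([1., 1., 1.])
(2 * w.sum()).backward()
print(w.grad)          # tensor([3., 3., 3.]) -- summed into the existing grad, not replaced
w.grad = None          # equivalent to zeroing before the next measurement
```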
Is this the correct way to go about this?