First post here, forgive me if I’m breaking any conventions…
I’m trying to train a simple LSTM on time series data where the input (x) is 2-dimensional and the output (y) is 1-dimensional. I’ve set the sequence length to 60 and the batch size to 30, so x has size [60, 30, 2] and y has size [60, 30, 1]. Each sequence is fed through the model one timestep at a time, and the resulting 60 losses are averaged. I then want to backpropagate this averaged loss to do a parameter update.
```python
for i in range(num_epochs):
    # hidden state is (re)initialised once per epoch
    model.hidden = model.init_hidden()
    for j in range(data.n_batches):
        x, y = data.next_batch(0)
        lst = torch.zeros(1, requires_grad=True)
        # feed the sequence one timestep at a time and accumulate the loss
        for t in range(x.shape[0]):
            y_pred = model(x[t:t+1, :, :])
            lst = lst + loss_fn(y_pred, y[t].view(-1))
        lst /= x.shape[0]  # average over the 60 timesteps
        optimizer.zero_grad()
        lst.backward()
        optimizer.step()
```
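In case it matters, the model itself is just a thin wrapper around nn.LSTM plus a linear output layer, with the hidden state stored on the module so it carries over between calls. My real class is a bit messier, but it is roughly along these lines (simplified sketch; hidden_size etc. are placeholders):

```python
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    """Simplified stand-in for my actual model (input_size=2, output_size=1)."""

    def __init__(self, input_size=2, hidden_size=32, batch_size=30, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        self.lstm = nn.LSTM(input_size, hidden_size)   # expects [seq_len, batch, input_size]
        self.linear = nn.Linear(hidden_size, output_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # (h_0, c_0), each of shape [num_layers, batch, hidden_size]
        return (torch.zeros(1, self.batch_size, self.hidden_size),
                torch.zeros(1, self.batch_size, self.hidden_size))

    def forward(self, x):
        # x is a single timestep of shape [1, batch, input_size]
        out, self.hidden = self.lstm(x, self.hidden)
        return self.linear(out.squeeze(0)).view(-1)
```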
This gives me the error that I’m trying to backward through the graph a second time and must specify retain_graph=True. My questions are:
- Why is retain_graph=True necessary? To my understanding, I am “unfolding” the network 60 timesteps and only doing a single backward pass at the end of each sequence. What exactly needs to be remembered from batch to batch?
- Is there a better or more efficient way of doing truncated backpropagation? I was thinking I could backpropagate the loss every time a single timestep is unfolded, but I’m not sure whether that would be a big improvement. See here (https://r2rt.com/styles-of-truncated-backpropagation.html) for what I mean - specifically the figure just before the section titled “Experiment design”. I’ve also put a rough sketch of the idea at the end of this post.
- Any other comments or suggestions on the code are appreciated… I’m relatively new to PyTorch, so I’m not sure what the best practices are.
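For the second question, here is roughly what I had in mind - calling backward once per timestep instead of once per sequence, and detaching the hidden state so each step only backpropagates through its own unfold (the simplest one-step-back variant; the r2rt post keeps a longer window of past steps). Just a sketch using the same model/data/loss_fn/optimizer objects as above, not tested:

```python
for i in range(num_epochs):
    model.hidden = model.init_hidden()
    for j in range(data.n_batches):
        x, y = data.next_batch(0)
        for t in range(x.shape[0]):
            # cut the graph so backward only goes through this one unfold
            model.hidden = tuple(h.detach() for h in model.hidden)
            y_pred = model(x[t:t+1, :, :])
            loss = loss_fn(y_pred, y[t].view(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Would that be a reasonable way to do it, or is there a more idiomatic pattern in PyTorch?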