Help w/ Truncated BPTT

I am trying to implement truncated backpropagation through time with an LSTM module (with k1 = k2). I’m running into a few specific problems that I’m not sure how to solve.

  1. When I call loss.backward() a second time (intending to backpropagate only through the next k1 = k2 steps), I get the error below. Since I’m calling optimizer.zero_grad(), shouldn’t that reset all the gradients in the network and let me start the gradients anew? Why is retain_graph needed in this case?

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

  2. Even if I completely comment out the backward call, as I’ve done in the attached gist code, I get an out-of-memory error on the second iteration (shown below). This confuses me, because the GPU has enough memory to hold both the network’s parameters and the training examples on the first iteration but not on the second. Is the network keeping some unnecessary data from the previous iteration?

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
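
Both errors above are consistent with the hidden state being carried from one chunk to the next while still attached to the previous chunk's computation graph. optimizer.zero_grad() only clears the .grad tensors of the parameters; it does not touch the graph that the hidden state points to. A minimal sketch of a k1 = k2 loop that detaches the state between chunks is shown here. The model, sizes, random data, and variable names are all hypothetical and are not taken from the gist; it runs on CPU to stay self-contained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: every size and name here is illustrative only.
seq_len, batch, n_in, n_hidden, k = 20, 4, 3, 8, 5  # k plays the role of k1 = k2

lstm = nn.LSTM(n_in, n_hidden)
head = nn.Linear(n_hidden, 1)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
criterion = nn.MSELoss()

inputs = torch.randn(seq_len, batch, n_in)    # (time, batch, features)
targets = torch.randn(seq_len, batch, 1)

state = None   # nn.LSTM starts from zeros when no state is passed
losses = []
for start in range(0, seq_len, k):
    x = inputs[start:start + k]
    y = targets[start:start + k]

    if state is not None:
        # Keep the values of (h, c) but cut the link to the previous chunk's
        # graph, so backward() stops at the chunk boundary and the old graph
        # can be garbage-collected.
        state = tuple(s.detach() for s in state)

    optimizer.zero_grad()
    out, state = lstm(x, state)
    loss = criterion(head(out), y)
    loss.backward()            # only traverses the current k steps
    optimizer.step()
    losses.append(loss.item()) # .item() stores a float, not a graph-bearing tensor
```

If the detach line is removed, the second loss.backward() has to walk back into the first chunk's graph, whose buffers were already freed, which is the situation the retain_graph message describes. With backward() commented out entirely, nothing ever frees the graph, so an attached hidden state keeps every earlier chunk alive, which may explain memory growing from one iteration to the next.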

My full code is available here: