Feeding a Custom Gradient into LSTMs

Hi there,

I am currently building a combined CNN / LSTM model, where the CNN builds an input feature vector for each frame of a video sequence. I then loop over these feature vectors and feed them into an LSTMCell.
After each LSTM step, I extract the last 4 elements of the hidden state, save them in a list, and use them to calculate a custom gradient, which is fed back into the hidden state after T frames.
After the first sequence, however, I think the hidden state doesn't get freed from the graph, as subsequent calls to .backward(gradient) don't seem to have an effect.

In (pseudo-)code:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch import autograd
from tqdm import tqdm

lstm = nn.LSTMCell(4096 + num_coords, hidden_size)

hidden = (autograd.Variable(torch.zeros(1, hidden_size)),
          autograd.Variable(torch.zeros(1, hidden_size)))

if cuda_:
    print('CUDA enabled.')
    net.cuda()   # net is the pretrained CNN that builds the feature vectors
    lstm.cuda()

optimizer = optim.Adam(list(lstm.parameters()), lr = args.learning_rate)

def detach(states):
    return [state.detach() for state in states]

if __name__ == '__main__': 
    for u in range(args.epochs):
        for i, video in tqdm(enumerate(dataloader)):
            for j in range(frames_per_video):

At this point, we build the input feature vectors, denoted lstm_input below, using a pretrained CNN.

                lstm_input = build_feature_vector_using_a_CNN()
                hidden = (autograd.Variable(torch.zeros(1, hidden_size)), 
                          autograd.Variable(torch.zeros(1, hidden_size)))
                optimizer.zero_grad()
                hidden = detach(hidden)
                for u in range(len(lstm_input)):
                    hidden = lstm(lstm_input[u, :], hidden)
                    mu  = hidden[0][0, -4:] # extract last elements of each hidden layer
                    # save these mu in a list
                # CALCULATE CUSTOM_GRAD EXTERNALLY
                if j==0:
                    print('Gradient: ', CUSTOM_GRAD) # THIS GRADIENT IS LARGE
                hidden[0][0, -num_coords:].backward(CUSTOM_GRAD)
                                
                for f in lstm.parameters():
                    print('grad sum is')
                    print(np.sum(f.grad.cpu().data.numpy())) # THIS GRADIENT IS DECLINING
                optimizer.step()
                optimizer.zero_grad()

I can’t seem to solve this. My guess is that there is some element in my “external computation” that prevents the hidden state from being detached. Why, though, does detaching not have an effect? :slight_smile:

hidden = detach(hidden)

Thanks for any help.

In your code sample, you run hidden = detach(hidden) just after creating hidden from new Variables.

detach can only remove the existing computation graph of hidden, and at the point where you call detach there is no computation graph attached to hidden yet.
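
To illustrate with a minimal sketch (not from your code; plain example names):

import torch
from torch.autograd import Variable

h = Variable(torch.zeros(1, 8), requires_grad=True)
print(h.grad_fn)     # None - a freshly created Variable has no graph yet
out = h * 2          # an operation attaches a graph to its result
print(out.grad_fn)   # a backward function (the graph node that produced out)
out = out.detach()   # cuts the result out of that graph
print(out.grad_fn)   # None again - backward() can no longer reach h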

Thanks - that makes sense. However, this still doesn’t fix it. Is it possible that the graph survives due to my “external computation of the gradient”, which uses a slice of the hidden state - e.g. if I use

slice_of_mu = hidden[0][0, -num_coords:].cpu().data

as a baseline for these computations?

hidden is a Variable and as such, can have a computation graph.
hidden.data is the underlying Tensor that the data is stored in. Tensors do not have computation graphs.

So slice_of_mu has no computation graph attached to it.
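
A quick way to check (same kind of sketch as above):

import torch
from torch.autograd import Variable

h = Variable(torch.zeros(1, 8), requires_grad=True)
out = (h * 2)[0, -4:]   # a Variable with a graph attached
print(out.grad_fn)      # a slicing backward node - part of the graph
raw = out.cpu().data    # .data returns the underlying Tensor
# raw has no graph attached: backward() cannot reach h through it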

Thanks.

So, if I delete every Variable of my model at the end of some of my loops, is there a chance a computational graph persists?

Well even without deleting the Variables, so long as your loop is written in a way that doesn’t use results from the previous iteration, the following will happen. On each successive iteration, calculations will be made using new inputs, new targets and the parameters contained in build_feature_vector_using_a_CNN and lstm. As none of these have any computation graphs attached to them, the results of those calculations will build new graphs that have no connection to the graphs of the previous iteration.

Deleting all the Variables allows Python to garbage collect all the computation graphs, but assigning new results to the same python variable names does the same thing.
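
For example, a minimal sketch of that pattern (a toy linear model standing in for your CNN + LSTM pipeline):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)                 # stand-in for the real model
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for step in range(10):
    inputs = torch.randn(8, 4)
    targets = torch.randn(8, 2)
    outputs = model(inputs)             # builds a fresh graph for this iteration
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # no explicit deletion needed: the next iteration rebinds outputs and loss,
    # so the old graph becomes unreachable and is garbage collected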

For some reason, my forward pass (along with the custom gradient calculation) keeps computing the gradient correctly (after feeding the inputs forward for 5 timesteps); however, the backward pass works in the intended way for only one epoch - after that, the calculated gradient gets fed into the LSTM after presumably 10 timesteps, 15 timesteps and so on, causing the gradient to vanish in an instant.

Sorry to bother you, but is there some subtlety that I might not have considered here, especially with respect to feeding in custom gradients?

The code you posted reinitialises the hidden state for every frame. This basically means that the model is run for each frame independently. In other words the LSTM has no access to any data from previous frames.

Your code creates feature vectors for each frame, and then for each frame independently, feeds the features into an LSTM one at a time.

Why not just feed the LSTM all of the frames’ feature vectors at once, with each frame’s feature vector as a single timestep?

Hi again jpeg,

I don’t reset the hidden state after each frame - the part of my code below iterates through the frames.

for u in range(len(lstm_input)):
    hidden = lstm(lstm_input[u, :], hidden)
    mu  = hidden[0][0, -4:] # extract last elements of each hidden layer

What you proposed would indeed probably solve my problems; however, I can’t do it as described, since I have to access the last 4 coordinates of the hidden state after each timestep. If I feed in a PackedSequence of T features at once, all of the intermediate hidden states are lost. I tried my best to put it in pseudocode once again:

class LSTM_(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LSTM_, self).__init__()
        self.hidden_size = hidden_size
        self.input_size = input_size
        self.lstm = nn.LSTMCell(input_size, hidden_size)

    def forward(self, input, hidden):
        hidden = self.lstm(input, hidden)
        return hidden

    def initHidden(self):
        return (Variable(torch.zeros(1, self.hidden_size)).cuda(),
                Variable(torch.zeros(1, self.hidden_size)).cuda())
 
def train(input_lstm):
    """this function trains the LSTM for exactly clip_length frames, once"""
    hidden = lstm.initHidden()
    optimizer.zero_grad()
    
    # now, iterate over a clip
    mu_list_for_later = []
    for u in range(len(input_lstm)):
        input_ = Variable(input_lstm[u, :])
        hidden = lstm(input_, hidden)
        mu = hidden[0][0, -4:] # access some coordinates of the hidden state
        mu_list_for_later.append(mu.cpu().data.numpy())

    grad = function_of_the_intermediate_hidden_states(mu_list_for_later)
    
    print('gradient: ',grad) 
    grad = Variable(grad)
    mu.backward(grad)

    for f in lstm.parameters():
        print('grad sum is')
        print(np.sum(f.grad.cpu().data.numpy()))
    optimizer.step()

lstm = LSTM_(4096 + num_coords, hidden_size)

hidden = lstm.initHidden()

optimizer = optim.Adam(list(lstm.parameters()), lr = lr)

if __name__ == '__main__': 
    for u in range(args.epochs):
        for i, video in tqdm(enumerate(dataloader)):
            for j in range(clips_per_video):                
                input_lstm = build_feature_vector_using_a_CNN()
                train(input_lstm) # do this for many different feature vectors

The problem remains:
The line

print('gradient: ',grad) 

prints a large gradient that doesn’t change much (which is fine).
However, the gradient sums printed by

for f in lstm.parameters():
    print('grad sum is')
    print(np.sum(f.grad.cpu().data.numpy()))

decline exponentially towards zero.

I’m sure that I make a conceptual mistake here, but I fail to find it. Once again, many thanks for your answers, they already helped me a lot towards understanding how PyTorch works. Highly appreciated.

You might find it helpful to read the thread “Understanding output of lstm” and to play with a few toy examples.

You need to keep some of the “hidden state” of the LSTM, not the “cell state”, right?
Now, if you use nn.LSTM and feed it input of shape (timesteps, batches, features), then you get output of shape (timesteps, batches, hidden_size). The earlier timesteps of the hidden state are not discarded.

Use the batch_first=True option if you need input and output of shape (batches, timesteps, features).
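
A quick shape check (made-up sizes, just a sketch):

import torch
import torch.nn as nn

T, B, F, H = 5, 1, 16, 32
lstm = nn.LSTM(F, H)                        # default layout: (timesteps, batches, features)
output, (h_n, c_n) = lstm(torch.randn(T, B, F))
print(output.shape)     # (5, 1, 32) - the hidden state for every timestep
print(h_n.shape)        # (1, 1, 32) - only the last timestep's hidden state
print(c_n.shape)        # (1, 1, 32) - only the last timestep's cell state

lstm_bf = nn.LSTM(F, H, batch_first=True)   # layout becomes (batches, timesteps, features)
output_bf, _ = lstm_bf(torch.randn(B, T, F))
print(output_bf.shape)  # (1, 5, 32)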

So something like the following should be simpler than your code.
First some checks…

  • input_lstm is of shape (timesteps, features) right?

def train(input_lstm):
    """this function trains the LSTM for exactly clip_length frames, once"""
    hidden = None # nn.LSTM initialises the hidden state if you pass in None
    optimizer.zero_grad()
    
    # now, iterate over a clip
    input_lstm = input_lstm.unsqueeze(1) # add a batch dimension with size 1
    output, hidden = lstm(input_lstm, hidden)
    mu  = output[:,0,-num_coords:] # select [all timesteps, sample 0, the last num_coords features]
    mu_list_for_later = torch.unbind(mu.data) # split the tensor into a list along the time dimension (dim=0)

    grad = function_of_the_intermediate_hidden_states(mu_list_for_later)
    
    print('gradient: ',grad) 
    grad = Variable(grad)
    mu.backward(grad)

    for f in lstm.parameters():
        print('grad sum is')
        print(np.sum(f.grad.cpu().data.numpy()))
    optimizer.step()

lstm = nn.LSTM(4096 + num_coords, hidden_size)

optimizer = optim.Adam(list(lstm.parameters()), lr = lr)

if __name__ == '__main__': 
    for u in range(args.epochs):
        for i, video in tqdm(enumerate(dataloader)):
            for j in range(clips_per_video):                
                input_lstm = build_feature_vector_using_a_CNN()
                train(input_lstm) # do this for many different feature vectors

That won’t solve the problem of the vanishing gradients, but I have an idea for that too.
One thing that seems weird to me is that you are backpropagating from only part of the output of the LSTM, in which case many of the parameters of the LSTM won’t get gradients at all, and others may be over- or under-used.

Might I suggest running the output of the LSTM through a linear layer in order to squeeze hidden_size features into num_coords features? Something like this…

linear = nn.Linear(hidden_size, num_coords)
num_coords_output = linear(lstm_output)

If you give it input of shape (whatever, dimensions, you, like, hidden_size), it will give you output of shape (whatever, dimensions, you, like, num_coords).
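
For instance (made-up sizes, applying the layer to the full LSTM output of a clip):

import torch
import torch.nn as nn

hidden_size, num_coords, timesteps = 32, 4, 10
linear = nn.Linear(hidden_size, num_coords)
lstm_output = torch.randn(timesteps, 1, hidden_size)   # (timesteps, batch, hidden_size)
num_coords_output = linear(lstm_output)
print(num_coords_output.shape)   # torch.Size([10, 1, 4]) - last dim squeezed to num_coords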

I hope this opens up more avenues of exploration.

Yes, indeed.

Yes again.

After reading your link, I am wondering how smth’s answer in the thread “hidden states for all timesteps” fits in. I have already incorporated this suggestion and it seems to work well. Thanks.

I see your point regarding the addition of a linear layer to distribute the gradient updates in a better way. I will certainly try this tomorrow. Do you think this might be the cause of the vanishing gradients?
One more clue I’d like to add here is that decreasing the learning rate actually causes the gradients to vanish much more slowly.

I think the author wanted the “cell states” as well as the “hidden states”. nn.LSTM only returns the cell states for the last timestep.

The vocab is rather confusing. The “hidden states” are not truly hidden since they are output to the next layer as well as being used as input for the next timestep. The “cell states” are truly hidden, since they are used only by the LSTM as input for the next timestep.
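
If you really do need the cell state at every timestep, one option (just a sketch, made-up sizes) is to step an nn.LSTMCell manually and collect both states:

import torch
import torch.nn as nn

timesteps, input_size, hidden_size = 5, 16, 32
cell = nn.LSTMCell(input_size, hidden_size)
inputs = torch.randn(timesteps, input_size)

h = torch.zeros(1, hidden_size)
c = torch.zeros(1, hidden_size)
all_h, all_c = [], []
for t in range(timesteps):
    h, c = cell(inputs[t].unsqueeze(0), (h, c))   # one timestep at a time
    all_h.append(h)
    all_c.append(c)                               # the cell state is available at every step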

I see. That makes a lot of sense to me now.

I also tried your suggestion of using a linear layer to distribute the gradients over all parameters of the LSTM - this caused the gradients to vanish more slowly (kind of what was expected), but the error seems to be rooted somewhere else, as the general tendency of the gradient to fall exponentially remained. However, I would prefer to keep it as it was (the authors of the paper I’m re-implementing here, Deep Reinforcement Learning for Object Tracking, did it like that), although I will try and see if this stabilizes training once I get this gradient thing working.

Regarding that, I’m a bit puzzled by this behavior at this point. The sum of the LSTM gradients falling exponentially even though the hidden state is reset after each sequence seems pretty contradictory (this would obviously also happen in very easy tasks where we reach some optimum really quickly, but here we converge to more or less random states…).

Do you have any idea what other error this behavior might stem from? I will post again in this thread for sure once I find out what is happening here. Anyway, thanks for your help, jpeg.