Feeding a Custom Gradient into LSTMs

You might find it helpful to read Understanding output of lstm and to play with a few toy examples.

You need to keep some of the LSTM's “hidden state”, not its “cell state”, right?
Now, if you use nn.LSTM and feed it input of shape (timesteps, batches, features), then you get output of shape (timesteps, batches, hidden_size). The earlier timesteps of the hidden state are not discarded.

Use the batch_first=True option if you need input and output of shape (batches, timesteps, features).
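As a quick shape check (a minimal sketch; the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# default layout: (timesteps, batches, features)
lstm = nn.LSTM(input_size=8, hidden_size=16)
x = torch.randn(5, 3, 8)  # (timesteps=5, batches=3, features=8)
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([5, 3, 16]) -- one hidden state per timestep

# with batch_first=True: (batches, timesteps, features)
lstm_bf = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x_bf = torch.randn(3, 5, 8)
output_bf, _ = lstm_bf(x_bf)
print(output_bf.shape)  # torch.Size([3, 5, 16])
```

Note that the last dimension of the output is hidden_size, not the input feature size.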

So something like the following should be simpler than your code.
First some checks…

  • input_lstm is of shape (timesteps, features) right?
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm

def train(input_lstm):
    """Train the LSTM for exactly clip_length frames, once."""
    hidden = None # nn.LSTM initialises the hidden state if you pass in None
    optimizer.zero_grad()

    # now, iterate over a clip
    input_lstm = input_lstm.unsqueeze(1) # add a batch dimension with size 1
    output, hidden = lstm(input_lstm, hidden)
    mu = output[:,0,-num_coords:] # select [all timesteps, sample 0, the last num_coords features]
    mu_list_for_later = torch.unbind(mu.detach()) # split the tensor into a list along the time dimension (dim=0)

    grad = function_of_the_intermediate_hidden_states(mu_list_for_later)

    print('gradient: ', grad)
    mu.backward(grad) # feed the custom gradient in; no Variable wrapper needed

    # sum the gradients over the LSTM's parameters, as a sanity check
    grad_sum = sum(p.grad.sum().item() for p in lstm.parameters())
    print('grad sum is', grad_sum)
    optimizer.step()

lstm = nn.LSTM(4096 + num_coords, hidden_size)

optimizer = optim.Adam(list(lstm.parameters()), lr = lr)

if __name__ == '__main__': 
    for u in range(args.epochs):
        for i, video in tqdm(enumerate(dataloader)):
            for j in range(clips_per_video):                
                input_lstm = build_feature_vector_using_a_CNN()
                train(input_lstm) # do this for many different feature vectors
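The key step above is calling backward with an explicit gradient tensor instead of a scalar loss. Here is a minimal standalone illustration of that mechanism:

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2  # a non-scalar output

# the gradient of some external loss w.r.t. y, computed elsewhere
custom_grad = torch.tensor([0.1, 1.0, 10.0])
y.backward(custom_grad)  # backpropagate the custom gradient through the graph

print(x.grad)  # tensor([0.2000, 2.0000, 20.0000]), i.e. 2 * custom_grad
```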

That won’t solve the problem of the vanishing gradients, but I have an idea for that too.
One thing that seems weird to me is that you are backpropagating from only part of the output of the LSTM, in which case many of the parameters of the LSTM won’t get gradients at all, and others may be over- or under-used.

Might I suggest running the output of the LSTM through a linear layer, to squeeze hidden_size features down to num_coords features? Something like this…

linear = nn.Linear(hidden_size, num_coords)
num_coords_output = linear(lstm_output)

If you give it input of shape (whatever, dimensions, you, like, hidden_size), then it will give you output of shape (whatever, dimensions, you, like, num_coords).
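For instance, nn.Linear applies to the last dimension only and leaves the leading dimensions alone (sizes here are arbitrary):

```python
import torch
import torch.nn as nn

hidden_size, num_coords = 16, 2
linear = nn.Linear(hidden_size, num_coords)

# e.g. the full LSTM output: (timesteps, batch, hidden_size)
lstm_output = torch.randn(5, 1, hidden_size)
out = linear(lstm_output)
print(out.shape)  # torch.Size([5, 1, 2])
```

This way every hidden unit contributes to the coordinates you backpropagate through, rather than just the last num_coords of them.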

I hope this opens up more avenues of exploration.