Always getting an error while backpropagating

I’ve been stuck on a problem for a while.

A little background on the question:

I pass a few images through a pretrained CNN to get features, and then pass those extracted features into an LSTM to get a score for every image. But I’m stuck on training the LSTM with those features; I always get an error while backpropagating.

The error says: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

I tried many of the solutions given on this forum, and on Stack Overflow as well, such as passing retain_graph=True, retain_variables=True, etc., but none of them worked.

Then I tried something else: I JUST REPLACED my CNN feature output with randomly initialised features like this:
inps = [Variable(torch.randn(10, 1000)) for _ in range(10)]
This inps has exactly the same dimensions as my deep features, and the code below works fine with it, but NOT with frames_deep_features.

Can somebody please help me out with this? I really need to get it done in 2-3 days.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable

hidden_size = 32
input_size = 1000
M = 100
seq_len = 10
inps = [Variable(torch.randn(10, input_size)) for _ in range(10)]
num_layers = 2
sigma = 0.3

def my_loss(scores):
    return (scores/M) - sigma

class tryLstm(nn.Module):
    def __init__(self):
        super(tryLstm, self).__init__()
        self.rnn = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)
        self.linear = nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()
        self.hidden = self.init_hidden()
    
    def init_hidden(self):
        return (Variable(torch.zeros(num_layers, 1, hidden_size)), Variable(torch.zeros(num_layers, 1, hidden_size)))
    
    def forward(self, inp):
        out, self.hidden = self.rnn(inp.view(seq_len, 1, -1), self.hidden)
        score = self.linear(out)
        score = self.sigmoid(score)
        return score
    
model = tryLstm()

def print_grad(g):
    print(g)
    
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    for i in frames_deep_features:
        model.zero_grad()
        model.hidden = model.init_hidden()
        scores = model(i)
        scores = scores.sum()
#         print scores
        loss = my_loss(scores)
#         print loss
        if epoch % 10 == 0:
            print("Loss for epoch {} is = {}".format(epoch, loss.data[0]))
#         loss.register_hook(print_grad)
        loss.backward(retain_graph=True)
#         print loss.grad
        optimizer.step()

I think you should do the following:

input_image = get_one_batch_of_input_images()
# Since you won't backpropagate through the CNN, use volatile:
input_image_var = Variable(input_image, volatile=True)
feats = cnn(input_image_var)

# Now detach the features from their history, as it is not needed anymore:
feats.detach_()
# Make the features non-volatile, as you will backpropagate through the lstm:
feats.volatile = False

# Forward the lstm
scores = your_lstm(feats)

# compute your loss
# zero grad in your_lstm
# backward your loss with loss.backward() (do not use retain_graph=True)
# step the optimizer
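
Concretely, with the model, my_loss, and optimizer from the code above, a sketch of those last steps could look like this (assuming feats is one (seq_len, input_size) feature tensor, which is what your model expects):

model.hidden = model.init_hidden()   # reset the hidden state, as in your loop
scores = model(feats)                # feats: detached features for one sequence
loss = my_loss(scores.sum())

model.zero_grad()                    # zero the lstm gradients
loss.backward()                      # plain backward, no retain_graph=True needed
optimizer.step()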

Hey, thanks for the fast reply

Your method worked for multiple batches as well. Now can you please explain it a little to me: why did it work now?

The idea is to keep in mind that when you forward elements into a network, the output has a history of all the operations that were performed (to be able to compute gradients).
In your case, the features contained all the history of what happened during the cnn forward.
Then when you use these in the lstm, the output of the lstm contains the history of what happened in the lstm and what happened in the cnn. In some sense, the history of the lstm output contains the history of the features.
When you backpropagate, the whole history is traversed, gradients are computed for the Variables that require grad, and the intermediary buffers are freed to reduce memory usage.
So if you use the features multiple times, this same history will be shared by multiple outputs of the lstm; the first one, when backpropagating, will traverse this history and free it. But the second one will try to traverse this common part of the history as well, which is not possible because it has already been freed.
What the detach() function does is detach a Variable from its history, so backpropagation will stop at that Variable. When you use it, the lstm outputs will not have a shared history, hence no problem.
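
To make that concrete, here is a tiny self-contained sketch (using stand-in Linear layers instead of your CNN and LSTM, and the same pre-0.4 Variable API used in this thread) that shows both the shared-history failure and the detach fix:

import torch
import torch.nn as nn
from torch.autograd import Variable

cnn_like = nn.Linear(3, 4)     # stand-in for the pretrained CNN
lstm_like = nn.Linear(4, 1)    # stand-in for the LSTM scorer

x = Variable(torch.randn(5, 3))
feats = cnn_like(x)            # feats carries the history of the cnn forward

out1 = lstm_like(feats).sum()  # this output's history includes the cnn history
out2 = lstm_like(feats).sum()  # so does this one: the cnn part is shared

out1.backward()                # traverses the shared cnn history and frees its buffers
try:
    out2.backward()            # tries to traverse the already-freed cnn part
except RuntimeError as e:
    print(e)                   # the "backward through the graph a second time" error

feats = cnn_like(x).detach()   # detach: backpropagation will stop at feats
out3 = lstm_like(feats).sum()
out4 = lstm_like(feats).sum()
out3.backward()                # each output now has its own small history,
out4.backward()                # so a second backward is fine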

Does that make sense?

Thanks a lot for explaining!

Here is what I got from that:

For example, if I have a Variable a, perform some operations on it like addition, subtraction, etc., and then pass it through a model and backpropagate, those additions and subtractions are still in a's history and backpropagation will try to go back to that point. So if I don't want those operations to be considered during backpropagation, I just have to detach the variable at the last point I care about.

Is that right?

Yes, that’s the idea.
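
In code, a tiny (hypothetical) example of that would be:

import torch
import torch.nn as nn
from torch.autograd import Variable

w = nn.Parameter(torch.ones(3))

# Without detach: the add and mul stay in the history, so a receives gradients
a = Variable(torch.randn(3), requires_grad=True)
b = (a + 1) * 2
(b * w).sum().backward()
print(a.grad)                  # filled in: backward traversed the mul and the add

# With detach: backward stops at b, so the add and mul are never traversed
a2 = Variable(torch.randn(3), requires_grad=True)
b2 = ((a2 + 1) * 2).detach()
(b2 * w).sum().backward()
print(a2.grad)                 # None: no gradient ever reached a2
print(w.grad)                  # w still received gradients from both backward calls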